✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Overview

Python framework to explore, label, and monitor data for NLP

Usage example · Get started · Quick links · Docs

Rubrix.mp4

Example: Named Entity Recognition data exploration and annotation with spaCy and the IMDB dataset

What is Rubrix?

Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Key features:

  • Open: Rubrix is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Rubrix is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Rubrix optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

Example

Let's see Rubrix in action with a quick example: Bootstrapping data annotation with a zero-shot classifier

Why:

  • The availability of pre-trained language models with zero-shot capabilities means you can sometimes accelerate your data annotation tasks by pre-annotating your corpus with a pre-trained zero-shot model.
  • The same workflow can be applied if there is a pre-trained "supervised" model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Ingredients:

  • A zero-shot classifier from the 🤗 Hub: typeform/distilbert-base-uncased-mnli
  • A dataset containing news
  • A set of target categories: Business, Sports, etc.

What are we going to do:

  1. Make predictions and log them into a Rubrix dataset.
  2. Use the Rubrix web app to explore, filter, and annotate some examples.
  3. Load the annotated examples and create a training set, which you can then use to train a supervised classifier.

1. Predict and log

Let's load the zero-shot pipeline and the dataset (we are using the AGNews dataset for demonstration, but this could be your own dataset). Then, let's go over the dataset records and log them using rb.log(). This will create a Rubrix dataset, accessible from the web app.

from transformers import pipeline
from datasets import load_dataset
import rubrix as rb

model = pipeline('zero-shot-classification', model="typeform/distilbert-base-uncased-mnli")

dataset = load_dataset("ag_news", split='test[0:100]')

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

for item in dataset:
    prediction = model(item['text'], labels)

    record = rb.TextClassificationRecord(
        inputs=item["text"],
        prediction=list(zip(prediction['labels'], prediction['scores']))
    )

    rb.log(record, name="news_zeroshot")
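
To make the prediction argument concrete, here is a minimal sketch that mimics the shape of one zero-shot pipeline result and shows what list(zip(...)) produces. No model is run: the scores are made-up illustrative values, and the helper name to_prediction is hypothetical.

```python
# Illustrative stand-in for one zero-shot pipeline result.
# The scores are invented for this sketch; a real pipeline returns its own values.
example_output = {
    "sequence": "Stocks rallied after the earnings report.",
    "labels": ["Business", "World", "Sci/Tech", "Sports"],
    "scores": [0.81, 0.10, 0.06, 0.03],
}


def to_prediction(output):
    """Pair each label with its score, as done inline in the loop above."""
    return list(zip(output["labels"], output["scores"]))


prediction = to_prediction(example_output)

# The most likely label is the pair with the highest score.
top_label, top_score = max(prediction, key=lambda pair: pair[1])
print(prediction)
print(top_label, top_score)
```

Each record therefore carries every candidate label with its score, which is what makes filtering by predicted label and score possible in the web app.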

2. Explore, Filter and Label

Now let's access our Rubrix dataset and start annotating data. Let's filter the records predicted as Business with high probability and use the bulk-labeling feature for labeling 15 records as Business:

Zeroshot.Example.mp4

3. Load and create a training set

After a few iterations of data annotation, we can load the Rubrix dataset and create a training set to train or fine-tune a supervised model.

import pandas as pd
import rubrix as rb

# load the Rubrix dataset as a pandas DataFrame
rb_df = rb.load(name='news_zeroshot')

# filter annotated records
rb_df = rb_df[rb_df.status == "Validated"]

# select text input and the annotated label
train_df = pd.DataFrame({
    "text": rb_df.inputs.transform(lambda r: r["text"]),
    "label": rb_df.annotation,
})

Architecture

Rubrix's main components are:

  • Rubrix Python client: Python client to log, load, copy and delete Rubrix datasets.
  • Rubrix server: FastAPI REST service for reading and writing data.
  • Elasticsearch: The storage layer and search engine powering the API and the web app.
  • Rubrix web app: Easy-to-use web application for data exploration and annotation.

Quick links

  • 🚶 First steps: New to Rubrix and want to get started?
  • 👩‍🏫 Concepts: Want to know more about Rubrix concepts?
  • 🛠️ Setup and install: How to configure and install Rubrix
  • 🗒️ Tasks: What can you use Rubrix for?
  • 📱 Web app reference: How to use the web app for data exploration and annotation
  • 🐍 Python client API: How to use the Python classes and methods
  • 👩‍🍳 Rubrix cookbook: How to use Rubrix with your favourite libraries (flair, stanza...)
  • 👋 Community forum: Ask questions, share feedback, ideas and suggestions
  • 🤗 Hugging Face tutorial: Using Rubrix with 🤗 transformers and datasets
  • 💫 spaCy tutorial: Using spaCy with Rubrix for NER projects
  • 🐠 Weak supervision tutorial: How to leverage weak supervision with snorkel & Rubrix
  • 🤔 Active learning tutorial: How to use active learning with modAL & Rubrix
  • 🧪 Knowledge graph tutorial: How to use Rubrix with kglab & pytorch_geometric

Get started

To get started you need to follow three steps:

  1. Install the Python client
  2. Launch the web app
  3. Start logging data

1. Install the Python client

You can install the Python client with pip:

pip install rubrix

2. Launch the web app

There are two ways to launch the web app:

  • a) Using docker-compose (recommended).
  • b) Executing the server code manually.

a) Using docker-compose (recommended)

Create a folder:

mkdir rubrix && cd rubrix

and launch the docker-contained web app with the following command:

wget -O docker-compose.yml https://git.io/rb-docker && docker-compose up

This is the recommended way because it automatically includes an Elasticsearch instance, Rubrix's main persistence layer.

b) Executing the server code manually

When executing the server code manually you need to provide an Elasticsearch instance yourself.

  1. First, install Elasticsearch (we recommend version 7.10) and launch an Elasticsearch instance. For macOS and Windows there are Homebrew formulae and an MSI package, respectively.
  2. Install the Python client together with its server dependencies:
pip install rubrix[server]
  3. Launch a local instance of the web app:
python -m rubrix.server

By default, the Rubrix server looks for your Elasticsearch endpoint at http://localhost:9200. You can change this by setting the ELASTICSEARCH environment variable.
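
The lookup described above follows the usual "environment variable with a default" pattern. The sketch below only illustrates that pattern; it is not Rubrix's server code, and the function name resolve_elasticsearch_url and the host es.internal are hypothetical.

```python
import os

# Default endpoint documented above.
DEFAULT_ELASTICSEARCH = "http://localhost:9200"


def resolve_elasticsearch_url(environ=None):
    """Return the endpoint from ELASTICSEARCH, or the documented default."""
    env = os.environ if environ is None else environ
    return env.get("ELASTICSEARCH", DEFAULT_ELASTICSEARCH)


# Without the variable, the default endpoint is used.
print(resolve_elasticsearch_url({}))

# With the variable set, its value takes precedence.
print(resolve_elasticsearch_url({"ELASTICSEARCH": "http://es.internal:9200"}))
```

In practice, the variable has to be set in the environment of the process that runs python -m rubrix.server.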

3. Start logging data

The following code will log one record into a dataset called example-dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(inputs="My first Rubrix example"),
    name='example-dataset'
)

If you go to your Rubrix web app at http://localhost:6900/, you should see your first dataset. The default username and password are rubrix and 1234. You can also check the REST API docs at http://localhost:6900/api/docs.

Congratulations! You are ready to start working with Rubrix.

To better understand what's possible, take a look at Rubrix's Cookbook.

Community

As a new open-source project, we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to use the Discussion forum or the Issues and we'll be pleased to help out.

Comments
  • Add monitoring examples with FastAPI: Hugging Face and spaCy

    The idea would be to add a guide (as a Jupyter Notebook) to be included under docs/guides. This notebook will showcase the RubrixLogHTTPMiddleware for monitoring the predictions of a FastAPI inference endpoint. Here is the example with Hugging Face + FastAPI:

    from fastapi import FastAPI
    from typing import List
    from transformers import pipeline
    from rubrix.client.asgi import RubrixLogHTTPMiddleware
    
    classifier = pipeline("sentiment-analysis", return_all_scores=True)
    
    app = FastAPI()
    
    # define the middleware for logging predictions into a Rubrix Dataset
    app.add_middleware(
        RubrixLogHTTPMiddleware,
        api_endpoint="/predict",
        dataset="monitoring_dataset_v1",
        # you could post-process the predict output with a custom record_mapper function
        # record_mapper=custom_text_classification_mapper,
    )
    
    # prediction endpoint
    @app.post("/predict")
    def predict_batch(batch: List[str]):
        predictions = classifier(batch)
        return [
            {
                "labels": [p["label"] for p in prediction],
                "probabilities": [p["score"] for p in prediction],
            }
            for prediction in predictions
        ]
    

    The steps would be to:

    1. Create a notebook and include the above example
    2. Add an example with a pre-trained transformer TokenClassifier (for example: https://huggingface.co/dslim/bert-base-NER)
    3. Add an example with a spaCy NER pipeline.
    4. (Optionally) Include an example dashboard with Kibana (screenshots, gif or video)
    5. (Optionally) Include an example with ray serve
    documentation good first issue help wanted 
    opened by dvsrepo 19
  • updated readme with `conda` install instruction

    This closes #781.

    • [x] added conda installation instruction (rubrix is available on conda-forge channel)
    • [x] added badges:
      • [x] conda-forge/rubrix (with version)
      • [x] conda-forge/rubrix (with platform specification): example -- "noarch"
      • [x] docs badge
    opened by sugatoray 14
  • [NER Fine tuning] content selection

    Multi word

    Current state (see screenshot): 1. I select several words; the highlight is grey and forms a solid block (one highlight for all words). 2. When the selection is done, the highlight is split and the label selector appears.

    • [x] Should be:

    1. I select several words; the highlight is grey and split (one highlight per word). 2. When the selection is done, the highlight becomes a solid block and the label selector appears.

    Delete labelling

    • [x] Make the whole tooltip clickable to delete

    Selection on a searched word

    • [x] Selection highlight should not be cut (SS)
    • [x] When the selection contains a searched word, the label selector does not appear (it currently works only when selecting from right to left)
    • [x] In general : change appearance of results : in place of Orange highlight show text in bold

    Cursor

    • [x] Active "hand" cursor (pointer) on piece of text already annotated/Predicted
    • [x] Active "Text Select" cursor on the rest of record
    • [x] Enlarge the hover state to the whole area: (record + annotated tooltip + empty space between them) (record + predicted tooltip + empty space between them)

    New Select label modal

    • [x] Integrate new UI modal
    • [x] If there is only one label, don't show the modal; just assign the label after the text is selected
    • [x] Add logic to show the last label used first and preselect it
    • [x] Add the following keyboard shortcuts: Enter to confirm the preselected label, and the up/down arrow keys or number keys to select other labels
    opened by Amelie-V 13
  • Add text2text example (e.g., text summarisation)

    Add a text summarisation fine-tuning tutorial similar to the sentiment classifier fine-tuning tutorial:

    https://rubrix.readthedocs.io/en/stable/tutorials/06-labeling-finetuning.html#3.-Fine-tune-the-pre-trained-mode

    documentation good first issue help wanted 
    opened by frascuchon 13
  • fix: Compute predicted properly for token classification [NEEDS_DATA_UPGRADE]

    This PR fixes the way predicted ok/ko info is computed for token classification records.

    To apply this fix to already created datasets, you must first re-log records. Otherwise, stored info won't be updated.

    Closes #1955

    opened by frascuchon 12
  • [Workspaces] Users without personal datasets

    Should users who have no personal datasets, but who belong to one or more workspaces that do have datasets, be switched automatically to one of those workspaces?

    Or would it be better to show all datasets from all workspaces in the datasets list, with the option to filter by workspace?

    question app 
    opened by frascuchon 11
  • [Text Class] Optimize Long records view *Prioritary*

    • [x] Show labels buttons area above the fold.

    • [x] Create Action to open/close on click the full record in the same view

    • [x] Copy "Show full record" "Show less"

    • [ ] I would take the opportunity to change "View more" / "View less" in the Metrics modal to "Show more" / "Show less" and apply the same style there

    enhancement 
    opened by Amelie-V 11
  • [Search] Improve and normalize the search data model

    Things to keep in mind:

    • Normalize text inputs fields: text, inputs, words must be normalized and use a common pattern for all tasks
    • Several es analyzers for text fields: standard and whitespace(?) for fine tuning searches. Default as standard
    • What about text fields in metadata? For now, only terms queries are supported. This means that metadata fields with large content cannot be queried with full-text search.
    • Created indices should contain mapping info only for its fields. A text classification index should not include mapping info for tokens or text predicted (text2text).
    • Review filter fields and align with UI names (if any)
    • What about nested fields, such as token or metrics info for token classification, or a label and its score for text classification? By default, the query string DSL does not support nested queries, but it would be nice to include some minimal support for that kind of query.

    @dvsrepo @dcfidalgo Anything to include here?

    Tasks

    To get this work done, we need to tackle the following tasks (which will be created as separate issues and linked here):

    1. [Datasets] Avoid using global template for all indices
    2. [Datasets] Dataset migration mechanisms for each release
    3. [Datasets] New es document model per task with backward compatibility fields
    4. [Datasets] Apply migration to new es doc model
    5. [Datasets] Build searches and aggregations using new doc model
    enhancement server 
    opened by frascuchon 11
  • Devise workflow to test the tutorials via a github action

    The idea here is to devise a workflow to test our tutorials in a semi-automatic way. Ideally, we have a workflow that we can launch manually, say every two weeks or so, to test our tutorials. Maybe we can use nbmake for this and follow this blogpost. The tricky part is that for some tutorials we need to change, add, or delete a few cells to be able to run them in an automated way ...

    documentation good first issue help wanted 
    opened by dcfidalgo 10
  • [Weak supervision] Rules numbers by label

    For instance:

    Sci/tech 2 Sports 1 Business 4 Politics 0 World 0

    This feature could be used for two things:

    • Help track how the rule definition is progressing
    • See the full label list (in "define rules" we don't have this list by default)
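
A minimal sketch of how such a per-label summary could be computed, including labels that have no rules yet. The rule queries are hypothetical and were chosen only to reproduce the counts in the example above; this is not Rubrix code.

```python
from collections import Counter

# Hypothetical (query, label) rules, chosen to match the example counts above.
rules = [
    ("nasa OR space", "Sci/tech"),
    ("software", "Sci/tech"),
    ("football", "Sports"),
    ("stocks", "Business"),
    ("shares", "Business"),
    ("profit", "Business"),
    ("merger", "Business"),
]
all_labels = ["Sci/tech", "Sports", "Business", "Politics", "World"]


def rules_per_label(rules, labels):
    """Count rules for every label, keeping labels that have no rules (count 0)."""
    counts = Counter(label for _, label in rules)
    return {label: counts[label] for label in labels}


summary = rules_per_label(rules, all_labels)
print(" ".join(f"{label} {count}" for label, count in summary.items()))
```

Iterating over the full label list rather than over the rules is what makes the zero-count labels visible.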
    ui 
    opened by Amelie-V 9
  • Any plan to support no-whitespace language?

    I am planning to use Rubrix for Japanese text data. The search functionality doesn't seem to work well for this language. I think it would be better if we could customize the tokenizer used in Elasticsearch instead of relying on the hardcoded "whitespace" tokenizer.
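
The problem can be illustrated without Elasticsearch. In the sketch below, Python's str.split() is only a stand-in for a whitespace tokenizer, and the sample sentences are made up: an English sentence splits into words, while a Japanese sentence without spaces stays a single token, so a search for one word inside it cannot match on token boundaries.

```python
english = "The search works well here"
japanese = "私は東京に住んでいます"  # "I live in Tokyo"


def whitespace_tokenize(text):
    """Split on whitespace only, as a simple stand-in for a whitespace tokenizer."""
    return text.split()


english_tokens = whitespace_tokenize(english)
japanese_tokens = whitespace_tokenize(japanese)

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```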

    opened by faisalron 9
  • use a default vector for vector search like `TF-IDF`

    Is your feature request related to a problem? Please describe.

    I do not want to set up anything for vector search, but I do want to use it.

    Describe the solution you'd like

    I would like to see a very straightforward, model-agnostic way of using the feature without any specific implementation, for example DatasetSettings.vectorsearch_tf_idf = True.

    Describe alternatives you've considered

    N.A.

    Additional context

    N.A.
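
For context, here is a minimal, dependency-free sketch of what a TF-IDF default vector could look like. It is an illustration of the technique only, not a proposed implementation: the function names tfidf_vectors and cosine, the smoothed IDF formula, and the sample texts are all assumptions of this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with one dense TF-IDF vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency; every weight is at least 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Term frequency (relative count) times IDF, in vocabulary order.
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


texts = [
    "cheap flights to paris",
    "flights to paris on sale",
    "football match tonight",
]
vocabulary, vectors = tfidf_vectors(texts)
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
print(similar > unrelated)
```

Texts that share terms end up closer together than texts with no terms in common, which is the property a default vector search would rely on.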

    enhancement 
    opened by davidberenstein1957 0
  • chore(deps-dev): update fastapi requirement from <0.89,>=0.75 to >=0.75,<0.90

    Updates the requirements on fastapi to permit the latest version.

    Release notes

    Sourced from fastapi's releases.

    0.89.0

    Features

    • ✨ Add support for function return type annotations to declare the response_model. Initial PR #1436 by @​uriyyo.

    Now you can declare the return type / response_model in the function return type annotation:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Item(BaseModel):
        name: str
        price: float

    @app.get("/items/")
    async def read_items() -> list[Item]:
        return [
            Item(name="Portal Gun", price=42.0),
            Item(name="Plumbus", price=32.0),
        ]

    FastAPI will use the return type annotation to perform:

    • Data validation
    • Automatic documentation
      • It could power automatic client generators
    • Data filtering

    Before this version it was only supported via the response_model parameter.

    Read more about it in the new docs: Response Model - Return Type.

    Docs

    Translations

    • 🌐 Add Russian translation for docs/ru/docs/fastapi-people.md. PR #5577 by @​Xewus.
    • 🌐 Fix typo in Chinese translation for docs/zh/docs/benchmarks.md. PR #4269 by @​15027668g.
    • 🌐 Add Korean translation for docs/tutorial/cors.md. PR #3764 by @​NinaHwang.

    ... (truncated)

    Commits
    • 69bd7d8 🔖 Release version 0.89.0
    • a6af7c2 📝 Update release notes
    • aa6a8e5 📝 Update release notes
    • c482dd3 ⬆ Update coverage[toml] requirement from <7.0,>=6.5.0 to >=6.5.0,<8.0 (#5801)
    • 681e5c0 📝 Update release notes
    • eb39b0f 📝 Update release notes
    • 27ce2e2 📝 Add External Link: Authorization on FastAPI with Casbin (#5712)
    • f56b0d5 ⬆ Update uvicorn[standard] requirement from <0.19.0,>=0.12.0 to >=0.12.0,<0.2...
    • 5c6d7b2 📝 Update release notes
    • 78813a5 ✏ Fix typo in docs/en/docs/async.md (#5785)
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    dependencies 
    opened by dependabot[bot] 0
  • add repr method for Rule, Dataset.

    Description

    Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Closes #2046

    Type of change

    Please delete options that are not relevant.

    • [ ] New feature (non-breaking change which adds functionality)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    import argilla as rg
    from argilla.labeling.text_classification.rule import Rule

    plz = Rule(query="plz OR please", label="SPAM")
    print(repr(plz))
    # Rule(query='plz OR please', label='SPAM', name='plz OR please')

    records = [
        rg.TextClassificationRecord(text="example"),
        rg.TextClassificationRecord(text="another example"),
        rg.TextClassificationRecord(text="another example another example another example another example another example another example"),
    ]
    dataset = rg.DatasetForTextClassification(records=records)
    print(dataset)
    #     text                            annotation  prediction
    # 0   example                         None        None
    # 1   another example                 None        None
    # 2   another example another exampl  None        None
    # ...
    # 3 TextClassificationRecord records
    

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [x] I added comments to my code
    • [x] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [x] I have added tests that prove my fix is effective or that my feature works
    opened by Ankush-Chander 1
  • feat(Client): RecordTextClassification pass only the necessary data instead of all the dataset

    Description

    ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the RecordTextClassification component

    WARNING: merge only after #2145 and #2143 have been merged

    Closes #(issue_number)

    Type of change

    Please delete options that are not relevant.

    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    • [x] multilabel
    • [x] singlelabel

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [ ] I added comments to my code
    • [ ] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [ ] I have added tests that prove my fix is effective or that my feature works
    opened by keithCuniah 0
  • feat(Client): ClassifierExplorationArea.vue pass only the necessary data

    Description

    ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the ClassifierExplorationArea.vue

    Type of change

    Please delete options that are not relevant.

    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    • [x] multilabel
    • [x] singlelabel

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [ ] I added comments to my code
    • [ ] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [ ] I have added tests that prove my fix is effective or that my feature works
    opened by keithCuniah 0
Releases(v1.1.1)
  • v1.1.1(Nov 29, 2022)

  • v1.1.0(Nov 24, 2022)

    1.1.0 (2022-11-24)

    Highlights

    Add, update, and delete rules from a Dataset using the Python client

    You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

    import pandas as pd
    from argilla.labeling.text_classification import Rule, add_rules

    # Read a file with keywords or phrases
    labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")
    
    # Create rules
    predefined_labeling_rules = []
    for index, row in labeling_rules_df.iterrows():
        predefined_labeling_rules.append(
            Rule(row["query"], row["label"])
        )
    
    # Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
    add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)
    

    You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels

    Sort by timestamp fields in the UI

    Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes.

    Features

    • #1929 add warning about using wrong hostnames (#1930) (a3bc554)
    • Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855
    • Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918
    • Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835
    • Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)
    • Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761
    • Update error page (#1932) (caeb7d4), closes #1894
    • Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

    Bug Fixes

    Documentation

    As always, thanks to our amazing contributors!

    • docs: Link key features (#1805) (#1809) by @chschroeder
    • View Docs link in frontend header users.vue (#1915) by @bengsoon
    • fix: Change method for Doc creation by spacy.Language (#1891) by @jamnicki
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Nov 4, 2022)

  • v0.19.0(Oct 24, 2022)

  • v0.18.0(Oct 5, 2022)

    0.18.0 (2022-10-05)

    ⚡ Highlights

    Better validation of token classification records

    When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens. Before this release, it was difficult to understand and fix these errors because validation happened on the server side.

    With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.

    For example, the following record:

    import rubrix as rb
    
    rb.TokenClassificationRecord(
        tokens=["I", "love", "Paris"],
        text="I love Paris!",
        prediction=[("LOC",7,13)]
    )
    

    Will give you the following error message:

    ValueError: Following entity spans are not aligned with provided tokenization
    Spans:
    - [Paris!] defined in ...love Paris!
    Tokens:
    ['I', 'love', 'Paris']
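
To see why the span (7, 13) is rejected, the sketch below computes the character offsets of each token and checks whether a span starts and ends on token boundaries. This is a simplified illustration of the idea, not Rubrix's validation code; the helper names token_offsets and is_aligned are hypothetical.

```python
def token_offsets(text, tokens):
    """Locate each token in text and return its (start, end) character offsets."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def is_aligned(span, offsets):
    """A span is aligned when it starts and ends exactly on token boundaries."""
    start, end = span
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "I love Paris!"
tokens = ["I", "love", "Paris"]
offsets = token_offsets(text, tokens)

print(offsets)
print(is_aligned((7, 13), offsets))  # span covers "Paris!", which is not a token
print(is_aligned((7, 12), offsets))  # span covers "Paris"
```

The token "Paris" ends at offset 12, so a span ending at 13 includes the "!" and cannot be mapped onto the provided tokens.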
    

    Delete records by query

    Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for clean up and better dataset maintenance:

    import rubrix as rb
    
    ## Delete by id
    rb.delete_records(name="example-dataset", ids=[1,3,5])
    
    ## Discard records by query
    rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)
    

    New tutorials

    We have two new tutorials!

    Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

    Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html

    Features

    Bug Fixes

    Visual enhancements

    Documentation

    • Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729
    • Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741
    • fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693
    • raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693
    • Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)
    • using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

    As always, thanks to our amazing contributors!

    • refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander
    • feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander
    • fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo
    • docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder
    • refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia
    • docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen
    • refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen
    Source code(tar.gz)
    Source code(zip)
  • v0.17.0(Aug 22, 2022)

    0.17.0 (2022-08-22)

    ⚡ Highlights

    Preparing a training set in the spaCy DocBin format

    prepare_for_training is a method that prepares a dataset for training. Previously, it only prepared data for training Hugging Face Transformers models.

    Now you can also prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!

    With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.

    import spacy
    import rubrix as rb
    
    # Load the annotated dataset from Rubrix
    rb_dataset = rb.load("ner_dataset")
    
    # Load a blank spaCy language model to create the DocBin, as it works faster
    nlp = spacy.blank("en")
    
    # After this line, the file will be stored on disk
    rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")
    

    You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

    Load large datasets using batches

    Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

    Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, you can iterate over the whole dataset in batches using the id_from parameter of the rb.load method.

    An example of reading the first 1000 records and the next batch of up to 1000 records:

    import rubrix as rb
    dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
    dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
    

    The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
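
    The two-batch example above generalizes to a loop. The following is a minimal sketch, not part of the Rubrix API: iter_batches and fake_load are hypothetical names, and the sketch assumes that the loader returns an empty batch once no records remain after the given id. fake_load stands in for rb.load so that the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Callable, Iterator, Optional, Sequence


def iter_batches(
    load_batch: Callable[..., Sequence], name: str, batch_size: int = 1000
) -> Iterator[Sequence]:
    """Yield successive batches until the loader returns an empty one."""
    last_id: Optional[object] = None
    while True:
        kwargs = {"name": name, "limit": batch_size}
        if last_id is not None:
            # Continue after the last record of the previous batch
            kwargs["id_from"] = last_id
        batch = load_batch(**kwargs)
        if len(batch) == 0:
            break
        yield batch
        last_id = batch[-1].id


# Hypothetical in-memory stand-in for rb.load, used only for this demo
_RECORDS = [SimpleNamespace(id=i) for i in range(2500)]


def fake_load(name, limit, id_from=None):
    start = 0 if id_from is None else id_from + 1
    return _RECORDS[start:start + limit]


sizes = [len(batch) for batch in iter_batches(fake_load, "example-dataset")]
print(sizes)  # [1000, 1000, 500]
```

    With a running server, passing rb.load instead of fake_load should behave the same way, provided the assumption about the empty final batch holds.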

    Larger pagination sizes for faster bulk review and annotation

    When using filters and search for data annotation and review, some users can filter and quickly review dozens of records in one go. To serve those users, it's now possible to view and bulk-annotate 50 or 100 records per page.

    Screenshot 2022-08-25 at 10 33 58

    Copy record text to clipboard

    Sometimes it is useful to copy the text of a record to inspect it or process it with another application. This is now possible thanks to the feature request by our great community member and contributor @Ankush-Chander!

    Screenshot 2022-08-25 at 10 38 19

    Better error logging for generic errors

    Thanks to the work done by @Ankush-Chander and @frascuchon, we now have more meaningful messages for generic server errors!

    Features

    • Add new pagination size ranges (#1667) (5b4f1f2), closes #1578
    • Allow rb.load fetch records in batches passing the from_id argument (3e6344a)
    • Copy to clipboard the record text (#1625) (d634a7b), closes #1616
    • Error Logging: send error detail in response for generic server errors (#1648) (ad17631)
    • Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622
    • prepare_for_training supports spacy (#1635) (8587808)

    Bug Fixes

    Documentation

    Visual enhancements

    You can see all work included in the release here

    • fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre
    • chore: configure miniconda for readthedocs builder by @frascuchon
    • style: Small visual adjustments for Text2Text record card (#1632) by @leiyre
    • feat: Copy to clipboard the record text (#1625) by @leiyre
    • docs: Add Slack support link in README's get started (#1688) by @dvsrepo
    • chore: update version by @frascuchon
    • feat: Add new pagination size ranges (#1667) by @leiyre
    • fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander
    • feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras
    • fix(Client): reusing the inner httpx client (#1640) by @frascuchon
    • feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon
    • docs: spacy DocBin cookbook (#1642) by @ignacioct
    • feat: prepare_for_training supports spacy (#1635) by @frascuchon
    • style: Improve card spacing (#1638) by @leiyre
    • docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras
    • chore: remove old rubrix client class (#1639) by @frascuchon
    • feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon
    • doc: show metric graphs in documentation (#1669) by @leiyre
    • fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon
    • fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre
  • v0.16.1(Jul 22, 2022)

    0.16.1 (2022-07-22)

    Bug Fixes

    • 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631
    • Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623
    • Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

    Documentation

    You can see all work included in the release here

    • fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo
    • fix: Display metadata in Text2Text dataset (#1626) by @leiyre
    • chore: set version by @dcfidalgo
    • docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo
    • fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre
  • v0.16.0(Jul 8, 2022)

    0.16.0 (2022-07-08)

    Highlights

    👂 Listeners: enable more interactive workflows between client and server

    Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

    You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

    We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

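    import numpy as np
    import rubrix as rb
    from rubrix.listeners import listener
    from sklearn.metrics import accuracy_score
    
    # NOTE: trec, DATASET_NAME, NUM_SAMPLES, active_learner and dataset_test
    # are assumed to be defined beforehand, as in the tutorial.
    
    # Define some helper variables
    LABEL2INT = trec["train"].features["label-coarse"].str2int
    ACCURACIES = []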
    
    # Set up the active learning loop with the listener decorator
    @listener(
        dataset=DATASET_NAME,
        query="status:Validated AND metadata.batch_id:{batch_id}",
        condition=lambda search: search.total==NUM_SAMPLES,
        execution_interval_in_seconds=3,
        batch_id=0
    )
    def active_learning_loop(records, ctx):
    
        # 1. Update active learner
        print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
        y = np.array([LABEL2INT(rec.annotation) for rec in records])
    
        # initial update
        if ctx.query_params["batch_id"] == 0:
            indices = np.array([rec.id for rec in records])
            active_learner.initialize_data(indices, y)
        # update with the prior queried indices
        else:
            active_learner.update(y)
        print("Done!")
    
        # 2. Query active learner
        print("Querying new data points ...")
        queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
        ctx.query_params["batch_id"] += 1
        new_records = [
            rb.TextClassificationRecord(
                text=trec["train"]["text"][idx],
                metadata={"batch_id": ctx.query_params["batch_id"]},
                id=idx,
            )
            for idx in queried_indices
        ]
    
        # 3. Log the batch to Rubrix
        rb.log(new_records, DATASET_NAME)
    
        # 4. Evaluate current classifier on the test set
        print("Evaluating current classifier ...")
        accuracy = accuracy_score(
            dataset_test.y,
            active_learner.classifier.predict(dataset_test),
        )
        ACCURACIES.append(accuracy)
        print("Done!")
    
        print("Waiting for annotations ...")
    

    📖 New docs!

    https://rubrix.readthedocs.io/

    Screenshot 2022-07-13 at 12 49 42

    🧱 extend_matrix: Weak label augmentation using embeddings

    This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

    Features

    Bug Fixes

    Documentation

    • #1512: change theme to furo (#1564, #1604) (98869d2), closes #1512
    • add 'how to prepare your data for training' to basics (#1589) (a21bcf3)
    • add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421
    • Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)
    • add pip version and dockertag as parameter in the build process (#1560) (73a31e2)

    You can see all work included in the release here

    • chore(docs): remove by @frascuchon
    • docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo
    • docs(#1512): change theme to furo (#1564, #1604) by @frascuchon
    • chore: set version by @frascuchon
    • feat(token-class): adjust token spans spaces (#1599) by @frascuchon
    • feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon
    • docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo
    • test: configure numpy to disable multi threading (#1593) by @frascuchon
    • docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo
    • feat(#1561): standardize icons (#1565) by @leiyre
    • Feat: Improve from datasets (#1567) by @dcfidalgo
    • feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo
    • docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon
    • refactor: remove words references in searches (#1571) by @frascuchon
    • ci: check conda env cache (#1570) by @frascuchon
    • fix(#1264): discard first space after a token (#1591) by @frascuchon
    • ci(package): regenerate view snapshot (#1600) by @frascuchon
    • fix(#1574): search highlighting for a single dot (#1592) by @leiyre
    • fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre
    • fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon
    • fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon
    • fix: compatibility with new dataset version (#1566) by @dcfidalgo
    • fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre
    • fix(#1545): highlight words with accents (#1550) by @leiyre
  • v0.15.0(Jun 8, 2022)

    0.15.0 (2022-06-08)

    🔆 Highlights

    🏷️ Configure datasets with a labeling scheme

    You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

    import rubrix as rb
    
    # Define labeling schema
    settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])
    
    # Apply settings to a new or already existing dataset
    rb.configure_dataset(name="my_dataset", settings=settings)
    
    # Logging to the newly created dataset triggers the validation checks
    rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
    #BadRequestApiError: Rubrix server returned an error with http status: 400
    

    Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
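
    To avoid the 400 error shown above, annotations can be checked against the label schema before logging. This is a minimal client-side sketch in plain Python; find_unknown_labels is a hypothetical helper and not part of Rubrix.

```python
from typing import Iterable, List


def find_unknown_labels(annotations: Iterable[str], label_schema: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated labels that are not in the schema."""
    allowed = set(label_schema)
    return sorted({label for label in annotations if label not in allowed})


label_schema = ["A", "B", "C"]
unknown = find_unknown_labels(["A", "D", "C", "D"], label_schema)
print(unknown)  # ['D']
```

    An empty result means every annotation belongs to the schema, so logging should pass the server-side validation check.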

    🧱 Weak label matrix augmentation using embeddings

    You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

    Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
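
    The idea behind this kind of augmentation can be sketched in a few lines of plain Python. This is a simplified illustration, not Rubrix's implementation: extend_votes, the ABSTAIN value, and the toy embeddings are all assumptions. Where a rule abstains, the record takes the vote of its most similar labeled record, provided their cosine similarity reaches a threshold.

```python
import math

ABSTAIN = -1  # assumed marker for "the rule did not vote on this record"


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_votes(votes, embeddings, threshold):
    """Extend one rule's votes to similar records where the rule abstained."""
    extended = list(votes)
    labeled = [i for i, vote in enumerate(votes) if vote != ABSTAIN]
    for i, vote in enumerate(votes):
        if vote != ABSTAIN or not labeled:
            continue
        # Most similar record that the rule did label
        nearest = max(labeled, key=lambda j: cosine(embeddings[i], embeddings[j]))
        if cosine(embeddings[i], embeddings[nearest]) >= threshold:
            extended[i] = votes[nearest]
    return extended


votes = [1, ABSTAIN, ABSTAIN, 0]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
extended = extend_votes(votes, embeddings, threshold=0.95)
print(extended)  # [1, 1, -1, 0]
```

    The second record is close enough to the first to inherit its vote, while the third stays unlabeled because no labeled record is similar enough. A higher threshold trades coverage for fewer wrong votes.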

    🏛️ Tutorial Gallery

    Tutorials are now organized into different categories and with a new gallery design!

    Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

    🏁 Basics guide

    This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.

    Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

    Features

    • #1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134
    • #1432: configure datasets with a label schema (21e48c0), closes #1432
    • #1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446
    • #1460: include text hyphenation (#1469) (ec23b2d), closes #1460
    • #1463: change icon position in table header (#1473) (5172324), closes #1463
    • #1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467
    • configuraton: add elasticsearch ca_cert path variable (#1502) (f0eda12)
    • UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435

    Bug Fixes

    • #1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522
    • #1527: check agents instead labels for predicted computation (#1528) (2f2ee2e), closes #1527
    • #1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532
    • #1533: restrict highlighted fields (3a8b8a9), closes #1533
    • #1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)
    • #1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539
    • metrics: compute f1 for text classification (#1530) (147d38a)
    • search: highlight only textual input fields (8b83a82), closes #1538 #1544

    New contributors

    @RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413

  • v0.14.2(May 31, 2022)

    0.14.2 (2022-05-31)

    Bug Fixes

    • #1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514
    • #1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516
    • #1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517
    • #1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518
    • UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)
  • v0.14.1(May 20, 2022)

    0.14.1 (2022-05-20)

    Bug Fixes

    • #1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447
    • #1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472
    • #1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)
    • #1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503
    • documentation: fix user management guide (#1511) (63f7bee), closes #1501
    • filters: sort filter values by count (#1488) (0987167), closes #1484
  • v0.14.0(May 10, 2022)

    0.14.0 (2022-05-10)

    Async version of rb.log

    You can now use the background parameter of the rb.log method to log records without blocking the main process. The main use case is monitoring predictions in production pipelines. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):

    from bentoml import BentoService, api, artifacts, env
    from bentoml.adapters import JsonInput
    from bentoml.frameworks.spacy import SpacyModelArtifact
    
    import rubrix as rb
    
    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    
    @env(infer_pip_packages=True)
    @artifacts([SpacyModelArtifact("nlp")])
    class SpacyNERService(BentoService):
    
        @api(input=JsonInput(), batch=True)
        def predict(self, parsed_json_list):
            result, rb_records = ([], [])
            for index, parsed_json in enumerate(parsed_json_list):
                doc = self.artifacts.nlp(parsed_json["text"])
                prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
                rb_records.append(
                    rb.TokenClassificationRecord(
                        text=doc.text,
                        tokens=[t.text for t in doc],
                        prediction=[
                            (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                        ],
                    )
                )
                result.append(prediction)
    
            rb.log(
                name="monitor-for-spacy-ner",
                records=rb_records,
                tags={"framework": "bentoml"},
                background=True,
                verbose=False
            ) # By using the background=True, the model latency won't be affected
    
            return result
    

    Confidence scores in Token Classification (NER)

    When storing entity predictions, you can attach a score as the last element of the entity tuple: (label, char_start, char_end, score). Let's see an example:

    import rubrix as rb
    
    text = "Rubrix is a data science tool"
    
    record = rb.TokenClassificationRecord(
        text=text,
        tokens=text.split(" "),
        prediction=[("PRODUCT", 0, 6, 0.99)],
    )
    
    rb.log(record, "ner_with_scores")
    

    Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:

    Screenshot 2022-05-12 at 11 49 43

    If you want to see this in action, check this blog post by David Berenstein:

    https://www.rubrix.ml/blog/concise-concepts-rubrix/
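
    The same kind of score-based triage can be sketched on the client side, before logging. This is a minimal example in plain Python: low_score_entities is a hypothetical helper, and the "FIELD" entity and its score are made up for illustration.

```python
from typing import List, Sequence, Tuple


def low_score_entities(
    text: str, prediction: Sequence[tuple], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return (mention, label, score) for entities scored below the threshold."""
    flagged = []
    for entity in prediction:
        if len(entity) < 4:
            continue  # entity tuple without a score: nothing to compare
        label, start, end, score = entity
        if score < threshold:
            flagged.append((text[start:end], label, score))
    return flagged


text = "Rubrix is a data science tool"
prediction = [("PRODUCT", 0, 6, 0.99), ("FIELD", 12, 24, 0.41)]
flagged = low_score_entities(text, prediction, threshold=0.5)
print(flagged)  # [('data science', 'FIELD', 0.41)]
```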

    Rule metrics sidebar

    We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.

    This sidebar should help you quickly understand your progress:

    Screenshot 2022-05-12 at 11 52 10

    See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html

    Features

    Bug Fixes

  • v0.13.3(Apr 27, 2022)

  • v0.13.2(Apr 12, 2022)

    0.13.2 (2022-04-12)

    Bug Fixes

  • v0.13.1(Apr 1, 2022)

  • v0.13.0(Mar 30, 2022)

    0.13.0 (2022-03-30)

    🗂 Multilabel weak supervision

    You can now build multilabel text classification datasets using query-based rules.

    If you want to get started, check out this tutorial.

    https://user-images.githubusercontent.com/1107111/160930404-7b909f1e-b871-4e4c-b1c8-ea9eabfcad21.mp4

    🤗 Reading Hugging Face datasets from the Hub

    You can now read ANY text classification, NER, or text2text dataset directly from the Hub and load it into Rubrix.

    To understand how Rubrix datasets work check out this guide.

    👥 Redesigned team workspaces

    Organizing teams and datasets is a key Rubrix feature. After several rounds of feedback with early users, we've completely redesigned the user experience. Let us know what you think.

    You can get started and configure users and workspaces by following this guide.

    🔎 Guide for the query language and model

    We have included a new in-depth guide about the Lucene-based query language and data model used for search, weak labeling, loading subsets of data, and metrics.

    Features

    Bug Fixes

  • v0.12.1(Mar 11, 2022)

  • v0.11.1(Mar 11, 2022)

  • v0.12.0(Mar 8, 2022)

    0.12.0 (2022-03-08)

    Features

    Bug Fixes

  • v0.11.0(Feb 20, 2022)

    0.11.0 (2022-02-19)

    Highlights

    Introducing rb.Dataset* and 🤗 Hub integration

    The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.

    With this release, Rubrix users and teams can use the Hugging Face Hub to share and read both public and private Rubrix datasets for TextClassification, TokenClassification, and Text2Text datasets. This opens up a whole new world of possibilities for data reproducibility and sharing. Let's see an example:

    import rubrix as rb
    from datasets import load_dataset
    
    # 👧🏻 🏷️ Leire has labeled a text classification dataset using a local Rubrix instance
    dataset_rb = rb.load("text_classification_ds", as_pandas=False)
    
    # 👧🏻 exports a Rubrix Dataset to a hf Dataset
    dataset_ds = dataset_rb.to_datasets()
    
    # 👧🏻 🚀 Leire shares the labelled dataset with the world 
    dataset_ds.push_to_hub("text_classification_ds")
    
    # 👨 John downloads the dataset from the Hugging Face Hub
    dataset_ds = load_dataset("leire/text_classification_ds", split="train")
    
    # 👨 reads in dataset
    dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification")
    
    # 👨 🏷️ logs the dataset and continues labeling with his own Rubrix instance
    rb.log(dataset_rb, "john_text_classification_ds")
    

    You can read more at https://rubrix.readthedocs.io/en/stable/guides/datasets.html

    For each record type, there’s a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.

    Improving NER UI and UX

    The UI for Token Classification has been completely redesigned to provide a better user experience for exploration and annotation. This is the first of a set of changes focusing on annotation productivity for token classification.

    Screenshot 2022-02-21 at 12 39 22

    Features

    Bug Fixes

    • #1140: fix/make client models more consistent (#1147) (926bb16), closes #1140
    • client: parse unauthorized api error properly (#1164) (1a5a08d)
    • search: prevent metrics computation breaks searches (#1175) (9f2adc9)
  • v0.10.0(Feb 20, 2022)

    0.10.0 (2022-02-12)

    Now you can use filters in the Define Rules mode (weak labeling). These filters are useful for seeing the impact of rules on specific dataset subpopulations/subsets (e.g., with certain metadata fields, annotated records, etc.):

    Screenshot 2022-02-14 at 11 56 27

    Features

    Bug Fixes

    • #1054: reduce collapsable area. Optimize for annotation (#1106) (48024ba), closes #1054
    • #1054: remove old scroll padlock button (a1d6444), closes #1054
    • #1094: remove computed record fields returned in API results (#1095) (cd61d1e), closes #1094
    • #831: Remove sort field when only one is applied (#1116) (36b276b), closes #831
    • convert pd.NaT to None for event_timestamp (#1105) (21e78e4)
  • v0.9.0(Feb 4, 2022)

    🎉 0.9.0 (2022-02-02)

    • Improve logging
    • Small improvements to the labelling module and weak labeling mode
    • Better setup documentation (python -m rubrix)

    Features

    • #932: label models now modify the prediction_agent when calling LabelModel.predict (#1049) (4a024ee), closes #932
    • #953: add additional metrics to LabelModel.score method (#979) (2887907), closes #953
    • #955: add default for rules in WeakLabels (#976) (34389d3), closes #955 #1011

    Bug Fixes

  • v0.8.2(Jan 31, 2022)

    0.8.2 (2022-01-31)

    Features

    • #1036: remove prediction ok/ko in labelling rules (#1037) (672b852), closes #1036
    • #735: add warning when agent but no prediction/annotation is provided (#987) (ba88c34), closes #735

    Bug Fixes

  • v0.8.1(Jan 20, 2022)

    0.8.1 (2022-01-20)

    Bug Fixes

    • #1002: Show 0 records overall metrics when no rules defined (#1013) (a8a5c79), closes #1002 #1002
    • Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844
    • statics: handle 404 errors for static files (#1006) (f4b656a)
    • #800: compute common aggregations one by one (#990) (8cf420a), closes #800
    • #800: limit number of metadata fields (#993) (bb6b76b), closes #800
    • #905: copy dataset with rules (#948) (8597b83), closes #905
    • #974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974
    • #977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977
  • v0.8.1-alpha.3(Jan 20, 2022)

  • v0.8.1-alpha.2(Jan 20, 2022)

  • v0.8.1-alpha.1(Jan 19, 2022)

  • v0.8.1-alpha.0(Jan 19, 2022)

  • v0.8.0(Jan 12, 2022)

    Introducing interactive Weak labeling (Define rules mode) 🚀

    We are glad to introduce the most important feature to date: now it's possible to iterate on labeling queries directly in the UI with initial support for multi-class text classification. Multilabel and token classification support is coming soon.

    See the video for the recommended workflow:

    https://user-images.githubusercontent.com/1107111/149346471-93cbd7ee-96a2-451a-8f5e-f9e26b246407.mp4

    Check the updated tutorial: https://rubrix.readthedocs.io/en/master/tutorials/weak-supervision-with-rubrix.html
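
    To give a feel for the metrics shown while iterating on a rule, here is a rough sketch in plain Python. It is not how Rubrix computes them: rule_metrics is a hypothetical helper, and a simple keyword match stands in for the labeling query. Coverage is the share of records the rule matches, and precision is the share of matched, annotated records whose annotation agrees with the rule's label (NaN when no matched record is annotated).

```python
import math


def rule_metrics(texts, annotations, keyword, label):
    """Compute coverage and precision for a keyword rule that assigns `label`."""
    matched = [i for i, text in enumerate(texts) if keyword in text.lower()]
    coverage = len(matched) / len(texts) if texts else 0.0
    # Only annotated records can confirm or contradict the rule
    annotated = [i for i in matched if annotations[i] is not None]
    correct = sum(1 for i in annotated if annotations[i] == label)
    precision = correct / len(annotated) if annotated else math.nan
    return {"coverage": coverage, "precision": precision}


texts = ["Great movie", "Terrible plot", "great cast, weak script", "Not for me"]
annotations = ["positive", "negative", "negative", None]
metrics = rule_metrics(texts, annotations, keyword="great", label="positive")
print(metrics)  # {'coverage': 0.5, 'precision': 0.5}
```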

    What's changed

    • [WeakSupervision] Change load_rules import path in guide and tutorial (#939)
    • fix links to new web app reference (#936)
    • Bugfixes/avoid infinite loop when dataset loading (#934)
    • show nan instead of 0 for precision in summary (#930)
    • fix(api): include_metrics param only for search endponts (#929)
    • [Documentation] Update title page video for docs (#928)
    • update skweak tutorial (#922)
    • [Documentation] Updating the web app docu (#827)
    • publish python package to test.pypi for master and releases branches (#927)
    • [WeakLabels] Align WeakLabels.summary() with web app (#925)
    • UI: show rules without precision properly (#919)
    • chore(build): build docker images for release branches (#921)
    • Docs: Updates readme front video (#923)
    • Docs: Updates weak supervision resources (#920)
    • feat(rules): compute total & ann. coverage before label selection (#916)
    • fix(rules): compute annotated coverage when no label properly (#915)
    • Tutorial: Human-in-the-loop weak supervision with skweak (#869)
    • UI: include affected #records to overall coverage/ann. coverage metrics (#914)
    • fix lint build (#913)
    • UI: manage precision and rules without annotation coverage (#909)
    • fix(#876): process 400 response detail properly (#889)
    • feat(rules): allow compute partial query rule metrics (#907)
    • fix(security): providing default workspace should pass check (#911)
    • UI: reset filters from define rules view (#908)
    • UI: Show number of created rules in rules management view (#910)
    • UI: drop access to rule name field (#904)
    • fix(rules): prevent lost rules with dataset updates (#892)
    • fix(datasets): process owner as part of dataset id (#870)
    • (UI) Rules summary metrics format (#888)
    • UI: Improve code snippet for empty workspace (#886)
    • fix(UI): Remove case sensitive when filtering labels (#882)
    • Docs: Updates Flair zeroshot tutorial (#887)
    • removing wrong video (#885)
    • Update readme (#883)
    • fix(UI) Metrics value by default if no metric (#875)
    • feat(metrics): add token level metrics for token classification from client (#849)
    • UI: New rule metrics layout (#861)
    • chore: expose load_rules from base module (#866)
    • Docs: Regenerates graphs metrics guide (#865)
    • updating loss video (#864)
    • Docs: Update weak supervision guide (#863)
    • Update README.md (#862)
    • Fix: Link loss tutorial (#859)
    • Docs: Improve loss tutorial (#858)
    • Docs: Improve AL and ws tutorials (#857)
    • chore(ci): Include component testing configuration (#839)
    • fix/loss video updated (#853)
    • Docs: Weak supervision guide update (#855)
    • chore(app): upgrade lint dependencies (#841)
    • feat: weak supervision mode (#814)
    • Docs: Review hf tutorial (#852)
    • fix: error link to workspace home (#845)
    • fix(metrics): compute token length for each token (#850)
    • add streaming (#851)
    • fix(rules): prevent division by 0 for overall metrics (#848)
    • small change
    • [Tutorials] Update media structure, remove TLDR heading (#847)
    • Updating videos and images for sentiment classification tutorial (#846)
    • fix(rules): prevent division by zero (#843)
    • new folder and videos for model loss tutorial (#805)
    • feat(token class): add metrics at token level (#838)
    • new folder and images for active learning tutorial (#796)
    • [Tutorials] Typo fix in find label errors tutorial (#842)
    • [Tutorials] Add the new find_label_errors tutorial (#833)
    • [Rule] Modify the client API to the server's weak supervision feature (#840)
    • [LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)
    • feat (search): allow to filtering record metrics fields in search (#837)
    • fix(ui): remove workspace home from code snippet api url (#834)
    • ui: Hide validate button for binary cases in Text classifier (#830)
    • fix print message (#829)
    • feat: Include workspace in url path (#820)
    • fix(ui): align records and global action layouts #825
    • fix(ui): Show labels as selected after validate (#826)
    • feat(labeling rule): implements api endpoint to fetch a single rule (#817)
    • [LabelErrors] Add find_label_errors method (#775)
    • fix(ui): Fix styles in Safari (#815)
    • docs: Add contributors to readme (#822)
    • add missing rubrix import (#819)
    • new folder and images for spacy tutorial (#794)
    • feat(labeling rules): allow edition for rule label and description (#813)
    • refactor(labeling rules): optional label for rule metrics (#811)
    • Fix token alignment on CreationTokenClassificationRecord (#812)
    • feat(server): add overall dataset labeling rules metrics (#807)
    • feat(labeling rules): add coverage for annotated records (#806)
    • fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)
    • new folder and images for zeroshot ner tutorial (#804)
    • new folder and images for zeroshot data annotation tutorial (#803)
    • fix(log): check multi-label integrity without search aggregations (#802)
    • updated images, added folder for fastapi tutorial (#801)
    • added folder for weak supervision tutorial (#795)
    • feat(weak supervision): client labeling rules from server (#799)
    • feat(server): labeling rule metrics (#790)
    • fix/edit zero-shot tutorial (#774)
    • fix/edited fastapi tutorial (#773)
    • Fix/edit ner flair tutorial (#766)
    • Fix/edit weaksupervision tutorial (#759)
    • fix(ui): Little changes in fonts (#793)
    • fix(ui): Allow open dataset in new tab from datasets list (#792)
    • feat(server): rubrix namespaces for elasticsearch indices (#789)
    • fix(ui): Show annotation after global validation (#786)
    • remove reload arg launching server using python (#787)
    • updated readme with conda install instruction (#788)
    • fix(ui): Hide scroller component when loading or paginate (#784)
    • fix(ui): allow remove metadata filter from record metadata modal (#772)
    • fix(ui): Token Classifier: validate record without annotation or prediction (#782)
    • Fix/edit active learning tutorial (#760)
    • Docs:minor changes to loss tutorial (#778)
    • Fix/edit model loss tutorial (#767)
    • fix(server): missing deprecated dep (#777)
    • fix(ui): Global validate for records without annotation or prediction (#746)
    • Fix/edit spacy tutorial (#758)
    • Fix/edit labeling tutorial (#750)
    • fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)
    • [Requirements] Require python>=3.7 (#770)
    • [Labeling] Add FlyingSquid label model (#755)
    • Update README.md (#769)
    • Adds Flair example to guide (#762)
    • docs: Updates huggingface examples and adds monitor for Flair (#761)
    • feat(search): show boolean values in metadata (#753)
    • feat(server): allow handle labeling rules for datasets from API (#744)
    • fix(imports): import monitoring with spacy<3.0 fails (#754)
    • [UI] new fonts families (#751)
    • fix(scroll): using new scroll component (#710)
    • fix(ui): filter "validatable" records for global action validate button (#741)
    • feat(monitor): flair ner auto-monitor (#738)

    New Contributors

    • @sugatoray made their first contribution
    • @ruanchaves made their first contribution
  • v0.8.0-alpha.1 (Jan 11, 2022)

    • Bugfixes/avoid infinite loop when dataset loading (#934)
    • show nan instead of 0 for precision in summary (#930)
    • fix(api): include_metrics param only for search endpoints (#929)
    • [Documentation] Update title page video for docs (#928)
    • update skweak tutorial (#922)
    • [Documentation] Updating the web app docu (#827)
    • revert test.pypi publish
    • publish python package to test.pypi for master and releases branches (#927)
    • [WeakLabels] Align WeakLabels.summary() with web app (#925)
    • UI: show rules without precision properly (#919)
    • chore(build): build docker images for release branches (#921)
    • Docs: Updates readme front video (#923)
    • Docs: Updates weak supervision resources (#920)
    • feat(rules): compute total & ann. coverage before label selection (#916)
    • fix(rules): compute annotated coverage when no label properly (#915)
    • Tutorial: Human-in-the-loop weak supervision with skweak (#869)
    • UI: include affected #records to overall coverage/ann. coverage metrics (#914)
    • fix lint build (#913)
    • UI: manage precision and rules without annotation coverage (#909)
    • fix(#876): process 400 response detail properly (#889)
    • feat(rules): allow compute partial query rule metrics (#907)
    • fix(security): providing default workspace should pass check (#911)
    • UI: reset filters from define rules view (#908)
    • UI: Show number of created rules in rules management view (#910)
    • UI: drop access to rule name field (#904)
    • fix(rules): prevent lost rules with dataset updates (#892)
    • fix(datasets): process owner as part of dataset id (#870)
    • (UI) Rules summary metrics format (#888)
    • UI: Improve code snippet for empty workspace (#886)
    • fix(UI): Remove case sensitive when filtering labels (#882)
    • Docs: Updates Flair zeroshot tutorial (#887)
    • removing wrong video (#885)
    • Update readme (#883)
    • fix(UI) Metrics value by default if no metric (#875)
    • feat(metrics): add token level metrics for token classification from client (#849)
    • UI: New rule metrics layout (#861)
    • chore: expose load_rules from base module (#866)
    • Docs: Regenerates graphs metrics guide (#865)
    • updating loss video (#864)
    • Docs: Update weak supervision guide (#863)
    • Update README.md (#862)
    • Fix: Link loss tutorial (#859)
    • Docs: Improve loss tutorial (#858)
    • Docs: Improve AL and ws tutorials (#857)
    • chore(ci): Include component testing configuration (#839)
    • fix/loss video updated (#853)
    • Docs: Weak supervision guide update (#855)
    • chore(app): upgrade lint dependencies (#841)
    • feat: weak supervision mode (#814)
    • Docs: Review hf tutorial (#852)
    • fix: error link to workspace home (#845)
    • fix(metrics): compute token length for each token (#850)
    • chore: improve dockerignore files
    • add streaming (#851)
    • fix(rules): prevent division by 0 for overall metrics (#848)
    • small change
    • [Tutorials] Update media structure, remove TLDR heading (#847)
    • Updating videos and images for sentiment classification tutorial (#846)
    • fix(rules): prevent division by zero (#843)
    • new folder and videos for model loss tutorial (#805)
    • feat(token class): add metrics at token level (#838)
    • new folder and images for active learning tutorial (#796)
    • [Tutorials] Typo fix in find label errors tutorial (#842)
    • [Tutorials] Add the new find_label_errors tutorial (#833)
    • [Rule] Modify the client API to the server's weak supervision feature (#840)
    • [LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)
    • feat(search): allow filtering record metrics fields in search (#837)
    • fix(ui): remove workspace home from code snippet api url (#834)
    • ui: Hide validate button for binary cases in Text classifier (#830)
    • fix print message (#829)
    • feat: Include workspace in url path (#820)
    • fix(ui): align records and global action layouts (#825)
    • fix(ui): Show labels as selected after validate (#826)
    • feat(labeling rule): implements api endpoint to fetch a single rule (#817)
    • [LabelErrors] Add find_label_errors method (#775)
    • fix(ui): Fix styles in Safari (#815)
    • docs: Add contributors to readme (#822)
    • add missing rubrix import (#819)
    • new folder and images for spacy tutorial (#794)
    • feat(labeling rules): allow edition for rule label and description (#813)
    • refactor(labeling rules): optional label for rule metrics (#811)
    • Fix token alignment on CreationTokenClassificationRecord (#812)
    • feat(server): add overall dataset labeling rules metrics (#807)
    • feat(labeling rules): add coverage for annotated records (#806)
    • fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)
    • new folder and images for zeroshot ner tutorial (#804)
    • new folder and images for zeroshot data annotation tutorial (#803)
    • fix(log): check multi-label integrity without search aggregations (#802)
    • updated images, added folder for fastapi tutorial (#801)
    • added folder for weak supervision tutorial (#795)
    • feat(weak supervision): client labeling rules from server (#799)
    • feat(server): labeling rule metrics (#790)
    • fix/edit zero-shot tutorial (#774)
    • fix/edited fastapi tutorial (#773)
    • Fix/edit ner flair tutorial (#766)
    • Fix/edit weaksupervision tutorial (#759)
    • fix(ui): Little changes in fonts (#793)
    • fix(ui): Allow open dataset in new tab from datasets list (#792)
    • feat(server): rubrix namespaces for elasticsearch indices (#789)
    • fix(ui): Show annotation after global validation (#786)
    • remove reload arg launching server using python (#787)
    • updated readme with conda install instruction (#788)
    • fix(ui): Hide scroller component when loading or paginate (#784)
    • fix(ui): allow remove metadata filter from record metadata modal (#772)
    • fix(ui): Token Classifier: validate record without annotation or prediction (#782)
    • Fix/edit active learning tutorial (#760)
    • Docs: minor changes to loss tutorial (#778)
    • Fix/edit model loss tutorial (#767)
    • fix(server): missing deprecated dep (#777)
    • fix(ui): Global validate for records without annotation or prediction (#746)
    • Fix/edit spacy tutorial (#758)
    • Fix/edit labeling tutorial (#750)
    • fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)
    • [Requirements] Require python>=3.7 (#770)
    • [Labeling] Add FlyingSquid label model (#755)
    • Update README.md (#769)
    • Adds Flair example to guide (#762)
    • docs: Updates huggingface examples and adds monitor for Flair (#761)
    • feat(search): show boolean values in metadata (#753)
    • feat(server): allow handle labeling rules for datasets from API (#744)
    • fix(imports): import monitoring with spacy<3.0 fails (#754)
    • [UI] new fonts families (#751)
    • fix(scroll): using new scroll component (#710)
    • fix(ui): filter "validatable" records for global action validate button (#741)
    • feat(monitor): flair ner auto-monitor (#738)

    Full Changelog: https://github.com/recognai/rubrix/compare/v0.7.0...v0.8.0-alpha.0

Owner
Recognai
A software company building Natural Language Processing and Machine Learning tools