Rubrix is a free and open-source tool for exploring and iterating on data for artificial intelligence projects.

Recognai

Last update: Jan 7, 2023

Related tags

Data Visualization python data-science natural-language-processing artificial-intelligence knowledge-graph developer-tools annotation-tool human-in-the-loop mlops text-labeling ml-monitoring

Overview

Explore, label, and monitor data for AI projects

Rubrix is a free and open-source tool for exploring and iterating on data for artificial intelligence projects.

Rubrix focuses on enabling novel, human in the loop workflows involving data scientists, subject matter experts and ML/data engineers.

With Rubrix, you can:

Monitor the predictions of deployed models.
Label data for starting up or evolving an existing project.
Iterate on ground-truth and predictions to debug, track and improve your data and models over time.
Build custom applications and dashboards on top of your model predictions.

We've tried to make working with Rubrix easy and fun, while keeping it scalable and flexible.

Rubrix is composed of:

a Python library to bridge data and models, which you can install via pip.
a web application to explore and label data, which you can launch using Docker or directly with Python.

This is an example of Rubrix UI annotation mode:

📖 For more information, visit the documentation or if you want to get started, keep reading.

Get started

To get started you need to follow three steps:

Install the Python client
Launch the web app
Start logging data

1. Install the Python client

You can install the Python client with pip:

pip install rubrix

2. Launch the webapp

There are two ways to launch the webapp:

Using docker-compose (recommended).
Executing the server code manually

Using docker-compose (recommended)

Create a folder:

mkdir rubrix && cd rubrix

and launch the docker-contained web app with the following command:

wget -O docker-compose.yml https://raw.githubusercontent.com/recognai/rubrix/master/docker-compose.yaml && docker-compose up

This is the recommended way because it automatically includes an Elasticsearch instance, Rubrix's main persistent layer.

Executing the server code manually

When executing the server code manually you need to provide an Elasticsearch instance yourself.

First you need to install Elasticsearch (we recommend version 7.10) and launch an Elasticsearch instance. For MacOS and Windows there are Homebrew formulae and a msi package, respectively.
Install the Rubrix Python library together with its server dependencies:

pip install rubrix[server]

Launch a local instance of the Rubrix web app

python -m rubrix.server

By default, the Rubrix server will look for your Elasticsearch endpoint at http://localhost:9200. If you want to customize this, you can set the ELASTICSEARCH environment variable pointing to your endpoint.

3. Start logging data

The following code will log one record into the example-dataset dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(inputs="my first rubrix example"),
    name='example-dataset'
)

BulkResponse(dataset='example-dataset', processed=1, failed=0)

If you go to your Rubrix app at http://localhost:6900/, you should see your first dataset.

Congratulations! You are ready to start working with Rubrix with your own data.

To better understand what's possible take a look at Rubrix's Cookbook

Community

As a new open-source project, we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to use the Discussion forum or the Issues and we'll be pleased to help out.

Comments

Add monitoring examples with FastAPI: Hugging Face and spaCy

The idea would be to add a guide (as a Jupyter Notebook) to be included under docs/guides. This Jupyter notebook will showcase the RubrixHTTPMiddleware for monitoring the predictions of a FastAPI inference endpoint. Here is the example with Hugging Face + FastAPI:

from fastapi import FastAPI
from typing import List
from transformers import pipeline
from rubrix.client.asgi import RubrixLogHTTPMiddleware

classifier = pipeline("sentiment-analysis", return_all_scores=True)

app = FastAPI()

# define the middleware for logging predictions into a Rubrix Dataset
app.add_middleware(
    RubrixLogHTTPMiddleware,
    api_endpoint="/predict",
    dataset="monitoring_dataset_v1",
    # you could post-process the predict output with a custom record_mapper function
    # record_mapper=custom_text_classification_mapper,
)

# prediction endpoint
@app.post("/predict")
def predict_batch(batch: List[str]):
    predictions = classifier(batch)
    return [
        {
            "labels": [p["label"] for p in prediction],
            "probabilities": [p["score"] for p in prediction],
        }
        for prediction in predictions
    ]

The steps would be to:

Create a notebook and include the above example
Add an example with a pre-trained transformer TokenClassifier (for example: https://huggingface.co/dslim/bert-base-NER)
Add an example with a spaCy NER pipeline.
(Optionally) Include an example dashboard with Kibana (screenshots, gif or video)
(Optionally) Include an example with ray serve

documentation good first issue help wanted

opened by dvsrepo 19

updated readme with `conda` install instruction
This closes #781.

[x] added conda installation instruction (rubrix is available on conda-forge channel)

[x] added badges:

[x] conda-forge/rubrix (with version)

[x] conda-forge/rubrix (with platform specification): example -- "noarch"

[x] docs badge
opened by sugatoray 14
[NER Fine tuning] content selection
Multi word

Actual state : (VIEW SS) 1- I select various words, highlight is grey and in a solid block (Highlight/words). 2- When selection is done, highlight selection is splited and label selector appears.

[x] Should be:

1- I select various words, highlight is grey and splited (highlight/word) 2- When selection is done, highlight selection is a solid block label selector appears.

Delete labelling

[x] Make clicable the whole tooltip to delete

Selection on a searched word

[x] Selection highlight should not be cut (SS)

[x] When selection is containing a search word the label selector does not appear (how it works only on right>left sense)

[x] In general : change appearance of results : in place of Orange highlight show text in bold

Cursor

[x] Active "hand" cursor (pointer) on piece of text already annotated/Predicted

[x] Active "Text Select" cursor on the rest of record

[x] Enlarge the hover state to the whole area : (record + annotated tooltip + empty space between them) (record + predicted tooltip + empty space between us)

New Select label modal

[x] Integrate new UI modal

[x] In case of unique label, dont show modal, and just affect label after selecting text

[x] Add logic to show first and preselected the last label used

[x] Add following Keyboard shortcut: Enter to valid preselected label, and vertical arrow keyboard or Number to valid other labels
opened by Amelie-V 13
Add text2text example (e.g., text summarisation)

Add the text summarisation fine-tuning tutorial similar to sentiment classifier fine-tuning tutorial:

https://rubrix.readthedocs.io/en/stable/tutorials/06-labeling-finetuning.html#3.-Fine-tune-the-pre-trained-mode
documentation good first issue help wanted

opened by frascuchon 13
fix: Compute predicted properly for token classification [NEEDS_DATA_UPGRADE]

This PR fixes the way predicted ok/ko info is computed for token classification records.

To apply this fix to already created datasets, you must first re-log records. Otherwise, stored info won't be updated.

Closes #1955

opened by frascuchon 12
[Workspaces] Users without personal datasets

Users without personal datasets but that belongs to one or more workspaces which have datasets, should automatically change to one of those workspace?

Better to show all datasets from all workspaces in datasets list allowing to filter by workspace?
question app

opened by frascuchon 11
[Text Class] Optimize Long records view *Prioritary*
[x] Show labels buttons area above the fold.

[x] Create Action to open/close on click the full record in the same view

[x] Copy "Show full record" "Show less"

[ ] I would grap the opportunity to update the "View more" "view less" on Metrics modal to "Show more" "Show less" and apply the same style there

enhancement
opened by Amelie-V 11
[Search] Improve and normalizes the search data model
Things to keep in mind:

Normalize text inputs fields: text, inputs, words must be normalized and use a common pattern for all tasks

Several es analyzers for text fields: standard and whitespace(?) for fine tuning searches. Default as standard

What about text fields in metadata ? For now, only terms queries are supported. It's mean that metadata fields with large content are not enabled to be queries as full text search.

Created indices should contain mapping info only for its fields. A text classification index should not include mapping info for tokens or text predicted (text2text).

Review filter fields and align with UI names (if any)

What about nested fields? like token or metrics info for token classification, or label and its score for text classification. As default, query string dsl does not support nested queries, but it could be nice include some minimal support for that kind of queries.

@dvsrepo @dcfidalgo Anything to include here?

Tasks

To achieve to do the work, we need tackle following tasks (that will be created as separated issues and linked here)

[Datasets] Avoid using global template for all indices

[Datasets] Dataset migration mechanisms for each release

[Datasets] New es document model per task with backward compatibility fields

[Datasets] Apply migration to new es doc model

[Datasets] Build searches and aggregations using new doc model

enhancement server
opened by frascuchon 11
Devise workflow to test the tutorials via a github action

The idea here is to devise a workflow to test our tutorials in a semi-automatic way. Ideally, we have a workflow that we can launch manually and let's say every two weeks or so, to test our tutorials. Maybe we can use nbmake for this and follow this blogpost. The tricky part is that for some tutorials we need to change/add/delete a few cells to be able to run them in an automated way ...
documentation good first issue help wanted

opened by dcfidalgo 10
[Weak supervision] Rules numbers by label
For instance:

Sci/tech 2 Sports 1 Business 4 Politics 0 World 0

his feature could be used for two things:

Help to know how is going the rule definition

See the full label list (in "define rules" we dont have this list by default)

ui
opened by Amelie-V 9
Any plan to support no-whitespace language?

I am planning to use rubrix for Japanese text data. The search functionality doesn't seem to work well on this language. I think it's better if we can customize the tokenizer used in elasticsearch instead of hardcoded "whitespace" tokenizer.

opened by faisalron 9
feat: get keywords `metric` from Python client

Is your feature request related to a problem? Please describe. The keywords metric is not retrievable via the Python client.

Describe the solution you'd like argilla.metrics.commons.keywords

Describe alternatives you've considered N.A.

Additional context N.A.
enhancement

opened by davidberenstein1957 0
feat: annotator specific `metrics`
Is your feature request related to a problem? Please describe. N.A.

Describe the solution you'd like Sometimes I want to see metrics for a specific annotator

alignment with predictions

alignment with other annotator

distribution of labels assigned

n_labels assigned

multi-label TextClassification

Token Classification

records discarded

annotation speed

time spend annotating

Describe alternatives you've considered N.A.

Additional context Add any other context or screenshots about the feature request here.
enhancement
opened by davidberenstein1957 0
feat: `n-gram` keywords `metrics`

Is your feature request related to a problem? Please describe. Sometimes, singular keywords don`t capture enough information.

Describe the solution you'd like I think it might be interesting to also allow for n-grams within the keywords metric. It might be interesting to be able to distinguish between: "not good" vs "good" vs "very good" vs "not very good".

Describe alternatives you've considered N.A.

Additional context N.A.
enhancement

opened by davidberenstein1957 0
`prepare_for_training` does not work for multi-label dataset
Describe the bug I cannot use multi-label dataset.prepare_for_training directly.

To Reproduce Steps to reproduce the behavior:

Go to any multi-label dataset.

export the dataset via prepare_for_training

use for training directly

Expected behavior multi-label datasets ought to be delivered with https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html or they cannot be used for training

Screenshots N.A.

Environment (please complete the following information):

OS [e.g. iOS]: N.A.

Browser [e.g. chrome, safari]: N.A.

Argilla Version [e.g. 1.0.0]:1.1.1

ElasticSearch Version [e.g. 7.10.2]:N.A.

Docker Image (optional) [e.g. argilla:v1.0.0]:N.A.

Additional context N.A.
bug
opened by davidberenstein1957 0
add `prepare_for_training` for `sparknlp`

Is your feature request related to a problem? Please describe. There is no integration with sparknlp

Describe the solution you'd like I would like to see a better integration of sparknlp with Argilla. "David Berenstein Daniel Vila Suero hey.. you can probably integrate your solution with Spark NLP pipelines as well.. please see this blogpost to see several deployment solutions supported https://medium.com/spark-nlp/deploying-spark-nlp-for-healthcare-from-zero-to-hero-88949b0c866d and here are all the healthcare NLP related notebooks https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare"

Describe alternatives you've considered N.A.

Additional context https://www.linkedin.com/feed/update/urn:li:activity:7016072187870646272/?commentUrn=urn%3Ali%3Acomment%3A(activity%3A7016072187870646272%2C7016076058416295936)&dashCommentUrn=urn%3Ali%3Afsd_comment%3A(7016076058416295936%2Curn%3Ali%3Aactivity%3A7016072187870646272)&dashReplyUrn=urn%3Ali%3Afsd_comment%3A(7016095045325856768%2Curn%3Ali%3Aactivity%3A7016072187870646272)&replyUrn=urn%3Ali%3Acomment%3A(activity%3A7016072187870646272%2C7016095045325856768)
enhancement

opened by davidberenstein1957 0
change password via UI

I have tried EXPORT $ARGILLA.... and following the example closely, but to no avail. I am wondering if it would be better to just allow people to change their password/add users via a webform.
enhancement

opened by alanpaulkwan 1

Releases(v1.1.1)

v1.1.1(Nov 29, 2022)
1.1.1 (2022-11-29)

Bug Fixes

Set proper telemetry version (#1988) (d302891)

Documentation

Fix metric function imports in the example (#1966) (a1f6f6e), closes #1962

Source code(tar.gz)
Source code(zip)
v1.1.0(Nov 24, 2022)
1.1.0 (2022-11-24)

Highlights

Add, update, and delete rules from a Dataset using the Python client

You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

# Read a file with keywords or phrases labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv") # Create rules predefined_labeling_rules = [] for index, row in labeling_rules_df.iterrows(): predefined_labeling_rules.append( Rule(row["query"], row["label"]) ) # Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules

You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels

Sort by timestamp fields in the UI

Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes

Features

#1929 add warning about using wrong hostnames (#1930) (a3bc554)

Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855

Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918

Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835

Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)

Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761

Update error page (#1932) (caeb7d4), closes #1894

Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

Bug Fixes

Avoid closing the score filter when dragging the slider (#1822) (91a72c5), closes #1802

Change method for Doc creation by spacy.Language (#1891) (6264983), closes #1890

DAO: datasets dao filter datasets by tasks (#1934) (937b410)

docker: Prevent wrong elastic server for wait-for-it (c6a10c7)

Improve access to label list in Text Classification (#1916) (24729bd), closes #1804

Improve explanation readability (#1815) (52c712e), closes #1774

Monitoring: Serializable log middleware (#1908) (53a57f7)

Move "Show less" button to the end of entities list (#1875) (6d796a4), closes #1779

Remove "Help explain button" in Manage rule view (#1909) (8bc70b0), closes #1807

Remove extra html when text is not highlighted (#1904) (7858dc5), closes #1758

Remove extra type when highlighting the query in the text (#1863) (341c581), closes #1758

Documentation

change iframe for mp4 (dfac8b2)

corrected for iframe (935f586)

Link key features (#1805) (#1809) (4c83604)

resolved miss-direction and old naming in README.md (f45fe1e)

Update README links linkedin and twitter (#1797) (2d4d03a)

As always, thanks to our amazing contributors!

docs: Link key features (#1805) (#1809) by @chschroeder

View Docs link in frontend header users.vue (#1915) by @bengsoon

fix: Change method for Doc creation by spacy.Language (#1891) by @jamnicki

Source code(tar.gz)
Source code(zip)
v1.0.1(Nov 4, 2022)
1.0.1 (2022-11-04)

Bug Fixes

Remove the extra letter "y" (#1814) (f3d5d2e), closes #1811

Update vue-virtual-scroller dependency version (#1813) (147dc8d), closes #1806 #1782 #1816

Documentation

corrected for tutorial and api redirections (#1820) (26ccdcc)

Source code(tar.gz)
Source code(zip)
v0.19.0(Oct 24, 2022)

Source code(tar.gz)
Source code(zip)
v0.18.0(Oct 5, 2022)
0.18.0 (2022-10-05)

⚡ Highlights

Better validation of token classification records

When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens. Before this release, it was difficult to understand and fix these errors because validation happened on the server side.

With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.

For example, the following record:

import rubrix as rb rb.TokenClassificationRecord( tokens=["I", "love", "Paris"], text="I love Paris!", prediction=[("LOC",7,13)] )

Will give you the following error message:

ValueError: Following entity spans are not aligned with provided tokenization Spans: - [Paris!] defined in ...love Paris! Tokens: ['I', 'love', 'Paris']

Delete records by query

Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for clean up and better dataset maintenance:

import rubrix as rb ## Delete by id rb.delete_records(name="example-dataset", ids=[1,3,5]) ## Discard records by query rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)

New tutorials

We have two new tutorials!

Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

Features

API: provide a dict for record annotations/predictions (#1658) (12b0f83)

Client: expose client extra headers in init function (#1715) (79f0529), closes #1706

Client: improve httpx errors handling (#1662) (85da336)

Client: validate token classification annotations in client (#1709) (936d1ca), closes #1579

Datasets: delete records by query (#1721) (bc9685d), closes #1714 #1737

Datasets: restrict dataset deletion only to creators and super-users (#1713) (c1bef9d), closes #1740

Server: Add server telemetry (#1687) (d7cc006)

Bug Fixes

'MajorityVoter.score' when using multi-labels (#1678) (0b94c86), closes #1628

Metadata limits: exclude subfields from mappings (#1700) (9f9650e), closes #1699

Normalizes the UnauthorizationError for the API response (#1748) (6a68048)

Search tag reset prior annotation (#1736) (dc0a17f), closes #1711

Visual enhancements

Align App UI with the design system (#1672) (67d6de8), closes #1670

Documentation

Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729

Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741

fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693

raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693

Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)

using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

As always, thanks to our amazing contributors!

refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander

feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander

fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder

refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia

docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen

refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen

Source code(tar.gz)
Source code(zip)
v0.17.0(Aug 22, 2022)
0.17.0 (2022-08-22)

⚡ Highlights

Preparing a training set in the spaCy DocBin format

prepare_for_training is a method that prepares a dataset for training. Before prepare_for_training prepared the data for easily training Hugginface Transformers.

Now, you can prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct !

With the example below, you can export your Rubrix dataset into a Docbin, save it to disk, and then use it with the spacy train command.

import spacy import rubrix as rb from datasets import load_dataset # Load annotated dataset from Rubrix rb_dataset = rb.load("ner_dataset") # Loading an spaCy blank language model to create the Docbin, as it works faster nlp = spacy.blank("en") # After this line, the file will be stored in disk rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")

You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

Load large datasets using batches

Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

Thanks to the awesome work by @maxserras. Now it's possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, a simple batch-iteration over the whole database can be done employing the from_id parameter in the rb.load method.

An example of reading the first 1000 records and the next batch of up to 1000 records:

import rubrix as rb dataset_batch_1 = rb.load(name="example-dataset", limit=1000) dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)

The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load

Larger pagination sizes for faster bulk review and annotation

Using filters and search for data annotation and review, some users are able to filter and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk annotate 50 and 100 records in each page.

Copy record text to clipboard

Sometimes is useful to copy the text in records to use inspect it or process it with another application. Now, this is possible thanks to the feature request by our great community member and contributor @Ankush-Chander !

Better error logging for generic errors

Thanks to work done by @Ankush-Chander and @frascuchon we now have more meaningful messages for generic server errors!

Features

Add new pagination size ranges (#1667) (5b4f1f2), closes #1578

Allow rb.load fetch records in batches passing the from_id argument (3e6344a)

Copy to clipboard the record text (#1625) (d634a7b), closes #1616

Error Logging: send error detail in response for generic server errors (#1648) (ad17631)

Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622

prepare_for_training supports spacy (#1635) (8587808)

Bug Fixes

Client: reusing the inner httpx client (#1640) (854a972), closes #1646

docker-compose.yaml: default volume and disable disk threshold (#1656) (05ae688), closes #1275

Encode rule name in Weak Labeling API requests (#1649) (4634df8), closes #1645

handle stream api connection errors gracefully (#1636) (a106ec4), closes #1559

Update progress bar when refreshing after adding new records (#1666) (7e0d915), closes #1590

Documentation

Add Slack support link in README's get started (#1688) (bef010c)

Adding Elasticsearch persistence to docker compose section (#1643) (ecdc854)

spacy DocBin cookbook (#1642) (bb98278), closes #420

Visual enhancements

Small visual adjustments for Text2Text record card (#1632) (9c87cf1), closes #1138

Improve card spacing (#1638) (fd4016a), closes #1624

You can see all work included in the release here

fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre

chore: configure miniconda for readthedocs builder by @frascuchon

style: Small visual adjustments for Text2Text record card (#1632) by @leiyre

feat: Copy to clipboard the record text (#1625) by @leiyre

docs: Add Slack support link in README's get started (#1688) by @dvsrepo

chore: update version by @frascuchon

feat: Add new pagination size ranges (#1667) by @leiyre

fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander

feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras

fix(Client): reusing the inner httpx client (#1640) by @frascuchon

feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon

docs: spacy DocBin cookbook (#1642) by @ignacioct

feat: prepare_for_training supports spacy (#1635) by @frascuchon

style: Improve card spacing (#1638) by @leiyre

docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras

chore: remove old rubrix client class (#1639) by @frascuchon

feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon

doc: show metric graphs in documentation (#1669) by @leiyre

fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon

fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre

Source code(tar.gz)
Source code(zip)
v0.16.1(Jul 22, 2022)
0.16.1 (2022-07-22)

Bug Fixes

'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631

Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623

Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

Documentation

Fix typo in Getting Started -> Concepts (#1618) (b236cb8), closes #1617

You can see all work included in the release here

fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo

fix: Display metadata in Text2Text dataset (#1626) by @leiyre

chore: set version by @dcfidalgo

docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo

fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre

Source code(tar.gz)
Source code(zip)
v0.16.0(Jul 8, 2022)
0.16.0 (2022-07-08)

Highlights

👂 Listeners: enable more interactive workflows between client and server

Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

from rubrix.listeners import listener from sklearn.metrics import accuracy_score # Define some helper variables LABEL2INT = trec["train"].features["label-coarse"].str2int ACCURACIES = [] # Set up the active learning loop with the listener decorator @listener( dataset=DATASET_NAME, query="status:Validated AND metadata.batch_id:{batch_id}", condition=lambda search: search.total==NUM_SAMPLES, execution_interval_in_seconds=3, batch_id=0 ) def active_learning_loop(records, ctx): # 1. Update active learner print(f"Updating with batch_id {ctx.query_params['batch_id']} ...") y = np.array([LABEL2INT(rec.annotation) for rec in records]) # initial update if ctx.query_params["batch_id"] == 0: indices = np.array([rec.id for rec in records]) active_learner.initialize_data(indices, y) # update with the prior queried indices else: active_learner.update(y) print("Done!") # 2. Query active learner print("Querying new data points ...") queried_indices = active_learner.query(num_samples=NUM_SAMPLES) ctx.query_params["batch_id"] += 1 new_records = [ rb.TextClassificationRecord( text=trec["train"]["text"][idx], metadata={"batch_id": ctx.query_params["batch_id"]}, id=idx, ) for idx in queried_indices ] # 3. Log the batch to Rubrix rb.log(new_records, DATASET_NAME) # 4. Evaluate current classifier on the test set print("Evaluating current classifier ...") accuracy = accuracy_score( dataset_test.y, active_learner.classifier.predict(dataset_test), ) ACCURACIES.append(accuracy) print("Done!") print("Waiting for annotations ...")

📖 New docs!

https://rubrix.readthedocs.io/

🧱 extend_matrix: Weak label augmentation using embeddings

This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

Features

#1561: standardize icons (#1565) (15254e7), closes #1561

#1602: new rubrix dataset listeners (#1507, #1586, #1583, #1596) (65747ab), closes #1602

Add 'extend_matrix' to the WeakMultiLabel class (#1577) (cf89311)

Improve from datasets (#1567) (2b0d607)

token-class: adjust token spans spaces (#1599) (0fb3576)

Bug Fixes

#1264: discard first space after a token (#1591) (eff0ac5), closes #1264

#1545: highlight words with accents (#1550) (c42e77b), closes #1545

#1548: access datasets for superusers when workspace is not provided (#1572, #1608) (0b04bc8), closes #1548

#1551: don't show error traces for EntityNotFoundError's (#1569) (04e101c), closes #1551

#1557: allow text editing when clicking the "edit" button (#1558) (e751414), closes #1557

#1574: search highlighting for a single dot (#1592) (53474a1), closes #1574

#1575: show predicted ok/ko in Text Classifier explore mode (#1576) (ada87c0), closes #1575

compatibility with new dataset version (#1566) (ac26e30)

Documentation

#1512: change theme to furo (#1564, #1604) (98869d2), closes #1512

add 'how to prepare your data for training' to basics (#1589) (a21bcf3)

add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421

Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)

add pip version and dockertag as parameter in the build process (#1560) (73a31e2)

You can see all work included in the release here

chore(docs): remove by @frascuchon

docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo

docs(#1512): change theme to furo (#1564, #1604) by @frascuchon

chore: set version by @frascuchon

feat(token-class): adjust token spans spaces (#1599) by @frascuchon

feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon

docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo

test: configure numpy to disable multi threading (#1593) by @frascuchon

docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo

feat(#1561): standardize icons (#1565) by @leiyre

Feat: Improve from datasets (#1567) by @dcfidalgo

feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo

docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon

refactor: remove words references in searches (#1571) by @frascuchon

ci: check conda env cache (#1570) by @frascuchon

fix(#1264): discard first space after a token (#1591) by @frascuchon

ci(package): regenerate view snapshot (#1600) by @frascuchon

fix(#1574): search highlighting for a single dot (#1592) by @leiyre

fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre

fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon

fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon

fix: compatibility with new dataset version (#1566) by @dcfidalgo

fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre

fix(#1545): highlight words with accents (#1550) by @leiyre

Source code(tar.gz)
Source code(zip)
v0.15.0(Jun 8, 2022)
0.15.0 (2022-06-08)

🔆 Highlights

🏷️ Configure datasets with a labeling scheme

You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

import rubrix as rb # Define labeling schema settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"]) # Apply seetings to a new or already existing dataset rb.configure_dataset(name="my_dataset", settings=settings) # Logging to the newly created dataset triggers the validation checks rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset") #BadRequestApiError: Rubrix server returned an error with http status: 400

Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html

🧱 Weak label matrix augmentation using embeddings

You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

🏛️ Tutorial Gallery

Tutorials are now organized into different categories and with a new gallery design!

Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

🏁 Basics guide

This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.

Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

Features

#1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134

#1432: configure datasets with a label schema (21e48c0), closes #1432

#1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446

#1460: include text hyphenation (#1469) (ec23b2d), closes #1460

#1463: change icon position in table header (#1473) (5172324), closes #1463

#1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467

configuraton: add elasticsearch ca_cert path variable (#1502) (f0eda12)

UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435

Bug Fixes

#1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522

#1527: check agents instead labels for predicted computation (#1528) (2f2ee2e), closes #1527

#1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532

#1533: restrict highlighted fields (3a8b8a9), closes #1533

#1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)

#1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539

metrics: compute f1 for text classification (#1530) (147d38a)

search: highlight only textual input fields (8b83a82), closes #1538 #1544

New contributors

@RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413
Source code(tar.gz)
Source code(zip)
v0.14.2(May 31, 2022)
0.14.2 (2022-05-31)

Bug Fixes

#1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514

#1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516

#1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517

#1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518

UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)

Source code(tar.gz)
Source code(zip)
v0.14.1(May 20, 2022)
0.14.1 (2022-05-20)

Bug Fixes

#1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447

#1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472

#1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)

#1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503

documentation: fix user management guide (#1511) (63f7bee), closes #1501

filters: sort filter values by count (#1488) (0987167), closes #1484

Source code(tar.gz)
Source code(zip)
v0.14.0(May 10, 2022)
0.14.0 (2022-05-10)

Async version of rb.log

You can now use the parameter background in the rb.log method to log records without blocking the main process. The main use case is monitoring production pipelines to do prediction monitoring. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):

from bentoml import BentoService, api, artifacts, env from bentoml.adapters import JsonInput from bentoml.frameworks.spacy import SpacyModelArtifact import rubrix as rb import spacy nlp = spacy.load("en_core_web_sm") @env(infer_pip_packages=True) @artifacts([SpacyModelArtifact("nlp")]) class SpacyNERService(BentoService): @api(input=JsonInput(), batch=True) def predict(self, parsed_json_list): result, rb_records = ([], []) for index, parsed_json in enumerate(parsed_json_list): doc = self.artifacts.nlp(parsed_json["text"]) prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents] rb_records.append( rb.TokenClassificationRecord( text=doc.text, tokens=[t.text for t in doc], prediction=[ (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents ], ) ) result.append(prediction) rb.log( name="monitor-for-spacy-ner", records=rb_records, tags={"framework": "bentoml"}, background=True, verbose=False ) # By using the background=True, the model latency won't be affected return result

Confidence scores in Token Classification (NER)

To store entity predictions you can attach a score using the last position of the entity tuple (label, char_start, char_end, score). Let's see an example:

import rubrix as rb text = "Rubrix is a data science tool" record = rb.TokenClassificationRecord( text=text, tokens=text.split(" "), prediction=[("PRODUCT", 0, 6, 0.99)] ) rb.log(record, "ner_with_scores")

Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:

If you want to see this in action, check this blog post by David Berenstein:

https://www.rubrix.ml/blog/concise-concepts-rubrix/

Rule metrics sidebar

We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.

This sidebar should help you quickly understand your progress:

See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html

Features

#1132: introduce async/background version of rb.log (#1391) (900307e), closes #1132

#1247: label models predict method returns DatasetForTextClassification (#1442) (42ca1be), closes #1247

#1379: show prediction score in NER (#1389) (0bdccd2), closes #1379 #1451

#961: rules metrics in sidebar (#1377) (261f53a), closes #961 #1408

home: improve table actions and styles (#1384) (f09746e), closes #1355 #1333

Bug Fixes

#1407: fix visualization in 1024px viewport (#1420) (46f8d4d), closes #1441

#1458: token classifier visualization in Safari (#1459) (01cc492), closes #1458

Source code(tar.gz)
Source code(zip)
v0.13.3(Apr 27, 2022)
0.13.3 (2022-04-27)

Bug Fixes

#1248: allow multiple label attributions in UI (#1424) (a9f8363), closes #1248

#1409: filtering by metadata with value list (#1415) (7aca061), closes #1409

#1410: apply dataset name pattern to user name (#1411) (2087c21), closes #1410

#1428: support cleanlab v2 (#1436) (d189ddb), closes #1428

TokenClassification: display characters between tokens words (#1418) (a08cd7b), closes #1414 #1383

Source code(tar.gz)
Source code(zip)
v0.13.2(Apr 12, 2022)
0.13.2 (2022-04-12)

Bug Fixes

#1265: persist pagination size after query (#1358) (49ca243), closes #1265

#1367: remove record text from metadata modal (#1385) (1782724), closes #1367

#1368: long list of entities in Token Classifier (#1388) (829269f), closes #1368 #1393

#1387: improve metadata distinct values computation (be9f68f), closes #1387

install: remove loguru dependency (#1372) (9e52414), closes #1331 #1305

search: compute dataset schema properly for advanced query dsl (#1380) (f71ab91)

visualization: force break word in selectors (#1406) (5ac1950)

Source code(tar.gz)
Source code(zip)
v0.13.1(Apr 1, 2022)
0.13.1 (2022-04-01)

Bug Fixes

#1244: compute capitalness based on python methods (#1359 #1371) (218f099), closes #1244

#1362: using active api method instead instance (#1363) (bcf446d), closes #1362

#1365: create rules with regex queries (#1369) (c2afc9c), closes #1365

Source code(tar.gz)
Source code(zip)
v0.13.0(Mar 30, 2022)
0.13.0 (2022-03-30)

🗂 Multilabel weak supervision

You can now build multilabel text classification datasets using query-based rules

If you want to get started, check out this tutorial.

https://user-images.githubusercontent.com/1107111/160930404-7b909f1e-b871-4e4c-b1c8-ea9eabfcad21.mp4

🤗 Reading Hugging Face datasets from the Hub

You can now read ANY text classification, NER, or text2text dataset directly from the Hub and load it into Rubrix.

To understand how Rubrix datasets work check out this guide.

👥 Redesigned team workspaces

Organizing teams and datasets is a key Rubrix feature. After several rounds of feedback with early users, we've completely redesigned the user experience. Let us know what you think.

You can get started and configure users and workspaces following this guide

🔎 Guide for the query language and model

We have included a new in-depth guide about the Lucene-based query language and data model used for search, weak labeling, loading subsets of data, and metrics.

Features

#1119: users without personal datasets (#1282) (555d41d), closes #1119 #1318 #1317 #1323 #1324

#1130: cleanup rb namespace by refactoring client API (#1160) (a0fdd8e), closes #1130

#1144: weak supervision for multilabel datasets (#1166) (fd95bae), closes #1144 #1190 #1237 #1233 #1326

datasets: simplify load flow from hf datasets with no rb format (#1234) (a6da1cd), closes #1327

#1180: show Rubrix version in the webapp (#1243) (8c71ad9), closes #1180 #1350 #1349

#1225: prepare tokenclass dataset for hf training (#1231) (ae5e7cd), closes #1225

#950: using record search_keywords for highlighting (#1235) (47616bf), closes #950 #1278 #1316 #1315

#981: add majority voter with multi label support (#1228) (8052aa8), closes #981

Introduce a 'text' argument for the TextClassificationRecord (#1246) (bb7d93e)

Bug Fixes

#1347: allow tooltip record overlapping in Token Classifier (#1352) (87174d3), closes #1347

#1103: remove "Error Distribution" from metrics (#1255) (b9bb5b4), closes #1103

#1149: fix vulnerable dependencies (node-sass) (#1263) (7f8c1d1), closes #1149

#1211: fix score scale (#1261) (8a72281), closes #1211

#1238: show prediction labels when annotating rule (#1239) (0321b88), closes #1238

#1241, #1245: show new line char in metrics plot & increase mentions in entity consistency (#1257) (38930cb), closes #1241 #1245

#1311: small defects about hover style (#1313) (442703c), closes #1311

#1320: render car return in Token Classifier (#1328) (b7f1b7b), closes #1320

#1335: force line break in rules summary (#1336) (2d77a76), closes #1335

#1337: number of records in the overall annotated coverage (#1338) (d384713), closes #1337

#1339: metrics and status not updated when the query is refreshed (#1340) (6fc0a58), closes #1339

#984: manage super user workspaces (#1268) (9b24921), closes #984 #1288 #1290

datasets: prevent error when no annotated records found in dataset (#1284) (c20028f)

install: make starlette an optional dependency (#1295) (32afb3d)

NER: create record annotation from tags (also in from_datasets) (#1283) (adcf1b1)

rules: store single-label rules with a comp. format for old versions (#1334) (eb310d3)

Source code(tar.gz)
Source code(zip)
v0.12.1(Mar 11, 2022)
0.12.1 (2022-03-11)

Bug Fixes

#1238: show prediction labels when annotating rule (#1239) (6c1b975), closes #1238

Source code(tar.gz)
Source code(zip)
v0.11.1(Mar 11, 2022)
0.11.1 (2022-03-11)

Bug Fixes

#1238: show prediction labels when annotating rule (#1239) (28e97c6), closes #1238

Source code(tar.gz)
Source code(zip)
v0.12.0(Mar 8, 2022)
0.12.0 (2022-03-08)

Features

#1029: improve server api logging (#1148) (d4a121a), closes #1029 #1224

#1183: token classification fine-tuning (#1199) (2cdd30b), closes #1183

#1192: disable ssl verify for elasticsearch http client (#1193) (631a729), closes #1192

#950: include search keywords as part of record results (#1201) (2dd5853), closes #950

#970: header redesign (#1185) (fa9c639), closes #970 #1218 #1214 #1223

Implement 'prepare_for_training' for text classification datasets (#1209) (f7fd59c)

Bug Fixes

#1207: using api sdk wrapper for init (#1208) (2495c75), closes #1207

Source code(tar.gz)
Source code(zip)
v0.11.0(Feb 20, 2022)
0.11.0 (2022-02-19)

Highlights

Introducing rb.Dataset* and 🤗 Hub integration

The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.

With this release, Rubrix users and teams can use the Hugging Face Hub to share and read both public and private Rubrix datasets for TextClassification, TokenClassification, and Text2Text datasets. This opens up a whole new world of possibilities for data reproducibility and sharing. Let's see an example:

import rubrix as rb from datasets import load_datasets # 👧🏻 🏷️ Leire has labeled a text classification dataset using a local Rubrix instance dataset_rb = rb.load("text_classification_ds", as_pandas=False) # 👧🏻 exports a Rubrix Dataset to a hf Dataset dataset_ds = dataset_rb.to_datasets() # 👧🏻 🚀 Leire shares the labelled dataset with the world dataset_ds.push_to_hub("text_classification_ds") # 👨 John downloads the dataset from the Hugging Face Hub dataset_ds = load_dataset("leire/text_classification_ds", split="train") # 👨 reads in dataset dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification") # 👨 🏷️ logs the dataset and continues labeling with his own Rubrix instance rb.log(dataset_rb, "john_text_classification_ds")

You can read more at https://rubrix.readthedocs.io/en/stable/guides/datasets.html

For each record type, there’s a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.

Improving NER UI and UX

The UI for Token Classification has been completely redesigned to provide a better user experience for exploration and annotation. This is the first of a set of changes focusing on annotation productivity for token classification.

Features

#1051: keep predictions labels when annotating (#1077) (f1824ba), closes #1051

#1063: Token Classifier fine tuning content selection (#1084) (9e14d05), closes #1063

#1127: raise startup app error from es connection error (#1145) (7e7e9d8), closes #1127

#422: introducing the rb.Dataset* classes (#1109) (b5bbca6), closes #422

#821: token classifier show predictions in explore view (#1009) (6ba6764), closes #821

#951: new "not covered records by rules" filter (#991) (0649f2a), closes #951 #1156

Bug Fixes

#1140: fix/make client models more consistent (#1147) (926bb16), closes #1140

client: parse unauthorized api error properly (#1164) (1a5a08d)

search: prevent metrics computation breaks searches (#1175) (9f2adc9)

Source code(tar.gz)
Source code(zip)
v0.10.0(Feb 20, 2022)
0.10.0 (2022-02-12)

Now you can use filters in the Define Rules mode (weak labeling). These filters are useful for seeing the impact of rules on specific dataset subpopulations/subsets (e.g., with certain metadata fields, annotated records, etc.):

Features

#1061: unify records results title (#1111) (54ebb15), closes #1061

#982: show filters in labelling rules view (#1038) (7ff677b), closes #982

Bug Fixes

#1054: reduce collapsable area. Optimize for annotation (#1106) (48024ba), closes #1054

#1054: remove old scroll padlock button (a1d6444), closes #1054

#1094: remove computed record fields returned in API results (#1095) (cd61d1e), closes #1094

#831: Remove sort field when only one is applied (#1116) (36b276b), closes #831

convert pd.NaT to None for event_timestamp (#1105) (21e78e4)

Source code(tar.gz)
Source code(zip)
v0.9.0(Feb 4, 2022)
🎉 0.9.0 (2022-02-02)

Improve logging

Small improvements to the labelling module and weak labeling mode

Better setup documentation (python -m rubrix)

Features

#932: label models now modify the prediction_agent when calling LabelModel.predict (#1049) (4a024ee), closes #932

#953: add additional metrics to LabelModel.score method (#979) (2887907), closes #953

#955: add default for rules in WeakLabels (#976) (34389d3), closes #955 #1011

Bug Fixes

#1045: calculate overall precision from overall correct/incorrect in rules (#1086) (1c76d81), closes #1045 #1087

#1053: metadata modal position (#1068) (09b88cc), closes #1053 #1053

#1054: optimize Long records (#1080) (fdd797a), closes #1054

#1067: fix rule definition link when no labels are defined (#1069) (eb958bf), closes #1067

#1081: prevent add records of different task (#1085) (5296e52), closes #1081 #1081

#924: parse new error format in UI (#1082) (f26c79c), closes #924

Source code(tar.gz)
Source code(zip)
v0.8.2(Jan 31, 2022)
0.8.2 (2022-01-31)

Features

#1036: remove prediction ok/ko in labelling rules (#1037) (672b852), closes #1036

#735: add warning when agent but no prediction/annotation is provided (#987) (ba88c34), closes #735

Bug Fixes

#1008: set the event_timestamp when annotating (#1024) (c24fdad), closes #1008

#1015: manage emojis in Token Classification records (#1016) (8b570fb), closes #1015

#1023: handle elasticsearch connection problems on server startup (#1030) (e8c8d86), closes #1023

#1027: Improve client models by reordering fields + forbidding extra args (#1032) (6c1ae7f), closes #1027

#1028: Add videos to Monitoring tutorial (#1033) (6ff3326), closes #1028

#1050: generalizes entity span validation (#1055) (37207bc), closes #1050

#1058: sort by % data in rules list (#1062) (9735f22), closes #1058

#1065: 'B' tag for beginning tokens (#1066) (a5ed329), closes #1065

cleanlab: set cleanlab n_jobs=1 as default (#1059) (189cbcb)

Source code(tar.gz)
Source code(zip)
v0.8.1(Jan 20, 2022)
0.8.1 (2022-01-20)

Bug Fixes

#1002: Show 0 records overall metrics when no rules defined (#1013) (a8a5c79), closes #1002 #1002

Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844

statics: handle 404 errors for static files (#1006) (f4b656a)

#800: compute common aggregations one by one (#990) (8cf420a), closes #800

#800: limit number of metadata fields (#993) (bb6b76b), closes #800

#905: copy dataset with rules (#948) (8597b83), closes #905

#974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974

#977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977

Source code(tar.gz)
Source code(zip)
v0.8.1-alpha.3(Jan 20, 2022)

Source code(tar.gz)
Source code(zip)
v0.8.1-alpha.2(Jan 20, 2022)
0.8.1-alpha.2 (2022-01-20)

Bug Fixes

#1002: Show 0 records overall metrics when no rules defined (#1007) (a890e17), closes #1002 #1002

Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844

statics: handle 404 errors for static files (#1006) (f4b656a)

Source code(tar.gz)
Source code(zip)
v0.8.1-alpha.1(Jan 19, 2022)

Source code(tar.gz)
Source code(zip)
v0.8.1-alpha.0(Jan 19, 2022)
0.8.1-alpha.0 (2022-01-19)

Bug Fixes

#800: compute common aggregations one by one (#990) (8cf420a), closes #800

#800: limit number of metadata fields (#993) (bb6b76b), closes #800

#905: copy dataset with rules (#948) (8597b83), closes #905

#974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974

#977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977

Source code(tar.gz)
Source code(zip)
v0.8.0(Jan 12, 2022)
Introducing interactive Weak labeling (Define rules mode) 🚀

We are glad to introduce the most important feature to date: now it's possible to iterate on labeling queries directly in the UI with initial support for multi-class text classification. Multilabel and token classification support is coming soon.

See the video for the recommended workflow:

https://user-images.githubusercontent.com/1107111/149346471-93cbd7ee-96a2-451a-8f5e-f9e26b246407.mp4

Check the updated tutorial: https://rubrix.readthedocs.io/en/master/tutorials/weak-supervision-with-rubrix.html

What's changed

[WeakSupervision] Change load_rules import path in guide and tutorial (#939)

fix links to new web app reference (#936)

Bugfixes/avoid infinite loop when dataset loading (#934)

show nan instead of 0 for precision in summary (#930)

fix(api): include_metrics param only for search endponts (#929)

[Documentation] Update title page video for docs (#928)

update skweak tutorial (#922)

[Documentation] Updating the web app docu (#827)

publish python package to test.pypi for master and releases branches (#927)

[WeakLabels] Align WeakLabels.summary() with web app (#925)

UI: show rules without precision properly (#919)

chore(build): build docker images for release branches (#921)

Docs: Updates readme front video (#923)

Docs: Updates weak supervision resources (#920)

feat(rules): compute total & ann. coverage before label selection (#916)

fix(rules): compute annotated coverage when no label properly (#915)

Tutorial: Human-in-the-loop weak supervision with skweak (#869)

UI: include affected #records to overall coverage/ann. coverage metrics (#914)

fix lint build (#913)

UI: manage precision and rules without annotation coverage (#909)

fix(#876): process 400 response detail properly (#889)

feat(rules): allow compute partial query rule metrics (#907)

fix(security): providing default workspace should pass check (#911)

UI: reset filters from define rules view (#908)

UI: Show number of created rules in rules management view (#910)

UI: drop access to rule name field (#904)

fix(rules): prevent lost rules with dataset updates (#892)

fix(datasets): process owner as part of dataset id (#870)

(UI) Rules summary metrics format (#888)

UI: Improve code snippet for empty workspace (#886)

fix(UI): Remove case sensitive when filtering labels (#882)

Docs: Updates Flair zeroshot tutorial (#887)

removing wrong video (#885)

Update readme (#883)

fix(UI) Metrics value by default if no metric (#875)

feat(metrics): add token level metrics for token classification from client (#849)

UI: New rule metrics layout (#861)

chore: expose load_rules from base module (#866)

Docs: Regenerates graphs metrics guide (#865)

updating loss video (#864)

Docs: Update weak supervision guide (#863)

Update README.md (#862)

Fix: Link loss tutorial (#859)

Docs: Improve loss tutorial (#858)

Docs: Improve AL and ws tutorials (#857)

chore(ci): Include component testing configuration (#839)

fix/loss video updated (#853)

Docs: Weak supervision guide update (#855)

chore(app): upgrade lint dependencies (#841)

feat: weak supervision mode (#814)

Docs: Review hf tutorial (#852)

fix: error link to workspace home (#845)

fix(metrics): compute token length for each token (#850)

add streaming (#851)

fix(rules): prevent division by 0 for overall metrics (#848)

small change

[Tutorials] Update media structure, remove TLDR heading (#847)

Updating videos and images for sentiment classification tutorial (#846)

fix(rules): prevent division by zero (#843)

new folder and videos for model loss tutorial (#805)

feat(token class): add metrics at token level (#838)

new folder and images for active learning tutorial (#796)

[Tutorials] Typo fix in find label errors tutorial (#842)

[Tutorials] Add the new find_label_errors tutorial (#833)

[Rule] Modify the client API to the server's weak supervision feature (#840)

[LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)

feat (search): allow to filtering record metrics fields in search (#837)

fix(ui): remove workspace home from code snippet api url (#834)

ui: Hide validate button for binary cases in Text classifier (#830)

fix print message (#829)

feat: Include workspace in url path (#820)

fix(ui): align records and global action layouts #825

fix(ui): Show labels as selected after validate (#826)

feat(labeling rule): implements api endpoint to fetch a single rule (#817)

[LabelErrors] Add find_label_errors method (#775)

fix(ui): Fix styles in Safari (#815)

docs: Add contributors to readme (#822)

add missing rubrix import (#819)

new folder and images for spacy tutorial (#794)

feat(labeling rules): allow edition for rule label and description (#813)

refactor(labeling rules): optional label for rule metrics (#811)

Fix token alignment on CreationTokenClassificationRecord (#812)

feat(server): add overall dataset labeling rules metrics (#807)

feat(labeling rules): add coverage for annotated records (#806)

fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)

new folder and images for zeroshot ner tutorial (#804)

new folder and images for zeroshot data annotation tutorial (#803)

fix(log): check multi-label integrity without search aggregations (#802)

updated images, added folder for fastapi tutorial (#801)

added folder for weak supervision tutorial (#795)

feat(weak supervision): client labeling rules from server (#799)

feat(server): labeling rule metrics (#790)

fix/edit zero-shot tutorial (#774)

fix/edited fastapi tutorial (#773)

Fix/edit ner flair tutorial (#766)

Fix/edit weaksupervision tutorial (#759)

fix(ui): Little changes in fonts (#793)

fix(ui): Allow open dataset in new tab from datasets list (#792)

feat(server): rubrix namespaces for elasticsearch indices (#789)

fix(ui): Show annotation after global validation (#786)

remove reload arg launching server using python (#787)

updated readme with conda install instruction (#788)

fix(ui): Hide scroller component when loading or paginate (#784)

fix(ui): allow remove metadata filter from record metadata modal (#772)

fix(ui): Token Classifier: validate record without annotation or prediction (#782)

Fix/edit active learning tutorial (#760)

Docs:minor changes to loss tutorial (#778)

Fix/edit model loss tutorial (#767)

fix(server): missing deprecated dep (#777)

fix(ui): Global validate for records without annotation or prediction (#746)

Fix/edit spacy tutorial (#758)

Fix/edit labeling tutorial (#750)

fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)

[Requirements] Require python>=3.7 (#770)

[Labeling] Add FlyingSquid label model (#755)

Update README.md (#769)

Adds Flair example to guide (#762)

docs: Updates huggingface examples and adds monitor for Flair (#761)

feat(search): show boolean values in metadata (#753)

feat(server): allow handle labeling rules for datasets from API (#744)

fix(imports): import monitoring with spacy<3.0 fails (#754)

[UI] new fonts families (#751)

fix(scroll): using new scroll component (#710)

fix(ui): filter "validatable" records for global action validate button (#741)

feat(monitor): flair ner auto-monitor (#738)

New Contributors

@sugatoray made their first contribution

@ruanchaves made their first contribution

Source code(tar.gz)
Source code(zip)
v0.8.0-alpha.1(Jan 11, 2022)
Bugfixes/avoid infinite loop when dataset loading (#934)

show nan instead of 0 for precision in summary (#930)

fix(api): include_metrics param only for search endponts (#929)

[Documentation] Update title page video for docs (#928)

update skweak tutorial (#922)

[Documentation] Updating the web app docu (#827)

revert test.pypi publish

publish python package to test.pypi for master and releases branches (#927)

[WeakLabels] Align WeakLabels.summary() with web app (#925)

UI: show rules without precision properly (#919)

chore(build): build docker images for release branches (#921)

Docs: Updates readme front video (#923)

Docs: Updates weak supervision resources (#920)

feat(rules): compute total & ann. coverage before label selection (#916)

fix(rules): compute annotated coverage when no label properly (#915)

Tutorial: Human-in-the-loop weak supervision with skweak (#869)

UI: include affected #records to overall coverage/ann. coverage metrics (#914)

fix lint build (#913)

UI: manage precision and rules without annotation coverage (#909)

fix(#876): process 400 response detail properly (#889)

feat(rules): allow compute partial query rule metrics (#907)

fix(security): providing default workspace should pass check (#911)

UI: reset filters from define rules view (#908)

UI: Show number of created rules in rules management view (#910)

UI: drop access to rule name field (#904)

fix(rules): prevent lost rules with dataset updates (#892)

fix(datasets): process owner as part of dataset id (#870)

(UI) Rules summary metrics format (#888)

UI: Improve code snippet for empty workspace (#886)

fix(UI): Remove case sensitive when filtering labels (#882)

Docs: Updates Flair zeroshot tutorial (#887)

removing wrong video (#885)

Update readme (#883)

fix(UI) Metrics value by default if no metric (#875)

feat(metrics): add token level metrics for token classification from client (#849)

UI: New rule metrics layout (#861)

chore: expose load_rules from base module (#866)

Docs: Regenerates graphs metrics guide (#865)

updating loss video (#864)

Docs: Update weak supervision guide (#863)

Update README.md (#862)

Fix: Link loss tutorial (#859)

Docs: Improve loss tutorial (#858)

Docs: Improve AL and ws tutorials (#857)

chore(ci): Include component testing configuration (#839)

fix/loss video updated (#853)

Docs: Weak supervision guide update (#855)

chore(app): upgrade lint dependencies (#841)

feat: weak supervision mode (#814)

Docs: Review hf tutorial (#852)

fix: error link to workspace home (#845)

fix(metrics): compute token length for each token (#850)

chore: improve dockerignore files

add streaming (#851)

fix(rules): prevent division by 0 for overall metrics (#848)

small change

[Tutorials] Update media structure, remove TLDR heading (#847)

Updating videos and images for sentiment classification tutorial (#846)

fix(rules): prevent division by zero (#843)

new folder and videos for model loss tutorial (#805)

feat(token class): add metrics at token level (#838)

new folder and images for active learning tutorial (#796)

[Tutorials] Typo fix in find label errors tutorial (#842)

[Tutorials] Add the new find_label_errors tutorial (#833)

[Rule] Modify the client API to the server's weak supervision feature (#840)

[LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)

feat (search): allow to filtering record metrics fields in search (#837)

fix(ui): remove workspace home from code snippet api url (#834)

ui: Hide validate button for binary cases in Text classifier (#830)

fix print message (#829)

feat: Include workspace in url path (#820)

fix(ui): align records and global action layouts #825

fix(ui): Show labels as selected after validate (#826)

feat(labeling rule): implements api endpoint to fetch a single rule (#817)

[LabelErrors] Add find_label_errors method (#775)

fix(ui): Fix styles in Safari (#815)

docs: Add contributors to readme (#822)

add missing rubrix import (#819)

new folder and images for spacy tutorial (#794)

feat(labeling rules): allow edition for rule label and description (#813)

refactor(labeling rules): optional label for rule metrics (#811)

Fix token alignment on CreationTokenClassificationRecord (#812)

feat(server): add overall dataset labeling rules metrics (#807)

feat(labeling rules): add coverage for annotated records (#806)

fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)

new folder and images for zeroshot ner tutorial (#804)

new folder and images for zeroshot data annotation tutorial (#803)

fix(log): check multi-label integrity without search aggregations (#802)

updated images, added folder for fastapi tutorial (#801)

added folder for weak supervision tutorial (#795)

feat(weak supervision): client labeling rules from server (#799)

feat(server): labeling rule metrics (#790)

fix/edit zero-shot tutorial (#774)

fix/edited fastapi tutorial (#773)

Fix/edit ner flair tutorial (#766)

Fix/edit weaksupervision tutorial (#759)

fix(ui): Little changes in fonts (#793)

fix(ui): Allow open dataset in new tab from datasets list (#792)

feat(server): rubrix namespaces for elasticsearch indices (#789)

fix(ui): Show annotation after global validation (#786)

remove reload arg launching server using python (#787)

updated readme with conda install instruction (#788)

fix(ui): Hide scroller component when loading or paginate (#784)

fix(ui): allow remove metadata filter from record metadata modal (#772)

fix(ui): Token Classifier: validate record without annotation or prediction (#782)

Fix/edit active learning tutorial (#760)

Docs:minor changes to loss tutorial (#778)

Fix/edit model loss tutorial (#767)

fix(server): missing deprecated dep (#777)

fix(ui): Global validate for records without annotation or prediction (#746)

Fix/edit spacy tutorial (#758)

Fix/edit labeling tutorial (#750)

fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)

[Requirements] Require python>=3.7 (#770)

[Labeling] Add FlyingSquid label model (#755)

Update README.md (#769)

Adds Flair example to guide (#762)

docs: Updates huggingface examples and adds monitor for Flair (#761)

feat(search): show boolean values in metadata (#753)

feat(server): allow handle labeling rules for datasets from API (#744)

fix(imports): import monitoring with spacy<3.0 fails (#754)

[UI] new fonts families (#751)

fix(scroll): using new scroll component (#710)

fix(ui): filter "validatable" records for global action validate button (#741)

feat(monitor): flair ner auto-monitor (#738)

Full Changelog: https://github.com/recognai/rubrix/compare/v0.7.0...v0.8.0-alpha.0
Source code(tar.gz)
Source code(zip)