✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Recognai

Last update: Jan 2, 2023


Overview

Python framework to explore, label, and monitor data for NLP

Usage example · Get started · Quick links · Docs

Rubrix.mp4

Example: Named Entity Recognition data exploration and annotation with spaCy and the IMDB dataset

What is Rubrix?

Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Key features:

Open: Rubrix is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.
End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Rubrix is designed to close this gap, enabling you to iterate as much as you need.
User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Rubrix optimizes the experience for these core users to make your teams more productive.
Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

Example

Let's see Rubrix in action with a quick example: Bootstrapping data annotation with a zero-shot classifier

Why:

The availability of pre-trained language models with zero-shot capabilities means you can sometimes accelerate your data annotation tasks by pre-annotating your corpus with a pre-trained zero-shot model.
The same workflow can be applied if there is a pre-trained "supervised" model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Ingredients:

A zero-shot classifier from the 🤗 Hub: typeform/distilbert-base-uncased-mnli
A dataset containing news
A set of target categories: Business, Sports, etc.

What are we going to do:

Make predictions and log them into a Rubrix dataset.
Use the Rubrix web app to explore, filter, and annotate some examples.
Load the annotated examples and create a training set, which you can then use to train a supervised classifier.

1. Predict and log

Let's load the zero-shot pipeline and the dataset (we are using the AGNews dataset for demonstration, but this could be your own dataset). Then, let's go over the dataset records and log them using rb.log(). This will create a Rubrix dataset, accessible from the web app.

from transformers import pipeline
from datasets import load_dataset
import rubrix as rb

model = pipeline('zero-shot-classification', model="typeform/distilbert-base-uncased-mnli")

dataset = load_dataset("ag_news", split='test[0:100]')

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

for item in dataset:
    prediction = model(item['text'], labels)

    record = rb.TextClassificationRecord(
        inputs=item["text"],
        prediction=list(zip(prediction['labels'], prediction['scores']))
    )

    rb.log(record, name="news_zeroshot")
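For reference, the zero-shot pipeline returns a dictionary whose `labels` and `scores` lists are ordered from most to least likely, which is why `zip` produces the (label, score) pairs passed as `prediction`. The following is a minimal sketch using a hand-written output; the scores are invented for illustration and `to_prediction` is a hypothetical helper, not part of Rubrix:

```python
def to_prediction(output):
    """Pair each candidate label with its score as (label, score) tuples."""
    return list(zip(output["labels"], output["scores"]))


# Hand-written stand-in for one zero-shot pipeline result (scores are made up).
sample_output = {
    "sequence": "Stocks rallied after the earnings report.",
    "labels": ["Business", "World", "Sci/Tech", "Sports"],
    "scores": [0.91, 0.05, 0.03, 0.01],
}

prediction = to_prediction(sample_output)

# The first pair is the most likely label because the lists are sorted by score.
top_label, top_score = prediction[0]
print(top_label, top_score)
```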

2. Explore, Filter and Label

Now let's access our Rubrix dataset and start annotating data. Let's filter the records predicted as Business with high probability and use the bulk-labeling feature for labeling 15 records as Business:

Zeroshot.Example.mp4
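The filter used in this step (records predicted as Business with high probability) can also be pictured in plain Python. This is only a sketch of the selection logic: the record layout, the example texts and scores, and the 0.8 threshold are all assumptions made for illustration, and the web app applies the filter through its own UI controls.

```python
def filter_by_prediction(records, label, min_score):
    """Keep records whose top-scoring predicted label is `label` with score >= min_score."""
    selected = []
    for record in records:
        # Pick the (label, score) pair with the highest score.
        top_label, top_score = max(record["prediction"], key=lambda pair: pair[1])
        if top_label == label and top_score >= min_score:
            selected.append(record)
    return selected


# Invented records; each prediction is a list of (label, score) pairs.
records = [
    {"text": "Shares climb on strong earnings", "prediction": [("Business", 0.93), ("World", 0.07)]},
    {"text": "Team wins the cup final", "prediction": [("Sports", 0.88), ("Business", 0.12)]},
    {"text": "Markets mixed ahead of vote", "prediction": [("Business", 0.55), ("World", 0.45)]},
]

business = filter_by_prediction(records, "Business", min_score=0.8)
print([r["text"] for r in business])
```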

3. Load and create a training set

After a few iterations of data annotation, we can load the Rubrix dataset and create a training set to train or fine-tune a supervised model.

import pandas as pd

# load the Rubrix dataset as a pandas DataFrame
rb_df = rb.load(name='news_zeroshot')

# filter annotated records
rb_df = rb_df[rb_df.status == "Validated"]

# select text input and the annotated label
train_df = pd.DataFrame({
    "text": rb_df.inputs.transform(lambda r: r["text"]),
    "label": rb_df.annotation,
})
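Most training libraries expect integer class ids rather than label strings, so a natural next step is to build a label-to-id mapping from the validated records. The sketch below uses plain dictionaries instead of a DataFrame so that it runs on its own; the rows and the helper name `build_training_examples` are hypothetical.

```python
def build_training_examples(rows):
    """Return (text, label_id) pairs for validated rows, plus the label-to-id mapping."""
    validated = [row for row in rows if row["status"] == "Validated"]
    # Sort the label names so the mapping is stable across runs.
    label2id = {
        label: index
        for index, label in enumerate(sorted({row["annotation"] for row in validated}))
    }
    examples = [(row["text"], label2id[row["annotation"]]) for row in validated]
    return examples, label2id


# Hand-written stand-ins for loaded records.
rows = [
    {"text": "Shares climb on strong earnings", "annotation": "Business", "status": "Validated"},
    {"text": "Team wins the cup final", "annotation": "Sports", "status": "Validated"},
    {"text": "Unreviewed headline", "annotation": None, "status": "Default"},
]

examples, label2id = build_training_examples(rows)
print(label2id)
print(examples)
```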

Architecture

Rubrix's main components are:

Rubrix Python client: Python client to log, load, copy and delete Rubrix datasets.
Rubrix server: FastAPI REST service for reading and writing data.
Elasticsearch: The storage layer and search engine powering the API and the web app.
Rubrix web app: Easy-to-use web application for data exploration and annotation.

Quick links

Doc	Description
🚶 First steps	New to Rubrix and want to get started?
👩‍🏫 Concepts	Want to know more about Rubrix concepts?
🛠️ Setup and install	How to configure and install Rubrix
🗒️ Tasks	What can you use Rubrix for?
📱 Web app reference	How to use the web-app for data exploration and annotation
🐍 Python client API	How to use the Python classes and methods
👩‍🍳 Rubrix cookbook	How to use Rubrix with your favourite libraries (`flair`, `stanza`...)
👋 Community forum	Ask questions, share feedback, ideas and suggestions
🤗 Hugging Face tutorial	Using Rubrix with 🤗 `transformers` and `datasets`
💫 spaCy tutorial	Using `spaCy` with Rubrix for NER projects
🐠 Weak supervision tutorial	How to leverage weak supervision with `snorkel` & Rubrix
🤔 Active learning tutorial	How to use active learning with `modAL` & Rubrix
🧪 Knowledge graph tutorial	How to use Rubrix with `kglab` & `pytorch_geometric`

Get started

To get started you need to follow three steps:

Install the Python client
Launch the web app
Start logging data

1. Install the Python client

You can install the Python client with pip:

pip install rubrix

2. Launch the web app

There are two ways to launch the web app:

a) Using docker-compose (recommended).
b) Executing the server code manually

a) Using docker-compose (recommended)

Create a folder:

mkdir rubrix && cd rubrix

and launch the docker-contained web app with the following command:

wget -O docker-compose.yml https://git.io/rb-docker && docker-compose up

This is the recommended way because it automatically includes an Elasticsearch instance, Rubrix's main persistence layer.

b) Executing the server code manually

When executing the server code manually you need to provide an Elasticsearch instance yourself.

First you need to install Elasticsearch (we recommend version 7.10) and launch an Elasticsearch instance. For macOS and Windows there are Homebrew formulae and an MSI package, respectively.
Install the Python client together with its server dependencies:

pip install rubrix[server]

Launch a local instance of the web app

python -m rubrix.server

By default, the Rubrix server will look for your Elasticsearch endpoint at http://localhost:9200, but you can customize this by setting the ELASTICSEARCH environment variable.
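The override behaviour described above follows the usual "environment variable with a default" pattern. The sketch below illustrates that pattern only; it is not the server's actual code, `resolve_elasticsearch_url` is a hypothetical helper, and `http://es.internal:9200` is a made-up host.

```python
import os

DEFAULT_ELASTICSEARCH = "http://localhost:9200"


def resolve_elasticsearch_url(environ=None):
    """Return the ELASTICSEARCH value if set, otherwise the local default."""
    environ = os.environ if environ is None else environ
    return environ.get("ELASTICSEARCH", DEFAULT_ELASTICSEARCH)


# Without the variable, the default endpoint is used.
print(resolve_elasticsearch_url({}))

# With the variable set, its value takes precedence.
print(resolve_elasticsearch_url({"ELASTICSEARCH": "http://es.internal:9200"}))
```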

3. Start logging data

The following code will log one record into a dataset called example-dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(inputs="My first Rubrix example"),
    name='example-dataset'
)

If you go to your Rubrix web app at http://localhost:6900/, you should see your first dataset. The default username and password are rubrix and 1234. You can also check the REST API docs at http://localhost:6900/api/docs.

Congratulations! You are ready to start working with Rubrix.

To better understand what's possible, take a look at Rubrix's Cookbook.

Community

As a new open-source project, we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to use the Discussion forum or the Issues and we'll be pleased to help out.

Comments

Add monitoring examples with FastAPI: Hugging Face and spaCy

The idea would be to add a guide (as a Jupyter Notebook) to be included under docs/guides. This Jupyter notebook will showcase the RubrixLogHTTPMiddleware for monitoring the predictions of a FastAPI inference endpoint. Here is the example with Hugging Face + FastAPI:

from fastapi import FastAPI
from typing import List
from transformers import pipeline
from rubrix.client.asgi import RubrixLogHTTPMiddleware

classifier = pipeline("sentiment-analysis", return_all_scores=True)

app = FastAPI()

# define the middleware for logging predictions into a Rubrix Dataset
app.add_middleware(
    RubrixLogHTTPMiddleware,
    api_endpoint="/predict",
    dataset="monitoring_dataset_v1",
    # you could post-process the predict output with a custom record_mapper function
    # record_mapper=custom_text_classification_mapper,
)

# prediction endpoint
@app.post("/predict")
def predict_batch(batch: List[str]):
    predictions = classifier(batch)
    return [
        {
            "labels": [p["label"] for p in prediction],
            "probabilities": [p["score"] for p in prediction],
        }
        for prediction in predictions
    ]
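To see the shape of the JSON this endpoint returns, the same list comprehension can be run against a stubbed classifier output. With `return_all_scores=True`, the pipeline yields one list of label/score dictionaries per input; the scores below are invented, and `format_predictions` is a hypothetical name for the endpoint body.

```python
def format_predictions(predictions):
    """Reshape per-input label/score dicts into parallel label and probability lists."""
    return [
        {
            "labels": [p["label"] for p in prediction],
            "probabilities": [p["score"] for p in prediction],
        }
        for prediction in predictions
    ]


# Stub standing in for classifier(batch) on two inputs (scores are made up).
stub_predictions = [
    [{"label": "NEGATIVE", "score": 0.02}, {"label": "POSITIVE", "score": 0.98}],
    [{"label": "NEGATIVE", "score": 0.87}, {"label": "POSITIVE", "score": 0.13}],
]

response = format_predictions(stub_predictions)
print(response[0])
```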

The steps would be to:

Create a notebook and include the above example
Add an example with a pre-trained transformer TokenClassifier (for example: https://huggingface.co/dslim/bert-base-NER)
Add an example with a spaCy NER pipeline.
(Optionally) Include an example dashboard with Kibana (screenshots, gif or video)
(Optionally) Include an example with ray serve

documentation good first issue help wanted

opened by dvsrepo 19

updated readme with `conda` install instruction
This closes #781.

[x] added conda installation instruction (rubrix is available on conda-forge channel)

[x] added badges:

[x] conda-forge/rubrix (with version)

[x] conda-forge/rubrix (with platform specification): example -- "noarch"

[x] docs badge
opened by sugatoray 14
[NER Fine tuning] content selection
Multi word

Actual state: (VIEW SS) 1- I select various words, highlight is grey and in a solid block (Highlight/words). 2- When selection is done, highlight selection is split and label selector appears.

[x] Should be:

1- I select various words, highlight is grey and split (highlight/word) 2- When selection is done, highlight selection is a solid block and label selector appears.

Delete labelling

[x] Make the whole tooltip clickable to delete

Selection on a searched word

[x] Selection highlight should not be cut (SS)

[x] When selection is containing a search word the label selector does not appear (how it works only on right>left sense)

[x] In general : change appearance of results : in place of Orange highlight show text in bold

Cursor

[x] Active "hand" cursor (pointer) on piece of text already annotated/Predicted

[x] Active "Text Select" cursor on the rest of record

[x] Enlarge the hover state to the whole area: (record + annotated tooltip + empty space between them) (record + predicted tooltip + empty space between them)

New Select label modal

[x] Integrate new UI modal

[x] In case of a unique label, don't show the modal, and just assign the label after selecting text

[x] Add logic to show the last label used first and preselected

[x] Add the following keyboard shortcuts: Enter to validate the preselected label, and vertical arrow keys or a number to validate other labels
opened by Amelie-V 13
Add text2text example (e.g., text summarisation)

Add the text summarisation fine-tuning tutorial similar to sentiment classifier fine-tuning tutorial:

https://rubrix.readthedocs.io/en/stable/tutorials/06-labeling-finetuning.html#3.-Fine-tune-the-pre-trained-mode
documentation good first issue help wanted

opened by frascuchon 13
fix: Compute predicted properly for token classification [NEEDS_DATA_UPGRADE]

This PR fixes the way predicted ok/ko info is computed for token classification records.

To apply this fix to already created datasets, you must first re-log records. Otherwise, stored info won't be updated.

Closes #1955

opened by frascuchon 12
[Workspaces] Users without personal datasets

Should users without personal datasets, but who belong to one or more workspaces that have datasets, automatically switch to one of those workspaces?

Better to show all datasets from all workspaces in datasets list allowing to filter by workspace?
question app

opened by frascuchon 11
[Text Class] Optimize Long records view *Prioritary*
[x] Show labels buttons area above the fold.

[x] Create Action to open/close on click the full record in the same view

[x] Copy "Show full record" "Show less"

[ ] I would grab the opportunity to update the "View more" "view less" on Metrics modal to "Show more" "Show less" and apply the same style there

enhancement
opened by Amelie-V 11
[Search] Improve and normalize the search data model
Things to keep in mind:

Normalize text inputs fields: text, inputs, words must be normalized and use a common pattern for all tasks

Several es analyzers for text fields: standard and whitespace(?) for fine tuning searches. Default as standard

What about text fields in metadata? For now, only terms queries are supported. This means that metadata fields with large content cannot be queried with full-text search.

Created indices should contain mapping info only for its fields. A text classification index should not include mapping info for tokens or text predicted (text2text).

Review filter fields and align with UI names (if any)

What about nested fields? like token or metrics info for token classification, or label and its score for text classification. As default, query string dsl does not support nested queries, but it could be nice include some minimal support for that kind of queries.

@dvsrepo @dcfidalgo Anything to include here?

Tasks

To get this work done, we need to tackle the following tasks (which will be created as separate issues and linked here)

[Datasets] Avoid using global template for all indices

[Datasets] Dataset migration mechanisms for each release

[Datasets] New es document model per task with backward compatibility fields

[Datasets] Apply migration to new es doc model

[Datasets] Build searches and aggregations using new doc model

enhancement server
opened by frascuchon 11
Devise workflow to test the tutorials via a github action

The idea here is to devise a workflow to test our tutorials in a semi-automatic way. Ideally, we have a workflow that we can launch manually and let's say every two weeks or so, to test our tutorials. Maybe we can use nbmake for this and follow this blogpost. The tricky part is that for some tutorials we need to change/add/delete a few cells to be able to run them in an automated way ...
documentation good first issue help wanted

opened by dcfidalgo 10
[Weak supervision] Rules numbers by label
For instance:

Sci/tech 2 Sports 1 Business 4 Politics 0 World 0

This feature could be used for two things:

Help to know how the rule definition is going

See the full label list (in "define rules" we don't have this list by default)
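A rough sketch of how such a per-label count could be computed, including labels that have no rules yet. The rule queries are invented, each rule is represented here as a simple (query, label) pair, and `rules_per_label` is a hypothetical helper; the counts match the example above.

```python
from collections import Counter


def rules_per_label(rules, all_labels):
    """Count rules per label, reporting 0 for labels without any rule."""
    counts = Counter(label for _query, label in rules)
    return {label: counts.get(label, 0) for label in all_labels}


# Invented (query, label) pairs standing in for defined rules.
rules = [
    ("nasa", "Sci/tech"),
    ("software", "Sci/tech"),
    ("football", "Sports"),
    ("shares", "Business"),
    ("stocks", "Business"),
    ("profit", "Business"),
    ("merger", "Business"),
]
all_labels = ["Sci/tech", "Sports", "Business", "Politics", "World"]

summary = rules_per_label(rules, all_labels)
print(summary)
```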

ui
opened by Amelie-V 9
Any plan to support no-whitespace language?

I am planning to use rubrix for Japanese text data. The search functionality doesn't seem to work well on this language. I think it's better if we can customize the tokenizer used in elasticsearch instead of hardcoded "whitespace" tokenizer.
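The problem can be illustrated without Elasticsearch. In the sketch below, Python's `str.split()` stands in for a whitespace tokenizer (an approximation, not the actual Elasticsearch analyzer): an English sentence splits into words, while a Japanese sentence, which contains no spaces, stays a single token and therefore cannot be matched word by word.

```python
english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # "I live in Tokyo", written without spaces

# Whitespace splitting recovers the English words...
english_tokens = english.split()

# ...but leaves the Japanese sentence as one unsplit token.
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```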

opened by faisalron 9
use a default vector for vector search like `TF-IDF`

Is your feature request related to a problem? Please describe. I do not want to set up anything for vector search but I do want to use it.

Describe the solution you'd like I would like to see a very straightforward model-agnostic way of using the feature without any specific implementation. DatasetSettings.vectorsearch_tf_idf = True.

Describe alternatives you've considered N.A.

Additional context N.A.
enhancement

opened by davidberenstein1957 0
chore(deps-dev): update fastapi requirement from <0.89,>=0.75 to >=0.75,<0.90
Updates the requirements on fastapi to permit the latest version.

Release notes

Sourced from fastapi's releases.

0.89.0

Features

✨ Add support for function return type annotations to declare the response_model. Initial PR #1436 by @uriyyo.

Now you can declare the return type / response_model in the function return type annotation:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Item(BaseModel):
    name: str
    price: float


@app.get("/items/")
async def read_items() -> list[Item]:
    return [
        Item(name="Portal Gun", price=42.0),
        Item(name="Plumbus", price=32.0),
    ]

FastAPI will use the return type annotation to perform:

Data validation

Automatic documentation

It could power automatic client generators

Data filtering

Before this version it was only supported via the response_model parameter.

Read more about it in the new docs: Response Model - Return Type.

Docs

📝 Add External Link: Authorization on FastAPI with Casbin. PR #5712 by @Xhy-5000.

✏ Fix typo in docs/en/docs/async.md. PR #5785 by @Kingdageek.

✏ Fix typo in docs/en/docs/deployment/concepts.md. PR #5824 by @kelbyfaessler.

Translations

🌐 Add Russian translation for docs/ru/docs/fastapi-people.md. PR #5577 by @Xewus.

🌐 Fix typo in Chinese translation for docs/zh/docs/benchmarks.md. PR #4269 by @15027668g.

🌐 Add Korean translation for docs/tutorial/cors.md. PR #3764 by @NinaHwang.

... (truncated)

Commits

69bd7d8 🔖 Release version 0.89.0

a6af7c2 📝 Update release notes

aa6a8e5 📝 Update release notes

c482dd3 ⬆ Update coverage[toml] requirement from <7.0,>=6.5.0 to >=6.5.0,<8.0 (#5801)

681e5c0 📝 Update release notes

eb39b0f 📝 Update release notes

27ce2e2 📝 Add External Link: Authorization on FastAPI with Casbin (#5712)

f56b0d5 ⬆ Update uvicorn[standard] requirement from <0.19.0,>=0.12.0 to >=0.12.0,<0.2...

5c6d7b2 📝 Update release notes

78813a5 ✏ Fix typo in docs/en/docs/async.md (#5785)

Additional commits viewable in compare view


dependencies
opened by dependabot[bot] 0
add repr method for Rule, Dataset.
Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.

Closes #2046

Type of change

Please delete options that are not relevant.

[ ] New feature (non-breaking change which adds functionality)

How Has This Been Tested

Please describe the tests that you ran to verify your changes. And ideally reference tests.

import argilla as rg
from argilla.labeling.text_classification.rule import Rule

plz = Rule(query="plz OR please", label="SPAM")
print(repr(plz))
>>> Rule(query='plz OR please', label='SPAM', name='plz OR please')

records = [
    rg.TextClassificationRecord(text="example"),
    rg.TextClassificationRecord(text="another example"),
    rg.TextClassificationRecord(text="another example another example another example another example another example another example"),
]
dataset = rg.DatasetForTextClassification(records=records)
print(dataset)
>>>
                             text annotation prediction
0                         example       None       None
1                 another example       None       None
2  another example another exampl       None       None
...
3 TextClassificationRecord records

Checklist

[x] I have merged the original branch into my forked branch

[x] I added relevant documentation

[x] follows the style guidelines of this project

[x] I did a self-review of my code

[x] I added comments to my code

[x] I made corresponding changes to the documentation

[x] My changes generate no new warnings

[x] I have added tests that prove my fix is effective or that my feature works
opened by Ankush-Chander 1
feat(Client): RecordTextClassification pass only the necessary data instead of all the dataset
Description

ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the RecordTextClassification component

WARNING: to be merged after #2145 and #2143 have been merged

Closes #(issue_number)

Type of change

Please delete options that are not relevant.

[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested

Please describe the tests that you ran to verify your changes. And ideally reference tests.

[x] multilabel

[x] singlelabel

Checklist

[x] I have merged the original branch into my forked branch

[x] I added relevant documentation

[x] follows the style guidelines of this project

[x] I did a self-review of my code

[ ] I added comments to my code

[ ] I made corresponding changes to the documentation

[x] My changes generate no new warnings

[ ] I have added tests that prove my fix is effective or that my feature works
opened by keithCuniah 0
feat(Client): ClassifierExplorationArea.vue pass only the necessary data
Description

ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the ClassifierExplorationArea.vue

Type of change

Please delete options that are not relevant.

[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested

Please describe the tests that you ran to verify your changes. And ideally reference tests.

[x] multilabel

[x] singlelabel

Checklist

[x] I have merged the original branch into my forked branch

[x] I added relevant documentation

[x] follows the style guidelines of this project

[x] I did a self-review of my code

[ ] I added comments to my code

[ ] I made corresponding changes to the documentation

[x] My changes generate no new warnings

[ ] I have added tests that prove my fix is effective or that my feature works
opened by keithCuniah 0

Releases(v1.1.1)

v1.1.1(Nov 29, 2022)
1.1.1 (2022-11-29)

Bug Fixes

Set proper telemetry version (#1988) (d302891)

Documentation

Fix metric function imports in the example (#1966) (a1f6f6e), closes #1962

Source code(tar.gz)
Source code(zip)
v1.1.0(Nov 24, 2022)
1.1.0 (2022-11-24)

Highlights

Add, update, and delete rules from a Dataset using the Python client

You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

import pandas as pd
from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create rules
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)

You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels
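The snippet above reads the `query` and `label` columns of a CSV file. The sketch below shows what such a file might contain and how it could be parsed with only the standard library; the rows are invented, and `read_rule_specs` is a hypothetical helper. Each resulting pair could then be passed to `Rule(query, label)` as in the snippet.

```python
import csv
import io

# Assumed file layout: one rule per row, with "query" and "label" columns.
csv_text = """query,label
plz OR please,SPAM
subscribe,SPAM
great video,HAM
"""


def read_rule_specs(handle):
    """Return a list of (query, label) pairs read from a CSV file handle."""
    return [(row["query"], row["label"]) for row in csv.DictReader(handle)]


rule_specs = read_rule_specs(io.StringIO(csv_text))
for query, label in rule_specs:
    print(f"{label}: {query}")
```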

Sort by timestamp fields in the UI

Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes

Features

#1929 add warning about using wrong hostnames (#1930) (a3bc554)

Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855

Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918

Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835

Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)

Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761

Update error page (#1932) (caeb7d4), closes #1894

Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

Bug Fixes

Avoid closing the score filter when dragging the slider (#1822) (91a72c5), closes #1802

Change method for Doc creation by spacy.Language (#1891) (6264983), closes #1890

DAO: datasets dao filter datasets by tasks (#1934) (937b410)

docker: Prevent wrong elastic server for wait-for-it (c6a10c7)

Improve access to label list in Text Classification (#1916) (24729bd), closes #1804

Improve explanation readability (#1815) (52c712e), closes #1774

Monitoring: Serializable log middleware (#1908) (53a57f7)

Move "Show less" button to the end of entities list (#1875) (6d796a4), closes #1779

Remove "Help explain button" in Manage rule view (#1909) (8bc70b0), closes #1807

Remove extra html when text is not highlighted (#1904) (7858dc5), closes #1758

Remove extra type when highlighting the query in the text (#1863) (341c581), closes #1758

Documentation

change iframe for mp4 (dfac8b2)

corrected for iframe (935f586)

Link key features (#1805) (#1809) (4c83604)

resolved miss-direction and old naming in README.md (f45fe1e)

Update README links linkedin and twitter (#1797) (2d4d03a)

As always, thanks to our amazing contributors!

docs: Link key features (#1805) (#1809) by @chschroeder

View Docs link in frontend header users.vue (#1915) by @bengsoon

fix: Change method for Doc creation by spacy.Language (#1891) by @jamnicki

v1.0.1(Nov 4, 2022)
1.0.1 (2022-11-04)

Bug Fixes

Remove the extra letter "y" (#1814) (f3d5d2e), closes #1811

Update vue-virtual-scroller dependency version (#1813) (147dc8d), closes #1806 #1782 #1816

Documentation

corrected for tutorial and api redirections (#1820) (26ccdcc)

v0.19.0(Oct 24, 2022)

v0.18.0(Oct 5, 2022)
0.18.0 (2022-10-05)

⚡ Highlights

Better validation of token classification records

When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens. Before this release, it was difficult to understand and fix these errors because validation happened on the server side.

With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.

For example, the following record:

import rubrix as rb

rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC", 7, 13)]
)

Will give you the following error message:

ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']
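The idea behind this check can be sketched in a few lines: locate each token in the text, then require every entity span to start at a token start and end at a token end. This is a simplified illustration under those assumptions, not Rubrix's actual validation code, and both helper names are hypothetical.

```python
def token_offsets(text, tokens):
    """Return the (start, end) character offsets of each token, searching left to right."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def misaligned_spans(text, tokens, spans):
    """Return the (label, start, end) spans that do not fall on token boundaries."""
    offsets = token_offsets(text, tokens)
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return [
        (label, start, end)
        for label, start, end in spans
        if start not in starts or end not in ends
    ]


text = "I love Paris!"
tokens = ["I", "love", "Paris"]

# (7, 13) covers "Paris!", but the last token ends at offset 12.
print(misaligned_spans(text, tokens, [("LOC", 7, 13)]))

# (7, 12) covers exactly the token "Paris".
print(misaligned_spans(text, tokens, [("LOC", 7, 12)]))
```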

Delete records by query

Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for clean up and better dataset maintenance:

import rubrix as rb

## Delete by id
rb.delete_records(name="example-dataset", ids=[1,3,5])

## Discard records by query
rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)

New tutorials

We have two new tutorials!

Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html

Features

API: provide a dict for record annotations/predictions (#1658) (12b0f83)

Client: expose client extra headers in init function (#1715) (79f0529), closes #1706

Client: improve httpx errors handling (#1662) (85da336)

Client: validate token classification annotations in client (#1709) (936d1ca), closes #1579

Datasets: delete records by query (#1721) (bc9685d), closes #1714 #1737

Datasets: restrict dataset deletion only to creators and super-users (#1713) (c1bef9d), closes #1740

Server: Add server telemetry (#1687) (d7cc006)

Bug Fixes

'MajorityVoter.score' when using multi-labels (#1678) (0b94c86), closes #1628

Metadata limits: exclude subfields from mappings (#1700) (9f9650e), closes #1699

Normalizes the UnauthorizationError for the API response (#1748) (6a68048)

Search tag reset prior annotation (#1736) (dc0a17f), closes #1711

Visual enhancements

Align App UI with the design system (#1672) (67d6de8), closes #1670

Documentation

Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729

Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741

fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693

raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693

Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)

using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

As always, thanks to our amazing contributors!

refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander

feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander

fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder

refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia

docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen

refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen

v0.17.0(Aug 22, 2022)
0.17.0 (2022-08-22)

⚡ Highlights

Preparing a training set in the spaCy DocBin format

prepare_for_training is a method that prepares a dataset for training. Until now, it prepared the data for easily training Hugging Face Transformers.

Now, you can prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct !

With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.

import spacy
import rubrix as rb

# Load annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Load a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the file will be stored on disk
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")

You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

Load large datasets using batches

Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, you can iterate over the whole dataset in batches by using the id_from parameter of the rb.load method.

An example of reading the first 1000 records and the next batch of up to 1000 records:

    import rubrix as rb

    dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
    dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)

The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
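The two-batch example generalizes to a loop over the whole dataset. The following is a minimal sketch, not part of Rubrix: iter_batches is a hypothetical helper, and it assumes that the loader returns records strictly after the given id_from and that a batch shorter than the limit means the dataset is exhausted. The loader is passed in as an argument so the helper can be tried without a running server.

```python
from typing import Any, Callable, Iterator, List, Optional


def iter_batches(
    load: Callable[..., List[Any]],
    name: str,
    batch_size: int = 1000,
) -> Iterator[List[Any]]:
    """Yield batches of records until the dataset is exhausted.

    `load` is expected to behave like `rb.load`: it accepts `name`, `limit`,
    and `id_from`, and returns a sequence of records that expose an `id`.
    """
    last_id: Optional[Any] = None
    while True:
        kwargs = {"name": name, "limit": batch_size}
        if last_id is not None:
            # Assumption: records are returned strictly after `id_from`.
            kwargs["id_from"] = last_id
        batch = load(**kwargs)
        if len(batch) == 0:
            return
        yield batch
        if len(batch) < batch_size:
            # A short batch means there is nothing left to fetch.
            return
        last_id = batch[-1].id


# Usage (requires a running Rubrix server):
# import rubrix as rb
# for batch in iter_batches(rb.load, "example-dataset", batch_size=1000):
#     ...  # process one batch at a time
```

Because the helper is a generator, only one batch is held in memory at a time, which is the point of loading in batches.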

Larger pagination sizes for faster bulk review and annotation

By combining filters and search, some users can narrow down and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk annotate 50 or 100 records on each page.

Copy record text to clipboard

Sometimes it is useful to copy the text of a record to inspect it or process it with another application. Now this is possible, thanks to the feature request by our great community member and contributor @Ankush-Chander!

Better error logging for generic errors

Thanks to the work done by @Ankush-Chander and @frascuchon, we now have more meaningful messages for generic server errors!

Features

Add new pagination size ranges (#1667) (5b4f1f2), closes #1578

Allow rb.load fetch records in batches passing the from_id argument (3e6344a)

Copy to clipboard the record text (#1625) (d634a7b), closes #1616

Error Logging: send error detail in response for generic server errors (#1648) (ad17631)

Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622

prepare_for_training supports spacy (#1635) (8587808)

Bug Fixes

Client: reusing the inner httpx client (#1640) (854a972), closes #1646

docker-compose.yaml: default volume and disable disk threshold (#1656) (05ae688), closes #1275

Encode rule name in Weak Labeling API requests (#1649) (4634df8), closes #1645

handle stream api connection errors gracefully (#1636) (a106ec4), closes #1559

Update progress bar when refreshing after adding new records (#1666) (7e0d915), closes #1590

Documentation

Add Slack support link in README's get started (#1688) (bef010c)

Adding Elasticsearch persistence to docker compose section (#1643) (ecdc854)

spacy DocBin cookbook (#1642) (bb98278), closes #420

Visual enhancements

Small visual adjustments for Text2Text record card (#1632) (9c87cf1), closes #1138

Improve card spacing (#1638) (fd4016a), closes #1624

You can see all work included in the release here

fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre

chore: configure miniconda for readthedocs builder by @frascuchon

style: Small visual adjustments for Text2Text record card (#1632) by @leiyre

feat: Copy to clipboard the record text (#1625) by @leiyre

docs: Add Slack support link in README's get started (#1688) by @dvsrepo

chore: update version by @frascuchon

feat: Add new pagination size ranges (#1667) by @leiyre

fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander

feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras

fix(Client): reusing the inner httpx client (#1640) by @frascuchon

feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon

docs: spacy DocBin cookbook (#1642) by @ignacioct

feat: prepare_for_training supports spacy (#1635) by @frascuchon

style: Improve card spacing (#1638) by @leiyre

docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras

chore: remove old rubrix client class (#1639) by @frascuchon

feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon

doc: show metric graphs in documentation (#1669) by @leiyre

fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon

fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre

v0.16.1(Jul 22, 2022)
0.16.1 (2022-07-22)

Bug Fixes

'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631

Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623

Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

Documentation

Fix typo in Getting Started -> Concepts (#1618) (b236cb8), closes #1617

You can see all work included in the release here

fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo

fix: Display metadata in Text2Text dataset (#1626) by @leiyre

chore: set version by @dcfidalgo

docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo

fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre

v0.16.0(Jul 8, 2022)
0.16.0 (2022-07-08)

Highlights

👂 Listeners: enable more interactive workflows between client and server

Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

    from rubrix.listeners import listener
    from sklearn.metrics import accuracy_score

    # Define some helper variables
    LABEL2INT = trec["train"].features["label-coarse"].str2int
    ACCURACIES = []

    # Set up the active learning loop with the listener decorator
    @listener(
        dataset=DATASET_NAME,
        query="status:Validated AND metadata.batch_id:{batch_id}",
        condition=lambda search: search.total==NUM_SAMPLES,
        execution_interval_in_seconds=3,
        batch_id=0
    )
    def active_learning_loop(records, ctx):

        # 1. Update active learner
        print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
        y = np.array([LABEL2INT(rec.annotation) for rec in records])

        # initial update
        if ctx.query_params["batch_id"] == 0:
            indices = np.array([rec.id for rec in records])
            active_learner.initialize_data(indices, y)
        # update with the prior queried indices
        else:
            active_learner.update(y)
        print("Done!")

        # 2. Query active learner
        print("Querying new data points ...")
        queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
        ctx.query_params["batch_id"] += 1
        new_records = [
            rb.TextClassificationRecord(
                text=trec["train"]["text"][idx],
                metadata={"batch_id": ctx.query_params["batch_id"]},
                id=idx,
            )
            for idx in queried_indices
        ]

        # 3. Log the batch to Rubrix
        rb.log(new_records, DATASET_NAME)

        # 4. Evaluate current classifier on the test set
        print("Evaluating current classifier ...")
        accuracy = accuracy_score(
            dataset_test.y,
            active_learner.classifier.predict(dataset_test),
        )
        ACCURACIES.append(accuracy)
        print("Done!")

        print("Waiting for annotations ...")

📖 New docs!

https://rubrix.readthedocs.io/

🧱 extend_matrix: Weak label augmentation using embeddings

This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

Features

#1561: standardize icons (#1565) (15254e7), closes #1561

#1602: new rubrix dataset listeners (#1507, #1586, #1583, #1596) (65747ab), closes #1602

Add 'extend_matrix' to the WeakMultiLabel class (#1577) (cf89311)

Improve from datasets (#1567) (2b0d607)

token-class: adjust token spans spaces (#1599) (0fb3576)

Bug Fixes

#1264: discard first space after a token (#1591) (eff0ac5), closes #1264

#1545: highlight words with accents (#1550) (c42e77b), closes #1545

#1548: access datasets for superusers when workspace is not provided (#1572, #1608) (0b04bc8), closes #1548

#1551: don't show error traces for EntityNotFoundError's (#1569) (04e101c), closes #1551

#1557: allow text editing when clicking the "edit" button (#1558) (e751414), closes #1557

#1574: search highlighting for a single dot (#1592) (53474a1), closes #1574

#1575: show predicted ok/ko in Text Classifier explore mode (#1576) (ada87c0), closes #1575

compatibility with new dataset version (#1566) (ac26e30)

Documentation

#1512: change theme to furo (#1564, #1604) (98869d2), closes #1512

add 'how to prepare your data for training' to basics (#1589) (a21bcf3)

add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421

Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)

add pip version and dockertag as parameter in the build process (#1560) (73a31e2)

You can see all work included in the release here

chore(docs): remove by @frascuchon

docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo

docs(#1512): change theme to furo (#1564, #1604) by @frascuchon

chore: set version by @frascuchon

feat(token-class): adjust token spans spaces (#1599) by @frascuchon

feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon

docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo

test: configure numpy to disable multi threading (#1593) by @frascuchon

docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo

feat(#1561): standardize icons (#1565) by @leiyre

Feat: Improve from datasets (#1567) by @dcfidalgo

feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo

docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon

refactor: remove words references in searches (#1571) by @frascuchon

ci: check conda env cache (#1570) by @frascuchon

fix(#1264): discard first space after a token (#1591) by @frascuchon

ci(package): regenerate view snapshot (#1600) by @frascuchon

fix(#1574): search highlighting for a single dot (#1592) by @leiyre

fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre

fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon

fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon

fix: compatibility with new dataset version (#1566) by @dcfidalgo

fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre

fix(#1545): highlight words with accents (#1550) by @leiyre

v0.15.0(Jun 8, 2022)
0.15.0 (2022-06-08)

🔆 Highlights

🏷️ Configure datasets with a labeling scheme

You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

    import rubrix as rb

    # Define labeling schema
    settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])

    # Apply settings to a new or already existing dataset
    rb.configure_dataset(name="my_dataset", settings=settings)

    # Logging to the newly created dataset triggers the validation checks
    rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
    # BadRequestApiError: Rubrix server returned an error with http status: 400

Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
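Since the server rejects records whose labels fall outside the schema, it can be convenient to check a batch on the client before logging it. The sketch below is only an illustration: split_by_schema is a hypothetical helper that is not part of Rubrix, it inspects annotations only, and the server-side validation remains the authoritative check.

```python
from typing import Any, Iterable, List, Sequence, Tuple


def split_by_schema(
    records: Iterable[Any],
    label_schema: Sequence[str],
) -> Tuple[List[Any], List[Any]]:
    """Split records into (valid, invalid) according to a fixed label set.

    Each record only needs an `annotation` attribute holding None, a single
    label, or a list of labels (multi-label case).
    """
    allowed = set(label_schema)
    valid: List[Any] = []
    invalid: List[Any] = []
    for record in records:
        annotation = getattr(record, "annotation", None)
        if annotation is None:
            labels = []
        elif isinstance(annotation, str):
            labels = [annotation]
        else:
            labels = list(annotation)
        # A record is valid when every one of its labels is in the schema.
        if all(label in allowed for label in labels):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid


# Usage (hypothetical): log only the records that match the schema.
# valid, invalid = split_by_schema(records, ["A", "B", "C"])
# rb.log(valid, "my_dataset")
```

Records without an annotation are treated as valid here, because there is no label to compare against the schema.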

🧱 Weak label matrix augmentation using embeddings

You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
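The idea behind the augmentation can be sketched in a few lines. This is a simplified illustration of the general technique, not Rubrix's implementation: for every record on which a rule abstains, look up the most similar record on which that rule did vote, and copy its vote if the cosine similarity reaches a per-rule threshold. The function name, the ABSTAIN = -1 convention, and the pure-Python cosine similarity are assumptions made for the example.

```python
import math
from typing import List, Sequence

ABSTAIN = -1  # assumed integer code for "the rule did not vote"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_weak_label_matrix(
    matrix: Sequence[Sequence[int]],
    embeddings: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> List[List[int]]:
    """Return a copy of `matrix` (records x rules) with abstentions filled in
    from the nearest record on which the same rule voted."""
    extended = [list(row) for row in matrix]
    n_rules = len(matrix[0]) if matrix else 0
    for rule in range(n_rules):
        # Records on which this rule originally voted.
        voted = [i for i, row in enumerate(matrix) if row[rule] != ABSTAIN]
        if not voted:
            continue
        for i, row in enumerate(matrix):
            if row[rule] != ABSTAIN:
                continue
            # Nearest neighbour among the records the rule voted on.
            best = max(voted, key=lambda j: cosine(embeddings[i], embeddings[j]))
            if cosine(embeddings[i], embeddings[best]) >= thresholds[rule]:
                extended[i][rule] = matrix[best][rule]
    return extended
```

Lower thresholds extend a rule to more records and raise its coverage, at the risk of propagating its vote to records that are not actually similar enough.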

🏛️ Tutorial Gallery

Tutorials are now organized into different categories and with a new gallery design!

Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

🏁 Basics guide

This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.

Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

Features

#1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134

#1432: configure datasets with a label schema (21e48c0), closes #1432

#1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446

#1460: include text hyphenation (#1469) (ec23b2d), closes #1460

#1463: change icon position in table header (#1473) (5172324), closes #1463

#1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467

configuration: add elasticsearch ca_cert path variable (#1502) (f0eda12)

UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435

Bug Fixes

#1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522

#1527: check agents instead labels for predicted computation (#1528) (2f2ee2e), closes #1527

#1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532

#1533: restrict highlighted fields (3a8b8a9), closes #1533

#1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)

#1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539

metrics: compute f1 for text classification (#1530) (147d38a)

search: highlight only textual input fields (8b83a82), closes #1538 #1544

New contributors

@RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413
v0.14.2(May 31, 2022)
0.14.2 (2022-05-31)

Bug Fixes

#1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514

#1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516

#1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517

#1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518

UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)

v0.14.1(May 20, 2022)
0.14.1 (2022-05-20)

Bug Fixes

#1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447

#1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472

#1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)

#1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503

documentation: fix user management guide (#1511) (63f7bee), closes #1501

filters: sort filter values by count (#1488) (0987167), closes #1484

v0.14.0(May 10, 2022)
0.14.0 (2022-05-10)

Async version of rb.log

You can now use the parameter background in the rb.log method to log records without blocking the main process. The main use case is monitoring production pipelines to do prediction monitoring. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):

    from bentoml import BentoService, api, artifacts, env
    from bentoml.adapters import JsonInput
    from bentoml.frameworks.spacy import SpacyModelArtifact

    import rubrix as rb
    import spacy

    nlp = spacy.load("en_core_web_sm")


    @env(infer_pip_packages=True)
    @artifacts([SpacyModelArtifact("nlp")])
    class SpacyNERService(BentoService):

        @api(input=JsonInput(), batch=True)
        def predict(self, parsed_json_list):
            result, rb_records = ([], [])
            for index, parsed_json in enumerate(parsed_json_list):
                doc = self.artifacts.nlp(parsed_json["text"])
                prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
                rb_records.append(
                    rb.TokenClassificationRecord(
                        text=doc.text,
                        tokens=[t.text for t in doc],
                        prediction=[
                            (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                        ],
                    )
                )
                result.append(prediction)

            rb.log(
                name="monitor-for-spacy-ner",
                records=rb_records,
                tags={"framework": "bentoml"},
                background=True,
                verbose=False
            )  # By using the background=True, the model latency won't be affected

            return result
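To see why background logging keeps latency low, it helps to look at the general pattern of handing a blocking call to a worker and returning immediately. The sketch below uses a thread pool from the standard library and is only an illustration of that pattern under our own assumptions; it is not how rb.log is implemented, and log_in_background is a hypothetical name.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# A single shared worker runs submitted calls one after another.
_executor = ThreadPoolExecutor(max_workers=1)


def log_in_background(log: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
    """Schedule `log(*args, **kwargs)` on a worker thread and return at once.

    The returned Future can be ignored on the hot path, or inspected later
    with `.result()` to surface errors raised by the logging call.
    """
    return _executor.submit(log, *args, **kwargs)


# Usage (hypothetical, for a logging function without a background option):
# future = log_in_background(some_blocking_log, records, name="my-dataset")
# ...return the model response right away...
# future.result()  # optional: wait and re-raise any logging error
```

With rb.log itself, passing background=True as in the example above is the simpler choice.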

Confidence scores in Token Classification (NER)

To store entity predictions you can attach a score using the last position of the entity tuple (label, char_start, char_end, score). Let's see an example:

    import rubrix as rb

    text = "Rubrix is a data science tool"
    record = rb.TokenClassificationRecord(
        text=text,
        tokens=text.split(" "),
        prediction=[("PRODUCT", 0, 6, 0.99)]
    )

    rb.log(record, "ner_with_scores")

Then, in the web application, you and your team can use the score filter to find potentially problematic entities.

If you want to see this in action, check this blog post by David Berenstein:

https://www.rubrix.ml/blog/concise-concepts-rubrix/
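The same kind of check can be run in Python before logging. The following sketch assumes the entity tuple layout described above, (label, char_start, char_end) with an optional trailing score; entities_below is a hypothetical helper and not part of the Rubrix API.

```python
from typing import List, Sequence, Tuple


def entities_below(
    prediction: Sequence[tuple],
    text: str,
    max_score: float,
) -> List[Tuple[str, str, float]]:
    """Return (label, surface_text, score) for scored entities under `max_score`.

    Entities without a score (3-tuples, or a score of None) are skipped.
    """
    flagged = []
    for entity in prediction:
        if len(entity) < 4 or entity[3] is None:
            continue  # no score attached, nothing to compare
        label, start, end, score = entity
        if score < max_score:
            flagged.append((label, text[start:end], score))
    return flagged


text = "Rubrix is a data science tool"
prediction = [("PRODUCT", 0, 6, 0.99), ("FIELD", 12, 24, 0.41), ("MISC", 25, 29)]
low_confidence = entities_below(prediction, text, max_score=0.5)
# low_confidence == [("FIELD", "data science", 0.41)]
```

The second and third entities are made up for the example; only the first one appears in the release notes.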

Rule metrics sidebar

We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.

This sidebar should help you quickly understand your progress.

See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html

Features

#1132: introduce async/background version of rb.log (#1391) (900307e), closes #1132

#1247: label models predict method returns DatasetForTextClassification (#1442) (42ca1be), closes #1247

#1379: show prediction score in NER (#1389) (0bdccd2), closes #1379 #1451

#961: rules metrics in sidebar (#1377) (261f53a), closes #961 #1408

home: improve table actions and styles (#1384) (f09746e), closes #1355 #1333

Bug Fixes

#1407: fix visualization in 1024px viewport (#1420) (46f8d4d), closes #1441

#1458: token classifier visualization in Safari (#1459) (01cc492), closes #1458

v0.13.3(Apr 27, 2022)
0.13.3 (2022-04-27)

Bug Fixes

#1248: allow multiple label attributions in UI (#1424) (a9f8363), closes #1248

#1409: filtering by metadata with value list (#1415) (7aca061), closes #1409

#1410: apply dataset name pattern to user name (#1411) (2087c21), closes #1410

#1428: support cleanlab v2 (#1436) (d189ddb), closes #1428

TokenClassification: display characters between tokens words (#1418) (a08cd7b), closes #1414 #1383

v0.13.2(Apr 12, 2022)
0.13.2 (2022-04-12)

Bug Fixes

#1265: persist pagination size after query (#1358) (49ca243), closes #1265

#1367: remove record text from metadata modal (#1385) (1782724), closes #1367

#1368: long list of entities in Token Classifier (#1388) (829269f), closes #1368 #1393

#1387: improve metadata distinct values computation (be9f68f), closes #1387

install: remove loguru dependency (#1372) (9e52414), closes #1331 #1305

search: compute dataset schema properly for advanced query dsl (#1380) (f71ab91)

visualization: force break word in selectors (#1406) (5ac1950)

v0.13.1(Apr 1, 2022)
0.13.1 (2022-04-01)

Bug Fixes

#1244: compute capitalness based on python methods (#1359 #1371) (218f099), closes #1244

#1362: using active api method instead instance (#1363) (bcf446d), closes #1362

#1365: create rules with regex queries (#1369) (c2afc9c), closes #1365

v0.13.0(Mar 30, 2022)
0.13.0 (2022-03-30)

🗂 Multilabel weak supervision

You can now build multilabel text classification datasets using query-based rules.

If you want to get started, check out this tutorial.

https://user-images.githubusercontent.com/1107111/160930404-7b909f1e-b871-4e4c-b1c8-ea9eabfcad21.mp4

🤗 Reading Hugging Face datasets from the Hub

You can now read ANY text classification, NER, or text2text dataset directly from the Hub and load it into Rubrix.

To understand how Rubrix datasets work check out this guide.

👥 Redesigned team workspaces

Organizing teams and datasets is a key Rubrix feature. After several rounds of feedback with early users, we've completely redesigned the user experience. Let us know what you think.

You can get started and configure users and workspaces by following this guide.

🔎 Guide for the query language and model

We have included a new in-depth guide about the Lucene-based query language and data model used for search, weak labeling, loading subsets of data, and metrics.

Features

#1119: users without personal datasets (#1282) (555d41d), closes #1119 #1318 #1317 #1323 #1324

#1130: cleanup rb namespace by refactoring client API (#1160) (a0fdd8e), closes #1130

#1144: weak supervision for multilabel datasets (#1166) (fd95bae), closes #1144 #1190 #1237 #1233 #1326

datasets: simplify load flow from hf datasets with no rb format (#1234) (a6da1cd), closes #1327

#1180: show Rubrix version in the webapp (#1243) (8c71ad9), closes #1180 #1350 #1349

#1225: prepare tokenclass dataset for hf training (#1231) (ae5e7cd), closes #1225

#950: using record search_keywords for highlighting (#1235) (47616bf), closes #950 #1278 #1316 #1315

#981: add majority voter with multi label support (#1228) (8052aa8), closes #981

Introduce a 'text' argument for the TextClassificationRecord (#1246) (bb7d93e)

Bug Fixes

#1347: allow tooltip record overlapping in Token Classifier (#1352) (87174d3), closes #1347

#1103: remove "Error Distribution" from metrics (#1255) (b9bb5b4), closes #1103

#1149: fix vulnerable dependencies (node-sass) (#1263) (7f8c1d1), closes #1149

#1211: fix score scale (#1261) (8a72281), closes #1211

#1238: show prediction labels when annotating rule (#1239) (0321b88), closes #1238

#1241, #1245: show new line char in metrics plot & increase mentions in entity consistency (#1257) (38930cb), closes #1241 #1245

#1311: small defects about hover style (#1313) (442703c), closes #1311

#1320: render car return in Token Classifier (#1328) (b7f1b7b), closes #1320

#1335: force line break in rules summary (#1336) (2d77a76), closes #1335

#1337: number of records in the overall annotated coverage (#1338) (d384713), closes #1337

#1339: metrics and status not updated when the query is refreshed (#1340) (6fc0a58), closes #1339

#984: manage super user workspaces (#1268) (9b24921), closes #984 #1288 #1290

datasets: prevent error when no annotated records found in dataset (#1284) (c20028f)

install: make starlette an optional dependency (#1295) (32afb3d)

NER: create record annotation from tags (also in from_datasets) (#1283) (adcf1b1)

rules: store single-label rules with a comp. format for old versions (#1334) (eb310d3)

v0.12.1(Mar 11, 2022)
0.12.1 (2022-03-11)

Bug Fixes

#1238: show prediction labels when annotating rule (#1239) (6c1b975), closes #1238

v0.11.1(Mar 11, 2022)
0.11.1 (2022-03-11)

Bug Fixes

#1238: show prediction labels when annotating rule (#1239) (28e97c6), closes #1238

v0.12.0(Mar 8, 2022)
0.12.0 (2022-03-08)

Features

#1029: improve server api logging (#1148) (d4a121a), closes #1029 #1224

#1183: token classification fine-tuning (#1199) (2cdd30b), closes #1183

#1192: disable ssl verify for elasticsearch http client (#1193) (631a729), closes #1192

#950: include search keywords as part of record results (#1201) (2dd5853), closes #950

#970: header redesign (#1185) (fa9c639), closes #970 #1218 #1214 #1223

Implement 'prepare_for_training' for text classification datasets (#1209) (f7fd59c)

Bug Fixes

#1207: using api sdk wrapper for init (#1208) (2495c75), closes #1207

v0.11.0(Feb 20, 2022)
0.11.0 (2022-02-19)

Highlights

Introducing rb.Dataset* and 🤗 Hub integration

The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.

With this release, Rubrix users and teams can use the Hugging Face Hub to share and read both public and private Rubrix datasets for TextClassification, TokenClassification, and Text2Text datasets. This opens up a whole new world of possibilities for data reproducibility and sharing. Let's see an example:

    import rubrix as rb
    from datasets import load_dataset

    # 👧🏻 🏷️ Leire has labeled a text classification dataset using a local Rubrix instance
    dataset_rb = rb.load("text_classification_ds", as_pandas=False)

    # 👧🏻 exports a Rubrix Dataset to a hf Dataset
    dataset_ds = dataset_rb.to_datasets()

    # 👧🏻 🚀 Leire shares the labelled dataset with the world
    dataset_ds.push_to_hub("text_classification_ds")

    # 👨 John downloads the dataset from the Hugging Face Hub
    dataset_ds = load_dataset("leire/text_classification_ds", split="train")

    # 👨 reads in dataset
    dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification")

    # 👨 🏷️ logs the dataset and continues labeling with his own Rubrix instance
    rb.log(dataset_rb, "john_text_classification_ds")

You can read more at https://rubrix.readthedocs.io/en/stable/guides/datasets.html

For each record type, there’s a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.

Improving NER UI and UX

The UI for Token Classification has been completely redesigned to provide a better user experience for exploration and annotation. This is the first of a set of changes focusing on annotation productivity for token classification.

Features

#1051: keep predictions labels when annotating (#1077) (f1824ba), closes #1051

#1063: Token Classifier fine tuning content selection (#1084) (9e14d05), closes #1063

#1127: raise startup app error from es connection error (#1145) (7e7e9d8), closes #1127

#422: introducing the rb.Dataset* classes (#1109) (b5bbca6), closes #422

#821: token classifier show predictions in explore view (#1009) (6ba6764), closes #821

#951: new "not covered records by rules" filter (#991) (0649f2a), closes #951 #1156

Bug Fixes

#1140: fix/make client models more consistent (#1147) (926bb16), closes #1140

client: parse unauthorized api error properly (#1164) (1a5a08d)

search: prevent metrics computation breaks searches (#1175) (9f2adc9)

v0.10.0(Feb 20, 2022)
0.10.0 (2022-02-12)

Now you can use filters in the Define Rules mode (weak labeling). These filters are useful for seeing the impact of rules on specific dataset subsets (e.g., records with certain metadata fields, annotated records, etc.).

Features

#1061: unify records results title (#1111) (54ebb15), closes #1061

#982: show filters in labelling rules view (#1038) (7ff677b), closes #982

Bug Fixes

#1054: reduce collapsable area. Optimize for annotation (#1106) (48024ba), closes #1054

#1054: remove old scroll padlock button (a1d6444), closes #1054

#1094: remove computed record fields returned in API results (#1095) (cd61d1e), closes #1094

#831: Remove sort field when only one is applied (#1116) (36b276b), closes #831

convert pd.NaT to None for event_timestamp (#1105) (21e78e4)

v0.9.0(Feb 4, 2022)
🎉 0.9.0 (2022-02-02)

Improve logging

Small improvements to the labelling module and weak labeling mode

Better setup documentation (python -m rubrix)

Features

#932: label models now modify the prediction_agent when calling LabelModel.predict (#1049) (4a024ee), closes #932

#953: add additional metrics to LabelModel.score method (#979) (2887907), closes #953

#955: add default for rules in WeakLabels (#976) (34389d3), closes #955 #1011

Bug Fixes

#1045: calculate overall precision from overall correct/incorrect in rules (#1086) (1c76d81), closes #1045 #1087

#1053: metadata modal position (#1068) (09b88cc), closes #1053 #1053

#1054: optimize Long records (#1080) (fdd797a), closes #1054

#1067: fix rule definition link when no labels are defined (#1069) (eb958bf), closes #1067

#1081: prevent add records of different task (#1085) (5296e52), closes #1081 #1081

#924: parse new error format in UI (#1082) (f26c79c), closes #924

v0.8.2(Jan 31, 2022)
0.8.2 (2022-01-31)

Features

#1036: remove prediction ok/ko in labelling rules (#1037) (672b852), closes #1036

#735: add warning when agent but no prediction/annotation is provided (#987) (ba88c34), closes #735

Bug Fixes

#1008: set the event_timestamp when annotating (#1024) (c24fdad), closes #1008

#1015: manage emojis in Token Classification records (#1016) (8b570fb), closes #1015

#1023: handle elasticsearch connection problems on server startup (#1030) (e8c8d86), closes #1023

#1027: Improve client models by reordering fields + forbidding extra args (#1032) (6c1ae7f), closes #1027

#1028: Add videos to Monitoring tutorial (#1033) (6ff3326), closes #1028

#1050: generalizes entity span validation (#1055) (37207bc), closes #1050

#1058: sort by % data in rules list (#1062) (9735f22), closes #1058

#1065: 'B' tag for beginning tokens (#1066) (a5ed329), closes #1065

cleanlab: set cleanlab n_jobs=1 as default (#1059) (189cbcb)

v0.8.1(Jan 20, 2022)
0.8.1 (2022-01-20)

Bug Fixes

#1002: Show 0 records overall metrics when no rules defined (#1013) (a8a5c79), closes #1002 #1002

Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844

statics: handle 404 errors for static files (#1006) (f4b656a)

#800: compute common aggregations one by one (#990) (8cf420a), closes #800

#800: limit number of metadata fields (#993) (bb6b76b), closes #800

#905: copy dataset with rules (#948) (8597b83), closes #905

#974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974

#977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977

v0.8.1-alpha.3(Jan 20, 2022)

v0.8.1-alpha.2(Jan 20, 2022)
0.8.1-alpha.2 (2022-01-20)

Bug Fixes

#1002: Show 0 records overall metrics when no rules defined (#1007) (a890e17), closes #1002 #1002

Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844

statics: handle 404 errors for static files (#1006) (f4b656a)

v0.8.1-alpha.1(Jan 19, 2022)

v0.8.1-alpha.0(Jan 19, 2022)
0.8.1-alpha.0 (2022-01-19)

Bug Fixes

#800: compute common aggregations one by one (#990) (8cf420a), closes #800

#800: limit number of metadata fields (#993) (bb6b76b), closes #800

#905: copy dataset with rules (#948) (8597b83), closes #905

#974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974

#977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977

v0.8.0(Jan 12, 2022)
Introducing interactive Weak labeling (Define rules mode) 🚀

We are glad to introduce the most important feature to date: now it's possible to iterate on labeling queries directly in the UI with initial support for multi-class text classification. Multilabel and token classification support is coming soon.

See the video for the recommended workflow:

https://user-images.githubusercontent.com/1107111/149346471-93cbd7ee-96a2-451a-8f5e-f9e26b246407.mp4

Check the updated tutorial: https://rubrix.readthedocs.io/en/master/tutorials/weak-supervision-with-rubrix.html
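
The rule metrics that several entries below refer to (coverage, annotated coverage, precision) can be illustrated with a small standalone sketch. This is not Rubrix's implementation or API: `Record`, `rule_metrics`, and the Python predicate standing in for a labeling query are hypothetical names, and the metric definitions are the usual weak-supervision ones, assumed here for illustration. Undefined ratios are reported as NaN rather than 0, in the spirit of the "show nan instead of 0 for precision" and "prevent division by zero" fixes.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a text classification record."""
    text: str
    annotation: Optional[str] = None  # human-provided label, if any


def rule_metrics(records: List[Record], matches: Callable[[str], bool], label: str) -> dict:
    """Compute coverage, annotated coverage and precision for one rule.

    `matches` plays the role of the labeling query; `label` is the class
    the rule assigns to every record it matches.
    """
    matched = [r for r in records if matches(r.text)]
    annotated = [r for r in records if r.annotation is not None]
    matched_annotated = [r for r in matched if r.annotation is not None]
    correct = sum(1 for r in matched_annotated if r.annotation == label)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising ZeroDivisionError when undefined.
        return numerator / denominator if denominator else math.nan

    return {
        "coverage": ratio(len(matched), len(records)),
        "annotated_coverage": ratio(len(matched_annotated), len(annotated)),
        "precision": ratio(correct, len(matched_annotated)),
    }


records = [
    Record("great movie, loved it", "positive"),
    Record("great cast but a dull plot", "negative"),
    Record("terrible pacing", "negative"),
    Record("a great soundtrack"),  # not annotated yet
]
metrics = rule_metrics(records, lambda text: "great" in text, label="positive")
print(metrics)
```

In this toy dataset the rule matches three of four records, two of the three annotated ones, and is right on one of those two. Iterating on the query and watching these numbers change is the kind of loop the define-rules mode is meant to support.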

What's changed

[WeakSupervision] Change load_rules import path in guide and tutorial (#939)

fix links to new web app reference (#936)

Bugfixes/avoid infinite loop when dataset loading (#934)

show nan instead of 0 for precision in summary (#930)

fix(api): include_metrics param only for search endpoints (#929)

[Documentation] Update title page video for docs (#928)

update skweak tutorial (#922)

[Documentation] Updating the web app docu (#827)

publish python package to test.pypi for master and releases branches (#927)

[WeakLabels] Align WeakLabels.summary() with web app (#925)

UI: show rules without precision properly (#919)

chore(build): build docker images for release branches (#921)

Docs: Updates readme front video (#923)

Docs: Updates weak supervision resources (#920)

feat(rules): compute total & ann. coverage before label selection (#916)

fix(rules): compute annotated coverage when no label properly (#915)

Tutorial: Human-in-the-loop weak supervision with skweak (#869)

UI: include affected #records to overall coverage/ann. coverage metrics (#914)

fix lint build (#913)

UI: manage precision and rules without annotation coverage (#909)

fix(#876): process 400 response detail properly (#889)

feat(rules): allow compute partial query rule metrics (#907)

fix(security): providing default workspace should pass check (#911)

UI: reset filters from define rules view (#908)

UI: Show number of created rules in rules management view (#910)

UI: drop access to rule name field (#904)

fix(rules): prevent lost rules with dataset updates (#892)

fix(datasets): process owner as part of dataset id (#870)

(UI) Rules summary metrics format (#888)

UI: Improve code snippet for empty workspace (#886)

fix(UI): Remove case sensitive when filtering labels (#882)

Docs: Updates Flair zeroshot tutorial (#887)

removing wrong video (#885)

Update readme (#883)

fix(UI) Metrics value by default if no metric (#875)

feat(metrics): add token level metrics for token classification from client (#849)

UI: New rule metrics layout (#861)

chore: expose load_rules from base module (#866)

Docs: Regenerates graphs metrics guide (#865)

updating loss video (#864)

Docs: Update weak supervision guide (#863)

Update README.md (#862)

Fix: Link loss tutorial (#859)

Docs: Improve loss tutorial (#858)

Docs: Improve AL and ws tutorials (#857)

chore(ci): Include component testing configuration (#839)

fix/loss video updated (#853)

Docs: Weak supervision guide update (#855)

chore(app): upgrade lint dependencies (#841)

feat: weak supervision mode (#814)

Docs: Review hf tutorial (#852)

fix: error link to workspace home (#845)

fix(metrics): compute token length for each token (#850)

add streaming (#851)

fix(rules): prevent division by 0 for overall metrics (#848)

small change

[Tutorials] Update media structure, remove TLDR heading (#847)

Updating videos and images for sentiment classification tutorial (#846)

fix(rules): prevent division by zero (#843)

new folder and videos for model loss tutorial (#805)

feat(token class): add metrics at token level (#838)

new folder and images for active learning tutorial (#796)

[Tutorials] Typo fix in find label errors tutorial (#842)

[Tutorials] Add the new find_label_errors tutorial (#833)

[Rule] Modify the client API to the server's weak supervision feature (#840)

[LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)

feat(search): allow filtering record metrics fields in search (#837)

fix(ui): remove workspace home from code snippet api url (#834)

ui: Hide validate button for binary cases in Text classifier (#830)

fix print message (#829)

feat: Include workspace in url path (#820)

fix(ui): align records and global action layouts (#825)

fix(ui): Show labels as selected after validate (#826)

feat(labeling rule): implements api endpoint to fetch a single rule (#817)

[LabelErrors] Add find_label_errors method (#775)

fix(ui): Fix styles in Safari (#815)

docs: Add contributors to readme (#822)

add missing rubrix import (#819)

new folder and images for spacy tutorial (#794)

feat(labeling rules): allow edition for rule label and description (#813)

refactor(labeling rules): optional label for rule metrics (#811)

Fix token alignment on CreationTokenClassificationRecord (#812)

feat(server): add overall dataset labeling rules metrics (#807)

feat(labeling rules): add coverage for annotated records (#806)

fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)

new folder and images for zeroshot ner tutorial (#804)

new folder and images for zeroshot data annotation tutorial (#803)

fix(log): check multi-label integrity without search aggregations (#802)

updated images, added folder for fastapi tutorial (#801)

added folder for weak supervision tutorial (#795)

feat(weak supervision): client labeling rules from server (#799)

feat(server): labeling rule metrics (#790)

fix/edit zero-shot tutorial (#774)

fix/edited fastapi tutorial (#773)

Fix/edit ner flair tutorial (#766)

Fix/edit weaksupervision tutorial (#759)

fix(ui): Little changes in fonts (#793)

fix(ui): Allow open dataset in new tab from datasets list (#792)

feat(server): rubrix namespaces for elasticsearch indices (#789)

fix(ui): Show annotation after global validation (#786)

remove reload arg launching server using python (#787)

updated readme with conda install instruction (#788)

fix(ui): Hide scroller component when loading or paginate (#784)

fix(ui): allow remove metadata filter from record metadata modal (#772)

fix(ui): Token Classifier: validate record without annotation or prediction (#782)

Fix/edit active learning tutorial (#760)

Docs: minor changes to loss tutorial (#778)

Fix/edit model loss tutorial (#767)

fix(server): missing deprecated dep (#777)

fix(ui): Global validate for records without annotation or prediction (#746)

Fix/edit spacy tutorial (#758)

Fix/edit labeling tutorial (#750)

fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)

[Requirements] Require python>=3.7 (#770)

[Labeling] Add FlyingSquid label model (#755)

Update README.md (#769)

Adds Flair example to guide (#762)

docs: Updates huggingface examples and adds monitor for Flair (#761)

feat(search): show boolean values in metadata (#753)

feat(server): allow handle labeling rules for datasets from API (#744)

fix(imports): import monitoring with spacy<3.0 fails (#754)

[UI] new fonts families (#751)

fix(scroll): using new scroll component (#710)

fix(ui): filter "validatable" records for global action validate button (#741)

feat(monitor): flair ner auto-monitor (#738)

New Contributors

@sugatoray made their first contribution

@ruanchaves made their first contribution

v0.8.0-alpha.1 (Jan 11, 2022)
Bugfixes/avoid infinite loop when dataset loading (#934)

show nan instead of 0 for precision in summary (#930)

fix(api): include_metrics param only for search endpoints (#929)

[Documentation] Update title page video for docs (#928)

update skweak tutorial (#922)

[Documentation] Updating the web app docu (#827)

revert test.pypi publish

publish python package to test.pypi for master and releases branches (#927)

[WeakLabels] Align WeakLabels.summary() with web app (#925)

UI: show rules without precision properly (#919)

chore(build): build docker images for release branches (#921)

Docs: Updates readme front video (#923)

Docs: Updates weak supervision resources (#920)

feat(rules): compute total & ann. coverage before label selection (#916)

fix(rules): compute annotated coverage when no label properly (#915)

Tutorial: Human-in-the-loop weak supervision with skweak (#869)

UI: include affected #records to overall coverage/ann. coverage metrics (#914)

fix lint build (#913)

UI: manage precision and rules without annotation coverage (#909)

fix(#876): process 400 response detail properly (#889)

feat(rules): allow compute partial query rule metrics (#907)

fix(security): providing default workspace should pass check (#911)

UI: reset filters from define rules view (#908)

UI: Show number of created rules in rules management view (#910)

UI: drop access to rule name field (#904)

fix(rules): prevent lost rules with dataset updates (#892)

fix(datasets): process owner as part of dataset id (#870)

(UI) Rules summary metrics format (#888)

UI: Improve code snippet for empty workspace (#886)

fix(UI): Remove case sensitive when filtering labels (#882)

Docs: Updates Flair zeroshot tutorial (#887)

removing wrong video (#885)

Update readme (#883)

fix(UI) Metrics value by default if no metric (#875)

feat(metrics): add token level metrics for token classification from client (#849)

UI: New rule metrics layout (#861)

chore: expose load_rules from base module (#866)

Docs: Regenerates graphs metrics guide (#865)

updating loss video (#864)

Docs: Update weak supervision guide (#863)

Update README.md (#862)

Fix: Link loss tutorial (#859)

Docs: Improve loss tutorial (#858)

Docs: Improve AL and ws tutorials (#857)

chore(ci): Include component testing configuration (#839)

fix/loss video updated (#853)

Docs: Weak supervision guide update (#855)

chore(app): upgrade lint dependencies (#841)

feat: weak supervision mode (#814)

Docs: Review hf tutorial (#852)

fix: error link to workspace home (#845)

fix(metrics): compute token length for each token (#850)

chore: improve dockerignore files

add streaming (#851)

fix(rules): prevent division by 0 for overall metrics (#848)

small change

[Tutorials] Update media structure, remove TLDR heading (#847)

Updating videos and images for sentiment classification tutorial (#846)

fix(rules): prevent division by zero (#843)

new folder and videos for model loss tutorial (#805)

feat(token class): add metrics at token level (#838)

new folder and images for active learning tutorial (#796)

[Tutorials] Typo fix in find label errors tutorial (#842)

[Tutorials] Add the new find_label_errors tutorial (#833)

[Rule] Modify the client API to the server's weak supervision feature (#840)

[LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)

feat(search): allow filtering record metrics fields in search (#837)

fix(ui): remove workspace home from code snippet api url (#834)

ui: Hide validate button for binary cases in Text classifier (#830)

fix print message (#829)

feat: Include workspace in url path (#820)

fix(ui): align records and global action layouts (#825)

fix(ui): Show labels as selected after validate (#826)

feat(labeling rule): implements api endpoint to fetch a single rule (#817)

[LabelErrors] Add find_label_errors method (#775)

fix(ui): Fix styles in Safari (#815)

docs: Add contributors to readme (#822)

add missing rubrix import (#819)

new folder and images for spacy tutorial (#794)

feat(labeling rules): allow edition for rule label and description (#813)

refactor(labeling rules): optional label for rule metrics (#811)

Fix token alignment on CreationTokenClassificationRecord (#812)

feat(server): add overall dataset labeling rules metrics (#807)

feat(labeling rules): add coverage for annotated records (#806)

fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)

new folder and images for zeroshot ner tutorial (#804)

new folder and images for zeroshot data annotation tutorial (#803)

fix(log): check multi-label integrity without search aggregations (#802)

updated images, added folder for fastapi tutorial (#801)

added folder for weak supervision tutorial (#795)

feat(weak supervision): client labeling rules from server (#799)

feat(server): labeling rule metrics (#790)

fix/edit zero-shot tutorial (#774)

fix/edited fastapi tutorial (#773)

Fix/edit ner flair tutorial (#766)

Fix/edit weaksupervision tutorial (#759)

fix(ui): Little changes in fonts (#793)

fix(ui): Allow open dataset in new tab from datasets list (#792)

feat(server): rubrix namespaces for elasticsearch indices (#789)

fix(ui): Show annotation after global validation (#786)

remove reload arg launching server using python (#787)

updated readme with conda install instruction (#788)

fix(ui): Hide scroller component when loading or paginate (#784)

fix(ui): allow remove metadata filter from record metadata modal (#772)

fix(ui): Token Classifier: validate record without annotation or prediction (#782)

Fix/edit active learning tutorial (#760)

Docs: minor changes to loss tutorial (#778)

Fix/edit model loss tutorial (#767)

fix(server): missing deprecated dep (#777)

fix(ui): Global validate for records without annotation or prediction (#746)

Fix/edit spacy tutorial (#758)

Fix/edit labeling tutorial (#750)

fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)

[Requirements] Require python>=3.7 (#770)

[Labeling] Add FlyingSquid label model (#755)

Update README.md (#769)

Adds Flair example to guide (#762)

docs: Updates huggingface examples and adds monitor for Flair (#761)

feat(search): show boolean values in metadata (#753)

feat(server): allow handle labeling rules for datasets from API (#744)

fix(imports): import monitoring with spacy<3.0 fails (#754)

[UI] new fonts families (#751)

fix(scroll): using new scroll component (#710)

fix(ui): filter "validatable" records for global action validate button (#741)

feat(monitor): flair ner auto-monitor (#738)

Full Changelog: https://github.com/recognai/rubrix/compare/v0.7.0...v0.8.0-alpha.0