A pipeline for making highlighted text stand-alone.

Paul Bricman

Last update: Dec 17, 2022

Related tags

Overview

title	emoji	colorFrom	colorTo	sdk	app_file	pinned
decontextualizer	📤	green	gray	streamlit	main.py	false

Decontextualizer

As a second step in improving our content consumption workflows, I investigated a new approach to extracting fragments from a content item before saving them for subsequent surfacing. While the lexiscore deals with content items on a holistic level -- evaluating entire books, articles, and papers -- I speculated then that going granular is a natural next step in building tools which help us locate specific valuable ideas in long-form content. The decontextualizer is a stepping stone in that direction, consisting of a pipeline for making text excerpts compact and semantically self-contained. Concretely, the decontextualizer is a web app able to take in an annotated PDF and automatically tweak the highlighted excerpts so that they make more sense on their own, even out of context.

Installation

The decontextualizer can either be deployed from source or using Docker.

Docker

To deploy the decontextualizer labeler using Docker, first make sure to have Docker installed, then simply run the following.

docker run -p 8501:8501 paulbricman/decontextualizer

The tool should be available at localhost:8501.

From Source

To set up the decontextualizer, clone the repository and run the following:

python3 -m pip install -r requirements.txt
streamlit run main.py

The tool should be available at localhost:8501.

Screenshots

You might also like...

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

5.6k Jan 3, 2023

box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.

Box is a text-based visual programming language inspired by Unreal Engine blueprint function graphs. $ cat factorial.box ┌─ƒ(Factorial)───┐

104 Dec 24, 2022

Export solved codewars kata challenges to a text file.

Codewars Kata Exporter Note:this is not totally my work.i've edited the project to make more easier and faster for me.you can find the original work h

4 Aug 13, 2021

Extract knowledge from raw text

Extract knowledge from raw text This repository is a nearly copy-paste of "From Text to Knowledge: The Information Extraction Pipeline" with some cosm

10 Dec 3, 2022

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

29 Dec 20, 2022

py-trans is a Free Python library for translate text into different languages.

Free Python library to translate text into different languages.

13 Aug 27, 2022

text-to-speach bot - You really do NOT have time for read a newsletter? Now you can listen to it

NewsletterReader You really do NOT have time for read a newsletter? Now you can listen to it The Newsletter of Filipe Deschamps is a great place to re

8 Sep 18, 2021

a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

2 Dec 14, 2022

Repository containing the code for An-Gocair text normaliser

Scottish Gaelic Text Normaliser The following project contains the code and resources for the Scottish Gaelic text normalisation project. The repo can

3 Jun 28, 2022

Comments

PDF Failures

Hello!

I attempted two different PDFs (which I can share in a DM or a email) — one was an older-pre computer PDF that had been OCRed professionally and hilighted. Another was a modern PDF, of a book published last year, also highlighted. Zotero was able to extract from both using pdfjs. When i use https://huggingface.co/spaces/paulbricman/decontextualizer i get:

File "/home/user/.local/lib/python3.8/site-packages/streamlit/script_runner.py", line 354, in _run_script
    exec(code, module.__dict__)
File "/home/user/app/main.py", line 17, in <module>
    components.add_section()
File "/home/user/app/components.py", line 46, in add_section
    excerpts = pdf_to_excerpts(filename)
File "/home/user/app/processing.py", line 48, in pdf_to_excerpts
    excerpt = extract_annot(annot, words)
File "/home/user/app/processing.py", line 24, in extract_annot
    quad_count = int(len(quad_points) / 4)

opened by brimwats1 4