122 Repositories
Python metadata-extractor Libraries
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
Open Semantic Search https://opensemanticsearch.org Integrated search server, ETL framework for document processing (crawling, text extraction, text a
A machine learning software for extracting information from scholarly documents
GROBID GROBID documentation Visit the GROBID documentation for more detailed information. Summary GROBID (or Grobid, but not GroBid nor GroBiD) means
PyTorch implementation of PSEnet with Pyramid Attention Network as feature extractor
Scene Text-Spotting based on PSEnet+CRNN PyTorch implementation of an end-to-end Text-Spotter with a PSEnet text detector and CRNN text recognizer. We
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
CUTIE TensorFlow implementation of the paper "CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor." Xiaohu
This bot can extract audio and subtitles from video files
Send any valid video file and the bot shows you the available streams in it that can be extracted!
Enumerate Microsoft 365 Groups in a tenant with their metadata
Enumerate Microsoft 365 Groups in a tenant with their metadata Description The all_groups.py script allows you to enumerate all Microsoft 365 Groups in a
Automatic Video Library Manager for TV Shows. It watches for new episodes of your favorite shows, and when they are posted it does its magic.
Automatic Video Library Manager for TV Shows. It watches for new episodes of your favorite shows, and when they are posted it does its magic. Exclusiv
Textpipe: clean and extract metadata from text
textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text into clean, readable text and extracting metadata
Module for automatic summarization of text documents and HTML pages.
Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim
HTML content / article extractor, web scraping library in Python
Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug 2011) been converted to a
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)
trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow
Extract embedded metadata from HTML markup
extruct extruct is a library for extracting embedded metadata from HTML markup. Currently, extruct supports: W3C's HTML Microdata embedded JSON-LD Mic
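As a rough illustration of what extracting embedded JSON-LD metadata from HTML involves, here is a minimal sketch using only the Python standard library. It does not use extruct and should not be read as extruct's API; the names `JsonLdExtractor` and `extract_jsonld` are hypothetical and exist only for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False  # True while inside a JSON-LD script element
        self._buffer = []        # text chunks of the current script element
        self.items = []          # successfully parsed JSON-LD documents

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: malformed blocks are skipped, not raised


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_jsonld(page_source)` on a page's HTML returns a list of the decoded JSON-LD objects. The sketch covers JSON-LD only and none of the other formats listed above.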
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li
Scrapes media, likes, followers, tags and all metadata. Inspired by instagram-php-scraper, bot
instagram_scraper This is a minimalistic Instagram scraper written in Python. It can fetch media, accounts, videos, comments etc. `Comment` and `Like`
Search for documents in a domain through Google. The objective is to extract metadata
MetaFinder - Metadata search through Google
Download song lyrics and metadata from Genius.com 🎶🎤
LyricsGenius: a Python client for the Genius.com API lyricsgenius provides a simple interface to the song, artist, and lyrics data stored on Genius.co
Python module for handling audio metadata
Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey's Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg The