Open source research tool to search, browse, analyze and explore large document collections with a semantic search engine and an open source text mining & text analytics platform. It integrates ETL for document processing, OCR for images & PDFs, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, and a search user interface & search apps for full-text search, faceted search & knowledge graphs.

Overview

Open Semantic Search

https://opensemanticsearch.org

Integrated search server, ETL framework for document processing (crawling, text extraction, text analysis, named entity recognition, and OCR for images and for images embedded in PDFs), search user interfaces, text mining, text analytics, and search apps for full-text search, faceted search, exploratory search and knowledge graph search.

Build

How to build the deb package for installation on a Debian or Ubuntu server, or the Docker images for running in Docker containers:

Build deb package

To build a deb package for Debian or Ubuntu, run the build script "build-deb" as the root user (switch to root with su or sudo su):

./build-deb

Build docker images

Clone the repository including its dependencies:

git clone --recurse-submodules --remote-submodules https://github.com/opensemanticsearch/open-semantic-search.git

Inside the open-semantic-search directory, build the Docker images using the docker-compose config docker-compose.yml:

cd open-semantic-search
docker-compose build

After these builds, all the Docker images/dependencies/services can be started together by docker-compose with the config file docker-compose.yml.

You can run the instance by typing:

docker-compose up
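Beyond a plain docker-compose up, a few standard docker-compose subcommands are useful for day-to-day operation. A sketch, run from the directory containing docker-compose.yml:

```shell
# Skip gracefully on machines without docker-compose or outside the repo:
command -v docker-compose >/dev/null 2>&1 || { echo "docker-compose not installed"; exit 0; }
[ -f docker-compose.yml ] || { echo "run this from the repository root"; exit 0; }

docker-compose up -d           # start all services in the background
docker-compose ps              # check that the containers came up
docker-compose logs --tail=50  # inspect recent service logs
docker-compose down            # stop and remove the containers
```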

You can browse Open Semantic Search in your favourite browser at this URL:

http://localhost:8080/search/

Automated tests

For CI/CD there are several different automated tests:

Integration tests

Since the submodule Open Semantic ETL uses and needs several powerful services like Solr, spaCy-services or Tika-Server via HTTP and REST APIs, the automated tests run as integration tests within the docker-compose environment configured in docker-compose.etl-test.yml, so these services are available while the unit tests run.
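A run of these integration tests might look like this (a sketch; the flags are standard docker-compose options, and --abort-on-container-exit assumes the test runner is one of the services, so the environment shuts down when it finishes):

```shell
# Skip gracefully on machines without docker-compose or outside the repo:
command -v docker-compose >/dev/null 2>&1 || { echo "docker-compose not installed"; exit 0; }
[ -f docker-compose.etl-test.yml ] || { echo "run this from the repository root"; exit 0; }

# Bring up Solr, spaCy-services, Tika-Server and the ETL test runner:
docker-compose -f docker-compose.etl-test.yml up --build --abort-on-container-exit

# Tear the test environment down again, including volumes:
docker-compose -f docker-compose.etl-test.yml down --volumes
```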

End to end tests

Some automated integration tests and end-to-end (E2E) tests run within a web browser controlled by the browser automation framework Playwright and the Node.js/JavaScript-based test framework Jest.

You can extend the automated tests in test/test.js.

They are run by the Docker image built from Dockerfile-test and need the services of the docker-compose environment docker-compose.test.yml.
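A sketch of running the E2E suite (the flags are standard docker-compose options and assume the Jest/Playwright container exits once the suite is done):

```shell
# Skip gracefully on machines without docker-compose or outside the repo:
command -v docker-compose >/dev/null 2>&1 || { echo "docker-compose not installed"; exit 0; }
[ -f docker-compose.test.yml ] || { echo "run this from the repository root"; exit 0; }

# Build the test image (Dockerfile-test) and run the browser-based tests
# against the services defined in docker-compose.test.yml:
docker-compose -f docker-compose.test.yml up --build --abort-on-container-exit
docker-compose -f docker-compose.test.yml down --volumes
```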

Dependencies

Dependencies are resolved automatically when building or installing the Debian or Ubuntu packages, or when building the Docker images.

Documentation on these dependencies, which may help with debugging dependency hell issues or with installations in other environments:

Build dependencies on source code (Git)

Dependencies on other Git repositories / submodules of components like Open Semantic ETL are defined in the Git config file .gitmodules.

The submodules are checked out automatically to the subdirectory "src" if you check out this repository by git in recursive mode.
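An entry in .gitmodules has the usual Git shape; the fragment below is illustrative (the submodule name and URL follow the pattern of the project's repositories, but check the real file for the actual list):

```ini
[submodule "src/open-semantic-etl"]
	path = src/open-semantic-etl
	url = https://github.com/opensemanticsearch/open-semantic-etl.git
```

If the repository was cloned without --recurse-submodules, the submodules can be fetched afterwards with git submodule update --init --recursive.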

Packaging dependencies of Java archives (JAR)

The submodules tika.deb and solr.deb need the JAR files of Apache Tika-Server and Apache Solr.

If not already present, they are downloaded from the Apache Software Foundation by wget in the submodule's "build" script or its "Dockerfile".

Installation dependencies on Debian/Ubuntu packages (DEB)

Dependencies on tools and libraries which are available in the Debian or Ubuntu package repositories are defined in the "Depends" section of the deb package config file DEBIAN/control:

https://github.com/opensemanticsearch/open-semantic-search/blob/master/DEBIAN/control
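The "Depends" field is a comma-separated list of package names with optional version constraints, in the standard deb control format. A hypothetical fragment (the package names here are examples, not the project's actual dependency list):

```
Package: open-semantic-search
Architecture: all
Depends: apache2, libapache2-mod-wsgi-py3, python3 (>= 3.7), default-jre
```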

Installation dependencies on Python packages (PIP)

Dependencies on Python libraries which are not available as packages of the Linux distribution but are in the Python Package Index (PyPI) are defined in:

https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/requirements.txt

These dependencies are installed automatically on installation of the Debian/Ubuntu packages by DEBIAN/postinst, or by docker build as configured in the Dockerfile, via:

pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt

Comments
  • Docker container

    Docker container

    Separation into / configuration of Docker containers for separation of services and easier deployment on (multiple) server(s).

    After more usage of / migration to REST APIs / microservices like https://github.com/opensemanticsearch/open-semantic-search-apps/issues/19 and implementation of the Open Semantic Entity Search API, separation into Docker containers should be easier, since generation/updating of configs can be done on different/distributed servers over the network.

    enhancement funded by grant or order packaging 
    opened by opensemanticsearch 23
  • delete facets - OSS Desktop

    delete facets - OSS Desktop

    Hello, newbie in OSS.

    Trying to delete all the tags and facets (author, person, message to, etc.) from OSS Desktop, but it didn't work. Tried it over the Django web UI, deleting all facets, but that didn't work either. Is it possible to do? That's a lot of unnecessary information for us. And the tags sometimes grow too much. I would like to make a tag only with the folder the file is in. Is that possible?

    Thanks

    question 
    opened by rafael844 13
  • Status page for ETL / document processing

    Status page for ETL / document processing

    Status page showing the count of not yet fully processed documents / running ETL tasks and occurred ETL errors like failed plugins or access rights errors.

    enhancement ui etl ocr feature wish 
    opened by opensemanticsearch 11
  • urllib.error.HTTPError: HTTP Error 500: Server Error (indexing not working anymore)

    urllib.error.HTTPError: HTTP Error 500: Server Error (indexing not working anymore)

    I have an open-semantic-search server running since March on an up-to-date Debian 9.6 VM (KVM) with more than 4 million documents successfully indexed. Everything was running smoothly until a few days ago, but now indexing new files always results in an error (and the files are not indexed anymore):

     Traceback (most recent call last):
      File "/usr/bin/opensemanticsearch-index-dir", line 274, in <module>
        connector.commit()
      File "/usr/lib/python3/dist-packages/opensemanticetl/etl.py", line 295, in commit
        self.exporter.commit()
      File "/usr/lib/python3/dist-packages/opensemanticetl/export_solr.py", line 220, in commit
        request = urllib.request.urlopen( uri )
      File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.5/urllib/request.py", line 472, in open
        response = meth(req, response)
      File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.5/urllib/request.py", line 510, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 500: Server Error
    

    I have no idea what is wrong with my machine and what to look at. Adding more disk space as suggested in issue #115 did not solve the problem (I have 238GB free and 154GB used). The machine has plenty of RAM (15GB, but never uses more than 10) and 8 cores available from the host, and everything was running well for almost a year.

    Thanks in advance for your help. I can provide more information if you need.

    opened by woxop 10
  • Some word documents not indexed properly - text body not accessible to the search engine

    Some word documents not indexed properly - text body not accessible to the search engine

    Hi

    I have the desktop version installed on my system with the intention of tagging a large database of Word documents and finding references across the whole database. So far so good, but as I'm running tests with a subsection of documents (600-700), I'm finding that some of them won't be properly indexed, as the search engine can't access the text body. On search, these documents merely return results in the document title but nothing of the text body, and the preview tab is empty. However, I am able to open them directly through LibreOffice, so it can't be a problem with permissions.

    It works fine for the majority of documents so I'm at a loss what the difference is.

    Any help would be greatly appreciated

    bug 
    opened by Akhanaten 9
  • Open Semantic Search Server Installation fails on Ubuntu 16.04.3 LTS (Xenial Xerus)

    Open Semantic Search Server Installation fails on Ubuntu 16.04.3 LTS (Xenial Xerus)

    I am trying to install Open Semantic Search Server on an AWS vm with Ubuntu 16.04.3 LTS (Xenial Xerus) by using ALL IN ONE Package: https://www.opensemanticsearch.org/download/open-semantic-search_18.02.02.deb.

    Below are some of the errors I am getting in installation traces, any help on it would be great:

    apache2_reload: Your configuration is broken. Not reloading Apache 2 apache2_reload: apache2: Syntax error on line 216 of /etc/apache2/apache2.conf: Could not open configuration file /etc/apache2/conf-enabled/solr-php-ui.conf: No such file or directory

    However, I checked and I can see solr-php-ui.conf in the path /etc/apache2/conf-enabled.

    ImportError: No module named 'opensemanticetl' chown: cannot access '/var/lib/opensemanticsearch/db.sqlite3': No such file or directory chmod: cannot access '/var/lib/opensemanticsearch/db.sqlite3': No such file or directory

    Also once the installation completes, I get 500 error on clicking effectively on all the menu items. And in error logs of apache2, I am getting numerous wsgi:error like:

    [Wed Feb 07 10:32:01.849612 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] mod_wsgi (pid=10969): Target WSGI script '/var/lib/opensemanticsearch/opensemanticsearch/wsgi.py' cannot be loaded as Python module. [Wed Feb 07 10:32:01.849660 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] mod_wsgi (pid=10969): Exception occurred processing WSGI script '/var/lib/opensemanticsearch/opensemanticsearch/wsgi.py'. [Wed Feb 07 10:32:01.849764 2018] [wsgi:error] [pid 9645] [client 127.0.0.1:38168] mod_wsgi (pid=9645): Target WSGI script '/var/lib/opensemanticsearch/opensemanticsearch/wsgi.py' cannot be loaded as Python module. [Wed Feb 07 10:32:01.849814 2018] [wsgi:error] [pid 9645] [client 127.0.0.1:38168] mod_wsgi (pid=9645): Exception occurred processing WSGI script '/var/lib/opensemanticsearch/opensemanticsearch/wsgi.py'. [Wed Feb 07 10:32:01.849817 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] Traceback (most recent call last): [Wed Feb 07 10:32:01.849864 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] File "/var/lib/opensemanticsearch/opensemanticsearch/wsgi.py", line 14, in [Wed Feb 07 10:32:01.849870 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] application = get_wsgi_application() [Wed Feb 07 10:32:01.849880 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] File "/usr/lib/python3/dist-packages/django/core/wsgi.py", line 14, in get_wsgi_application [Wed Feb 07 10:32:01.849885 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] django.setup() [Wed Feb 07 10:32:01.849893 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] File "/usr/lib/python3/dist-packages/django/init.py", line 18, in setup [Wed Feb 07 10:32:01.849898 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] apps.populate(settings.INSTALLED_APPS) [Wed Feb 07 10:32:01.849907 2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] File "/usr/lib/python3/dist-packages/django/apps/registry.py", line 78, in populate [Wed Feb 07 10:32:01.849911 
2018] [wsgi:error] [pid 10969] [client 127.0.0.1:38166] raise RuntimeError("populate() isn't reentrant")

    bug 
    opened by mohitsahay 9
  • Hierarchy/Taxonomy/Tree/Subclasses for faceted search / interactive filters / named entity overviews

    Hierarchy/Taxonomy/Tree/Subclasses for faceted search / interactive filters / named entity overviews

    Since, thanks to SKOS thesauri, concepts or entities can now have relations like broader or narrower or a tree structure: support these taxonomies/hierarchies and RDF (sub)classes in the search UI for faceted search.

    enhancement ui skos funded by grant or order feature wish 
    opened by Mandalka 9
  • Filesystem monitoring *.deb install fails on Ubuntu Xenial

    Filesystem monitoring *.deb install fails on Ubuntu Xenial

    I'm trying to install the filesystem monitoring *.debs on Ubuntu Xenial, but I receive the following error message:

    opensemanticsearch-trigger-filemonitoring : Depends: opensemanticsearch-connector-files (>= 0)
    but it is not installable
    

    What am I doing wrong?

    FWIW here are all setup steps done on this server:

    sudo apt install ./open-semantic-search-server-ubuntu_xenial_16.10.10.deb
    sudo apt-get install unoconv
    sudo apt install ./opensemanticsearch-trigger-filemonitoring_15.06.26_all.deb
    ...
    opensemanticsearch-trigger-filemonitoring : Depends: opensemanticsearch-connector-files (>= 0)
    but it is not installable
    
    opened by marbetschar 8
  • Thesaurus (SKOS)

    Thesaurus (SKOS)

    Enhance Named Entities Manager towards Thesaurus Manager:

    • Consolidate models from tagging app to use names or concepts from thesaurus manager.
    • Since concepts or entities can now have relations like broader or narrower or a tree structure: support these taxonomies/hierarchies in the search UI for faceted search
    • Use the alternate labels, aliases, synonyms and hidden labels / misspellings not only for faceted search but for full-text search, too, i.e. by exporting to synonyms.txt for Solr or by enriching the search query
    • Add UI element for choice of language for multilingual thesaurus (to set/change field "lang" in models)
    enhancement ui rdf skos 
    opened by opensemanticsearch 8
  • no data export to Neo4J

    no data export to Neo4J

    On a fresh Debian stretch: there is no export of entities to Neo4j.

    Version: open-semantic-search_18.12.23

    Already installed: py2neo + export_neo4j, and activated in the config!

    No logs; is something missing?

    opened by ronaldoviber 7
  • Permission Error while uploading an ontology on Open Semantic Search Server

    Permission Error while uploading an ontology on Open Semantic Search Server

    I am using Open Semantic Search Server from https://www.opensemanticsearch.org/download/open-semantic-search_18.02.07.deb and trying to upload a user-defined ontology from Manage Structure -> Ontologies -> Add New Ontology.

    After clicking on Save, I am being navigated to a Django Error Page (in attachment) that states - PermissionError at /ontologies/create [Errno 13] Permission denied: '/var/solr/data/core1/conf/named_entities/tmp_ocr_dictionary.txt' -> '/etc/opensemanticsearch/ocr/dictionary.txt'

    On disabling Django Debug, all I am getting is 500 error - nothing much in logs that I have.

    My basic aim here is to use the ontology for tagging documents which is not happening at present. PermissionError at _ontologies_create.zip

    Any assistance or leads on it would be great :)

    bug 
    opened by mohitsahay 7
  • bug regarding the mapping of file paths?

    bug regarding the mapping of file paths?

    Hey,

    first of all, amazing piece of work that you've done! While indexing our server we've probably encountered a bug regarding the mapping of file paths:

    Our current mapping is defined in /etc/opensemanticsearch/connector-files, with the aim of making search results directly accessible through an Apache server: config['mappings'] = { "/mnt/server/": "http://192.168.2.20/server/" } Additionally we made a ln -s /mnt/server /var/www/html/server This works great so far for nearly every file.

    But for any file (like .pdf) inside E-Mails (.msg) or E-Mail archives (*.pst), which means for every E-Mail attachment, the mapping results in: http://192.168.2.20/mnt/server/... instead of: http://192.168.2.20/server/...

    So, the question is, did we make any mistake regarding the configuration or is there a bug regarding attachments?

    Best regards Josef

    opened by josefkarlkraus 1
  • Is it possible to deactivate the standard Solr tags like: currency, phone numbers, money, law clause, ...

    Is it possible to deactivate the standard Solr tags like: currency, phone numbers, money, law clause, ...

    Hi, I installed the latest Open Semantic Search version as a deb package on my Ubuntu 22 LTS Hyper-V machine. I'd like to use OSS for our roughly 1700 docx documentations of non-standard features of our software. The indexing of the docs worked without any problems.

    My problem is: by default all docx files are tagged with multiple default tags; I think they came from Apache Solr!? Here are some examples: (screenshot attached)

    Is it possible to deactivate these tags in Apache Solr? I tried the following, which didn't work:

    • Delete / Clear the index of OSS
    • Edit the {{{/var/solr/data/opensemanticsearch/conf/_schema_analysis_synonyms_skos.json}}} and delete all entries except one
    • Reload the 2 OMS Cores with {{{curl http://localhost:8983/solr/admin/cores?action=RELOAD&core=opensemanticsearch}}} {{{curl http://localhost:8983/solr/admin/cores?action=RELOAD&core=opensemanticsearch-entities}}}
    • After reindexing my docx files, the default tags are still there
    • I have looked inside the '/var/opensemanticsearch/db' SQLite DB, too, but didn't find anything useful

    Does anyone have a hint how to get rid of the default tags?

    opened by Aculo0815 3
  • Crawler configured by Datasources UI only crawls Startpage, although option "Crawl full domain..."

    Crawler configured by Datasources UI only crawls Startpage, although option "Crawl full domain..."

    Hello,

    I am trying to crawl a web page (full domain) but nothing more than the start page ever gets crawled. In the Datasources UI I tried http and https, with www and without, with a trailing slash and without. It never works. I would expect the crawler to follow the links found on the start page. I have no idea why it does not work as expected.

    (The whole installation was made on bullseye with "one command" as documented in https://opensemanticsearch.org/doc/admin/install/search_server/)

    opened by rafkamonday 1
  • ETL/Dataimport: Generic JSON importer/ingestor

    ETL/Dataimport: Generic JSON importer/ingestor

    Generic ingestor for JSON, so that existing Open Semantic ETL plugins like named entity extraction by thesaurus and ontologies, stemming dependent on language detection and settings, and indexing to Open Semantic Search schemas can be used automatically for many additional data sources without developing a specialized importer.

    enhancement etl 
    opened by opensemanticsearch 0
  • OperationalError at /setup/

    OperationalError at /setup/

    I am trying to set up a dev environment with round trip from source to build on a server. I'm most of the way there, and the main menu looks pretty good. But when I press the Config button I get:

    OperationalError at /setup/
    no such table: setup_setup
    
    Exception Location: /usr/local/lib/python3.9/dist-packages/django/db/backends/sqlite3/base.py, line 423, in execute


    Any advice? thanks!

    opened by LYPratt 4
Owner
Open Semantic Search
Search, analyze and explore large document collections by Open Source Search Engine, Text Mining, Document analysis and Text Analytics Explorer