Index different CKAN entities in Solr, not just datasets

Overview

ckanext-sitesearch

Requirements

This extension requires CKAN 2.9 or higher and Python 3.

Features

Search actions

ckanext-sitesearch allows Solr-powered searches on the following CKAN entities:

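Entity        | Action              | Permissions                                | Notes
--------------|---------------------|--------------------------------------------|-----------------------
Organizations | organization_search | Public                                     |
Groups        | group_search        | Public                                     |
Users         | user_search         | Sysadmins only                             |
Pages         | page_search         | Public (individual page permissions apply) | Requires ckanext-pages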

All *_search actions support most of the same parameters as package_search, except the facet* and include_* ones. Supported parameters include q, fq, rows, start and sort.
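
As a rough illustration of the parameters above, the following sketch builds the URL and JSON body for a POST to one of the search actions through CKAN's Action API (/api/3/action/<action>). The site URL, the helper name build_search_request and the parameter values are assumptions made for the example; they are not part of the extension.

```python
import json


def build_search_request(site_url, action, **params):
    """Return (url, json_body) for a POST to a *_search action.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    # The README says the facet* and include_* parameters are not supported.
    blocked = sorted(k for k in params if k.startswith(("facet", "include_")))
    if blocked:
        raise ValueError("Unsupported parameters: " + ", ".join(blocked))
    url = "{}/api/3/action/{}".format(site_url.rstrip("/"), action)
    return url, json.dumps(params, sort_keys=True)


url, body = build_search_request(
    "https://ckan.example.org/",  # hypothetical site URL
    "organization_search",
    q="transport",
    rows=5,
    start=10,
)
print(url)
print(body)
```

The resulting pair can be sent with any HTTP client, using a Content-Type of application/json. For user_search the request would also need to carry a sysadmin's API token.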

In all actions, the output also matches that of package_search: an object with a count key and a results key, which is a list of the corresponding entity dicts (i.e. the result of organization_show, user_show, etc.):

{
    "count": 2,
    "results": [
        <entity dict>,
        <entity dict>
    ]
}
    

Additionally, the plugin registers a site_search action that searches across all entities the user is allowed to search, including datasets. The result is an object with one key for each entity type the user has permission to search. For instance, for a sysadmin user, who has access to all searches:

{
    "datasets": <results object>,
    "organizations": <results object>,
    "groups": <results object>,
    "users": <results object>,
    "pages": <results object>
}

For each item, the results object is the one described above (count and results keys).

Note that all parameters are passed unchanged to each of the search actions, so this site-wide search is mostly useful for free-text searches like q=flood.
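
To make the shape of a site_search result concrete, here is a small sketch that reduces such a result to a hit count per entity type. The sample data is invented for illustration (including the entity names), and it assumes a non-sysadmin caller, so the users key is absent; summarize_site_search and result_names are hypothetical helpers, not functions provided by the extension.

```python
def summarize_site_search(result):
    """Map each entity type present in a site_search result to its hit count."""
    return {entity: block["count"] for entity, block in result.items()}


def result_names(result, entity):
    """Return the names of the returned entities, or [] if the key is absent."""
    block = result.get(entity)
    if block is None:
        # The caller has no permission to search this entity type.
        return []
    return [item.get("name") for item in block["results"]]


# Invented sample for q=flood, as seen by a user who is not a sysadmin.
sample = {
    "datasets": {"count": 2, "results": [{"name": "flood-maps"}, {"name": "flood-alerts"}]},
    "organizations": {"count": 1, "results": [{"name": "flood-agency"}]},
    "groups": {"count": 0, "results": []},
    "pages": {"count": 0, "results": []},
}

print(summarize_site_search(sample))
print(result_names(sample, "datasets"))
print(result_names(sample, "users"))
```

Because count reports the total number of matches while results holds only the current page (controlled by rows and start), the two can differ in real responses.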

CLI

The plugin includes a ckan command to reindex in Solr the entities currently stored in the database:

ckan sitesearch rebuild <entity_type>

Where entity_type is one of organizations, groups, users or pages. You can also pass the id or name of a particular entity to index just that particular one:

ckan sitesearch rebuild organization department-of-transport

Check the command help for additional options:

ckan sitesearch rebuild --help
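
For a full reindex of every entity type, the command can be repeated once per type. The sketch below only assembles the command lines. The rebuild_command helper is hypothetical, and the config path is the default location mentioned in the installation steps, passed through the ckan CLI's -c option.

```python
import shlex

# Entity types listed above. "pages" only applies if ckanext-pages is enabled.
ENTITY_TYPES = ("organizations", "groups", "users", "pages")


def rebuild_command(entity_type, config=None):
    """Return the argument list for one `ckan sitesearch rebuild` call."""
    cmd = ["ckan"]
    if config:
        cmd += ["-c", config]
    cmd += ["sitesearch", "rebuild", entity_type]
    return cmd


commands = [
    rebuild_command(entity_type, config="/etc/ckan/default/ckan.ini")
    for entity_type in ENTITY_TYPES
]
for cmd in commands:
    # Quote each part so the line can be pasted into a shell as-is.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) from within the CKAN virtual environment to actually run the reindex.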

Installation

To install ckanext-sitesearch:

  1. Activate your CKAN virtual environment, for example:

    . /usr/lib/ckan/default/bin/activate

  2. Clone the source and install it on the virtualenv

    git clone https://github.com/okfn/ckanext-sitesearch.git
    cd ckanext-sitesearch
    pip install -e .
    pip install -r requirements.txt

  3. Add sitesearch to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).

  4. Restart CKAN

Config settings

None at present

Developer installation

To install ckanext-sitesearch for development, activate your CKAN virtualenv and do:

git clone https://github.com/okfn/ckanext-sitesearch.git
cd ckanext-sitesearch
python setup.py develop

Tests

To run the tests, do:

pytest --ckan-ini=test.ini

License

AGPL

Comments
  • Add ISiteSearch interface

    This PR adds a new ISiteSearch interface that lets users hook in specific logic before and after each type of search.

    This is inspired by before_package_search and after_package_search methods implemented in IPackageController:

    opened by pdelboca 3
  • Fix tests for 2.10

    This is a Work in Progress:

    When run against master, this Blueprint test fails with a CSRF error.

    I think it is related to the changes made in https://github.com/ckan/ckan/pull/7096/files, so I have updated the test to use Authorization instead of REMOTE_USER.

    Not sure what the strategy will be for updating these tests.

    opened by pdelboca 1
  • Properly index organization when creating a package

    This PR addresses the scenario of updating an organization's package_count field when a package is created via the UI. To do this, we need to detect when the package_update action changes the package's state from draft to active.

    More Info:

    Chaining package_create only covers the API and CLI scenarios (meaning that we update an org's package_count when a new dataset is added).

    However, when a package is created using the UI, package_create receives a draft dataset. This means that the call to reindex the owner_org will not update the package_count attribute, since draft datasets are not counted.

    The change of the package state from draft to active happens later in the process, when a resource is submitted and package_update is called.

    opened by pdelboca 1
  • Rebuild Organizations on package_update

    Following #3, this PR covers the case where the package's organization is updated.

    We should trigger a rebuild of both the new org and the old org so that the package_count in the index is correct for each.

    opened by pdelboca 1
  • Pages module is not loaded at import time using cli

    When executing the command ckan sitesearch rebuild pages the current implementation throws a NameError:

    Traceback (most recent call last):
      File "/usr/lib/ckan/venv/bin/ckan", line 8, in <module>
        sys.exit(ckan())
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/ckanext/sitesearch/cli.py", line 56, in rebuild
        rebuild_pages(defer_commit, force, quiet, entity_id)
      File "/usr/lib/ckan/venv/lib/python3.8/site-packages/ckanext/sitesearch/lib/rebuild.py", line 91, in rebuild_pages
        pages = Page.pages()
    NameError: name 'Page' is not defined
    

    This is because this condition evaluates to false when the module is loaded:

        if plugin_loaded("pages"):
            from ckanext.pages.db import Page
    

    Moving this import inside the method works properly, although I'm not sure why, or how CKAN's CLI manages imports and plugin loading.

    I cannot understand why this import worked when it was in the cli.py module, but not in lib/rebuild.py. 🤔

    opened by pdelboca 1
  • Add rebuild module

    Issue

    Currently, when a package is created or deleted, the organization/group index is not refreshed. This means that calls to organization_search/group_search will not return an up-to-date package_count attribute.

    Solution

    • Refactored all the rebuild logic used in the cli command into a new lib called rebuild
    • Created a package_create chained action that calls a rebuild on the organization if the new dataset actually belongs to an organization.
    • Created a package_delete chained action that iterates through all the package's groups and rebuilds their indexes.
    • Created a member_create chained action that rebuilds the group's index if the action is actually adding a package to a group.

    Notes

    I like the idea of a rebuild module since it leaves the responsibility for updating the index with the entity itself. Instead of calling organization_show in package_create, we simply delegate to the organization the job of rebuilding its own index.

    Testing Notes

    CKAN's fixtures are designed so that each test ensures the database is clean before it runs (instead of the canonical approach of cleaning the database after running). This causes the current fixtures to fail across several test runs, because TestGroupOrOrgSearch will still have leftovers from the last test of the previous run. (Adding a "clean_db" fixture clashes with "group_search_fixtures".)

    Also, CKAN's fixtures are not consistent in the way they add and remove tables in the database. This can cause clean_db to fail when cleaning the pages tables, since they are found in the model metadata but not in the database itself (or vice versa). This scenario can happen when tests are executed individually. To "fix" it, just execute the whole test suite (pytest ckanext/sitesearch).

    To fix this, we force a teardown_class method that cleans the index and the database.

    opened by pdelboca 1
  • [WIP] Initial implementation for logging search terms

    This is a simple implementation of a table to log all the search terms.

    Due to initial requirements, it's necessary to log the user_id, so I'm not sure it's something we want to implement here. (I made it optional just in case.)

    Because the current implementation of site_search calls the other searches, we need to track whether a request is a site_search so that we do not log it twice. I'm using Flask's global object to track whether the request is_site_search.

    opened by pdelboca 1
Owner
Open Knowledge Foundation