postgres-full-text-search

Postgres full text search options (tsearch, trigram) examples.

Create DB

>> CREATE DATABASE ftdb;

To feed db with an example dataset (dataset.txt, 100k rows, 15 words each one) I used python init_db.py script.

Full text search using simple `ilike`

>> EXPLAIN ANALYZE
   SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.473..87.474 rows=0 loops=1)
   ->  Seq Scan on document  (cost=0.00..3734.02 rows=1 width=105) (actual time=87.466..87.466 rows=0 loops=1)
         Filter: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
         Rows Removed by Filter: 100001
 Planning Time: 2.193 ms
 Execution Time: 87.500 ms

Full text search using `ilike` supported by trigram index

What is a trigram? See this example:

>> CREATE EXTENSION pg_trgm;
CREATE EXTENSION
>> select show_trgm('fielded');
                show_trgm
-----------------------------------------
 {"  f"," fi",ded,"ed ",eld,fie,iel,lde}

We can improve ilike performance using trigram index, e.g. gin_trgm_ops.

>> CREATE INDEX  ix_document_text_trigram ON document USING gin (text gin_trgm_ops) where language = 'en';
CREATE INDEX

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      text ilike '%field%'
      AND text ilike '%window%'
      AND text ilike '%lamp%'
      AND text ilike '%research%'
      AND language = 'en'
    LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=176.00..180.02 rows=1 width=105) (actual time=1.473..1.474 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=176.00..180.02 rows=1 width=105) (actual time=1.470..1.471 rows=0 loops=1)
         Recheck Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_document_text_trigram  (cost=0.00..176.00 rows=1 width=0) (actual time=1.466..1.466 rows=0 loops=1)
               Index Cond: ((text ~~* '%field%'::text) AND (text ~~* '%window%'::text) AND (text ~~* '%lamp%'::text) AND (text ~~* '%research%'::text))
 Planning Time: 2.389 ms
 Execution Time: 1.524 ms

Create non-default language configuration for tsearch full text search

Postgres does not provide support for many languages by default. However, you can setup the configuration quite easily. You just need additional dictionary files. Here is an example for polish language. Polish dictionary files can be downloaded from: https://github.com/judehunter/polish-tsearch.

polish.affix, polish.stop and polish.dict files should be copied to postgresql tsearch_data location, e.g. /usr/share/postgresql/13/tsearch_data.

There also must be created a configuration (see the docs) inside database:

>> DROP TEXT SEARCH DICTIONARY IF EXISTS polish_hunspell CASCADE;
   CREATE TEXT SEARCH DICTIONARY polish_hunspell (
    TEMPLATE  = ispell,
    DictFile  = polish,
    AffFile   = polish,
    StopWords = polish
  );
  CREATE TEXT SEARCH CONFIGURATION public.polish (
    COPY = pg_catalog.english
  );
  ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING
    FOR
        asciiword, asciihword, hword_asciipart,  word, hword, hword_part
    WITH
        polish_hunspell, simple;

You need these files and configuration because full text search engine uses lexeme comparing to find best matches (both query pattern and stored text are lexemized):

>> SELECT to_tsquery('english', 'fielded'), to_tsvector('english', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                    to_tsvector
------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 'field'    | '19':16 'bat':12 'dead':8 'degre':1 'depth':5 'field':15 'lamp':13 'men':6 'put':14 'ranch':2 'tall':4 'time':3 'underlin':11 'wast':10 'window':9

If you cannot provide dictionary files you can use full text in "simple" form (without transformation to lexeme):

>> SELECT to_tsquery('simple', 'fielded'), to_tsvector('simple', text)
   FROM document
   LIMIT 1;
 to_tsquery |                                                                             to_tsvector
------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 'fielded'  | '19':16 'bat':12 'below':7 'dead':8 'degree':1 'depth':5 'field':15 'lamp':13 'men':6 'putting':14 'ranch':2 'tall':4 'time':3 'underline':11 'waste':10 'window':9

Tsearch full text search without stored index

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
   LIMIT 1;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.802..491.352 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18298.49 rows=1 width=103) (actual time=489.800..491.349 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17298.39 rows=1 width=103) (actual time=486.644..486.644 rows=0 loops=2)
               Filter: (((language)::text = 'en'::text) AND (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery))
               Rows Removed by Filter: 50000
 Planning Time: 0.272 ms
 Execution Time: 491.376 ms
(9 rows)

Tsearch full text search with stored partial index ('en')

>> CREATE INDEX ix_en_document_tsvector_text ON public.document USING gin (to_tsvector('english'::regconfig, text)) WHERE language = 'en';
CREATED INDEX
>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
   LIMIT 1;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.120..488.569 rows=0 loops=1)
   ->  Gather  (cost=1000.00..18151.43 rows=1 width=103) (actual time=487.117..488.567 rows=0 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Parallel Seq Scan on document  (cost=0.00..17151.33 rows=1 width=103) (actual time=484.418..484.419 rows=0 loops=2)
               Filter: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
               Rows Removed by Filter: 50000
 Planning Time: 0.193 ms
 Execution Time: 488.596 ms

No difference? Index has not been used... Why is it not working? Ohh, looks to the partial index docs:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

We have to add to query a condition that was used to create partial index: document.language = 'en':

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
      AND language = 'en'
   LIMIT 1;                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=64.00..68.27 rows=1 width=103) (actual time=0.546..0.548 rows=0 loops=1)
   ->  Bitmap Heap Scan on document  (cost=64.00..68.27 rows=1 width=103) (actual time=0.544..0.545 rows=0 loops=1)
         Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery) AND ((language)::text = 'en'::text))
         ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..64.00 rows=1 width=0) (actual time=0.540..0.540 rows=0 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.244 ms
 Execution Time: 0.590 ms

Tsearch full text search with partial words

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 1;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on document  (cost=168.00..172.27 rows=1 width=102) (actual time=5.207..5.210 rows=4 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery) AND ((language)::text = 'en'::text))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on ix_en_document_tsvector_text  (cost=0.00..168.00 rows=1 width=0) (actual time=5.202..5.202 rows=4 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''l'':*'::tsquery)
 Planning Time: 0.240 ms
 Execution Time: 5.240 ms

>> SELECT id,  text
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & l:*')
      AND language = 'en'
   LIMIT 20;
  id   |                                                   text
-------+-----------------------------------------------------------------------------------------------------------
     1 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
 20152 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
 21478 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
 30293 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic+

Tsearch full text search results ranking

>> SELECT
     id,
     ts_rank_cd(to_tsvector('english', text), to_tsquery('english', 'fielded & wind:*')) rank,
     text
    FROM public.document
    WHERE to_tsvector('english', text) @@ to_tsquery('english', 'fielded & wind:*')
    ORDER BY rank DESC
    LIMIT 20;
   id   |    rank     |                                                   text
--------+-------------+-----------------------------------------------------------------------------------------------------------
 100002 |         0.1 | fielded window
   9376 |        0.05 | Own mouse girl effect surprise physical newspaper forgot eat upper field element window simply unhappy   +
  96597 |        0.05 | Opinion fastened pencil rear more theory size window heading field understanding farm up position attack +
  44626 | 0.033333335 | Symbol each halfway window swam spider field page shinning donkey chose until cow cabin congress         +
  80922 | 0.033333335 | Victory famous field shelter girl wind adventure he divide rear tip few studied ruler judge              +
  30293 |       0.025 | Pencil seen engineer labor image entire smallest serve field should riding smaller window imagine traffic+
      1 | 0.016666668 | degree ranch time tall depth men below dead window waste underline bat lamp putting field               +
  21478 | 0.016666668 | Dried symbol willing design managed shade window pick share faster education drive field land everybody  +
  60059 | 0.016666668 | However hungry make proud kids come willing field officer row above highest round wind mile              +
  26001 | 0.014285714 | Earth earlier pocket might sense window way frog fire court family mouth field somebody recognize        +
  20152 | 0.014285714 | Law pony follow memory star whatever window sets oxygen longer word whom glass field actual              +
  37470 |      0.0125 | Farm weight balloon buried wind water donkey grain pig week should damage field was he                   +
  49433 |        0.01 | Wind scientist leaving atom year bad child drink shore spirit field facing indicate wagon here           +
  37851 | 0.007142857 | Field cloud you wife rhythm upward applied weigh continued property replace ahead forgotten trip window  +

text='fielded window' record was added manually to show best match result.

GIST vs GIN

We have created GIN index. But there is also GIST index option. Which one is better? It depends...

>> EXPLAIN ANALYZE SELECT text, language
   FROM public.document
   WHERE
      to_tsvector('english', text) @@ to_tsquery('english', 'fielded & window & lamp & depth & test ')
      AND language = 'en'
   LIMIT 1;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.28..8.30 rows=1 width=103) (actual time=2.699..2.700 rows=0 loops=1)
   ->  Index Scan using ix_en_document_tsvector_text on document  (cost=0.28..8.30 rows=1 width=103) (actual time=2.697..2.697 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, text) @@ '''field'' & ''window'' & ''lamp'' & ''depth'' & ''test'''::tsquery)
 Planning Time: 0.274 ms
 Execution Time: 2.730 ms

GIN seems to be a little bit faster. I don't think I could explain it better than the docs already does:

In choosing which index type to use, GiST or GIN, consider these performance differences:

GIN index lookups are about three times faster than GiST

GIN indexes take about three times longer to build than GiST

GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)

GIN indexes are two-to-three times larger than GiST indexes

Full text search for flask.

flask-msearch Installation To install flask-msearch: pip install flask-msearch # when MSEARCH_BACKEND = "whoosh" pip install whoosh blinker # when MSE

197 Dec 29, 2022

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

152 Oct 29, 2022

Full-text multi-table search application for Django. Easy to install and use, with good performance.

django-watson django-watson is a fast multi-model full-text search plugin for Django. It is easy to install and use, and provides high quality search

1.1k Jan 3, 2023

Deep Image Search is an AI-based image search engine that includes deep transfor learning features Extraction and tree-based vectorized search.

Deep Image Search - AI-Based Image Search Engine Deep Image Search is an AI-based image search engine that includes deep transfer learning features Ex

139 Jan 1, 2023

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

CTC Decoding Algorithms Update 2021: installable Python package Python implementation of some common Connectionist Temporal Classification (CTC) decod

736 Jan 3, 2023

Open Source Discord bot with many cool features like Weather, Balance, Avatar, User, Server, RP-commands, Gif search, YouTube search, VK post search etc.

Сокобот Дискорд бот с открытым исходным кодом. Содержит в себе экономику, полезные команды (!аватар, !юзер, !сервер и тд.), рп-команды (!обнять, !глад

2 Jan 16, 2022

Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Realistic Full-Body Anonymization with Surface-Guided GANs This is the official

30 Nov 18, 2022

Postgres CLI with autocompletion and syntax highlighting

A REPL for Postgres This is a postgres client that does auto-completion and syntax highlighting. Home Page: http://pgcli.com MySQL Equivalent: http://

10.8k Jan 2, 2023

🍯 16 honeypots in a single pypi package (DNS, HTTP Proxy, HTTP, HTTPS, SSH, POP3, IMAP, STMP, VNC, SMB, SOCKS5, Redis, TELNET, Postgres & MySQL)

Easy to setup customizable honeypots for monitoring network traffic, bots activities and username\password credentials. The current available honeypot

259 Dec 31, 2022

10.8k Dec 30, 2022

Continuous Archiving for Postgres

WAL-E Continuous archiving for Postgres WAL-E is a program designed to perform continuous archiving of PostgreSQL WAL files and base backups. To corre

3.4k Dec 30, 2022

The ormar package is an async mini ORM for Python, with support for Postgres, MySQL, and SQLite.

python async mini orm with fastapi in mind and pydantic validation

1.2k Jan 5, 2023

A basic CRUD application built in flask using postgres as database

flask-postgres-CRUD A basic CRUD application built in flask using postgres as database Taks list Dockerfile Initial docker-compose - It is working Dat

9 Sep 25, 2022

Dockerizing Django with Postgres, Gunicorn, Nginx and Certbot. A fully Django starter project.

Dockerizing Django with Postgres, Gunicorn, Nginx and Certbot 🚀 Features A Django stater project with fully basic requirements for a production-ready

8 Jun 27, 2022

A server shell for you to play with Powered by Django + Nginx + Postgres + Bootstrap + Celery.

1 Jan 10, 2022

CLI utility for updating the EVE Online static data export in a postgres database

EVE SDE Postgres updater CLI utility for updating the EVE Online static data export postgres database. This has been tested with the Fuzzwork postgres

1 Oct 29, 2021

a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

7 Nov 27, 2022

FastAPI + Postgres + Docker Compose + Heroku Deploy Template

FastAPI + Postgres + Docker Compose + Heroku Deploy ⚠️ For educational purpose only. Not ready for production use YET Features FastAPI with Postgres s

12 Dec 27, 2022

Sample FastAPI project that uses async SQLAlchemy, SQLModel, Postgres, Alembic, and Docker.

FastAPI + SQLModel + Alembic Sample FastAPI project that uses async SQLAlchemy, SQLModel, Postgres, Alembic, and Docker. Want to learn how to build th

228 Jan 2, 2023

Postgres full text search options (tsearch, trigram) examples

Related tags

Overview

postgres-full-text-search

Create DB

Full text search using simple ilike

Full text search using ilike supported by trigram index

Create non-default language configuration for tsearch full text search

Tsearch full text search without stored index

Tsearch full text search with stored partial index ('en')

Tsearch full text search with partial words

Tsearch full text search results ranking

GIST vs GIN

You might also like...

Full text search for flask.

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

Full-text multi-table search application for Django. Easy to install and use, with good performance.

Deep Image Search is an AI-based image search engine that includes deep transfor learning features Extraction and tree-based vectorized search.

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

Open Source Discord bot with many cool features like Weather, Balance, Avatar, User, Server, RP-commands, Gif search, YouTube search, VK post search etc.

Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Postgres CLI with autocompletion and syntax highlighting

🍯 16 honeypots in a single pypi package (DNS, HTTP Proxy, HTTP, HTTPS, SSH, POP3, IMAP, STMP, VNC, SMB, SOCKS5, Redis, TELNET, Postgres & MySQL)

Postgres CLI with autocompletion and syntax highlighting

Continuous Archiving for Postgres

The ormar package is an async mini ORM for Python, with support for Postgres, MySQL, and SQLite.

A basic CRUD application built in flask using postgres as database

Dockerizing Django with Postgres, Gunicorn, Nginx and Certbot. A fully Django starter project.

A server shell for you to play with Powered by Django + Nginx + Postgres + Bootstrap + Celery.

CLI utility for updating the EVE Online static data export in a postgres database

a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

FastAPI + Postgres + Docker Compose + Heroku Deploy Template

Sample FastAPI project that uses async SQLAlchemy, SQLModel, Postgres, Alembic, and Docker.

Owner

Jarosław Orzeł

This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

Zora is a python program that searches for GeoLocation info for given CIDR networks , with options to search with API or without API

ColossalAI-Examples - Examples of training models with hybrid parallelism using ColossalAI

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Code for "Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search"

Full text search for flask.

Full-text multi-table search application for Django. Easy to install and use, with good performance.

Full text search using simple `ilike`

Full text search using `ilike` supported by trigram index