Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Note: this requires Jina>=2.2.4.
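
To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick. It is an illustration under stated assumptions, not necessarily how the FeatureHasher executor is implemented: the function name hash_features, the whitespace tokenization, the MD5 digest and the signed update are all choices made for this example.

```python
import hashlib
from typing import List


def hash_features(text: str, n_dim: int = 128) -> List[float]:
    """Map whitespace-separated tokens to a fixed-size vector (hypothetical helper)."""
    vec = [0.0] * n_dim
    for token in text.lower().split():
        # Use a stable digest rather than Python's salted built-in hash(),
        # so the same token lands in the same bucket on every run.
        digest = hashlib.md5(token.encode('utf-8')).digest()
        h = int.from_bytes(digest[:8], 'big')
        index = h % n_dim                     # bucket chosen by the hash
        sign = 1.0 if (h >> 63) & 1 else -1.0  # a separate bit picks the sign
        vec[index] += sign
    return vec


vector = hash_features('It is a truth universally acknowledged')
print(len(vector))
```

The output dimension is fixed by n_dim regardless of vocabulary size, so no dictionary has to be built or stored. Distinct tokens can collide in the same bucket; the signed update is a common way to let such collisions partly cancel instead of always accumulating.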

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find the top-K most similar sentences.

from jina import Document, DocumentArray, Flow

# load the full text of Pride and Prejudice from Project Gutenberg
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# split on newlines, drop empty lines, and store the result in a DocumentArray
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           Flow@17400[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen
    *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
    *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
    production, promotion and distribution of Project Gutenberg-tm
    Pride and Prejudice
    By Jane Austen This eBook is for the use of anyone anywhere in the United States and
This eBook is for the use of anyone anywhere in the United States and
    by the awkwardness of the application, and at length wholly
    Elizabeth passed the chief of the night in her sister’s room, and
    the happiest memories in the world. Nothing of the past was
    charities and charitable donations in all 50 states of the United

In practice, you can implement matching and storing with an indexer inside the Flow. This example is for demonstration only, so every operation unrelated to feature hashing is implemented outside the Flow to avoid distraction.
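
For intuition about the matching step in the example, here is a rough pure-Python sketch of top-K retrieval by cosine similarity with the query itself excluded. It is not Jina's implementation of .match; the helper names cosine_similarity and top_k are hypothetical, and the toy vectors stand in for the 128-dim embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_idx: int, vectors: List[Sequence[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs of the k nearest vectors, skipping the query."""
    scored = [
        (i, cosine_similarity(vectors[query_idx], v))
        for i, v in enumerate(vectors)
        if i != query_idx  # same role as exclude_self=True in the example
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
for idx, score in top_k(0, toy_vectors, k=2):
    print(idx, round(score, 3))
```

This brute-force loop compares the query against every other vector, which is fine for a demo of this size; an indexer inside a Flow would typically take over both storage and search.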

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.