Overview

MetaFinder - Metadata search through Google

   _____               __             ___________ .__               .___                   
  /     \     ____   _/  |_  _____    \_   _____/ |__|   ____     __| _/   ____   _______  
 /  \ /  \  _/ __ \  \   __\ \__  \    |    __)   |  |  /    \   / __ |  _/ __ \  \_  __ \ 
/    Y    \ \  ___/   |  |    / __ \_  |     \    |  | |   |  \ / /_/ |  \  ___/   |  | \/ 
\____|__  /  \___  >  |__|   (____  /  \___  /    |__| |___|  / \____ |   \___  >  |__|    
        \/       \/               \/       \/               \/       \/       \/          
        
|_ Author: @JosueEncinar
|_ Description: Search for documents in a domain through Google. The objective is to extract metadata
|_ Usage: python3 metafinder.py -d domain.com -l 100 -o /tmp

Installation:

> pip3 install metafinder

Upgrades are also available using:

> pip3 install metafinder --upgrade

Usage

CLI

metafinder -d domain.com -l 20 -o folder [-t 10] [-v] 

Parameters:

  • d: Target domain to search.
  • l: Maximum number of results to retrieve.
  • o: Path where the report is saved.
  • t: Optional. Number of threads to use (4 by default).
  • v: Optional. Also display the results on screen.

In Code

import metafinder.extractor as metadata_extractor

documents_limit = 5
domain = "target_domain"
data = metadata_extractor.extract_metadata_from_google_search(domain, documents_limit)
for document, info in data.items():
    print(f"{document}:")
    print(f"|_ URL: {info['url']}")
    for metadata, value in info['metadata'].items():
        print(f"|__ {metadata}: {value}")

document_name = "test.pdf"
try:
    metadata_file = metadata_extractor.extract_metadata_from_document(document_name)
    for k, v in metadata_file.items():
        print(f"{k}: {v}")
except FileNotFoundError:
    print("File not found")
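The extracted results can also be persisted to disk. The following sketch is illustrative: the `save_report` helper and the sample dict are not part of MetaFinder, but the dict mirrors the `{document: {"url": ..., "metadata": {...}}}` structure returned by `extract_metadata_from_google_search`.

```python
import json

# Sample data mirroring the structure returned by
# extract_metadata_from_google_search: {document: {"url": ..., "metadata": {...}}}
data = {
    "report.pdf": {
        "url": "https://target_domain/report.pdf",
        "metadata": {"Author": "jdoe", "Producer": "LibreOffice"},
    }
}

def save_report(data: dict, path: str) -> None:
    # Persist the extracted metadata as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

save_report(data, "metafinder_report.json")
```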

Author

This project has been developed by Josué Encinar (@JosueEncinar).

Contributors

Disclaimer!

This Software has been developed for teaching purposes and for use with permission of a potential target. The author is not responsible for any illegitimate use.

Comments
  • Update download.py

    redirect name file

    :pushpin: References

    • Issue: https://github.com/Josue87/MetaFinder/issues/8

    :tophat: What is the goal?

    To better handle the file name

    :white_check_mark: Checklist

    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] I have checked this works with manual QA
    opened by alanEG 3
  • search engines

    These are the results I get:

    Searching in google
    [!] google error GoogleCaptcha, Google Captcha detected

    Searching in bing
    [!] bing error HTTP Error 503: Service Unavailable

    Searching in baidu
    [+] Done

    So for me only Baidu is working.

    opened by truesamurai 2
  • Error downloading (redirect case)

    Hello, here you parse the URL to cut out the file name: https://github.com/Josue87/MetaFinder/blob/ff1a16c12e86969cc167e0fa37102eada8bae343/metafinder/utils/file/download.py#L23-L24. But if the URL is https://www.domain.com/es/photopassbslegales/ there will be a problem, because there is nothing to cut: ['https:', '', 'www.domain.com', 'es', 'photopassbslegales', '']. This happens because you take the URL from https://github.com/Josue87/MetaFinder/blob/ff1a16c12e86969cc167e0fa37102eada8bae343/metafinder/utils/file/download.py#L20. That URL may no longer be valid; when you send the request, it is automatically redirected and response.url is set to the address it was redirected to, so you must take the URL for processing from response.url.
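    A sketch of the suggested fix (`filename_from_url` is an illustrative helper, not MetaFinder's actual code): parse the final, post-redirect URL (`response.url` in requests) and strip trailing slashes before taking the last path segment, so a URL ending in `/` does not yield an empty name.

    ```python
    from urllib.parse import urlparse
    import posixpath

    def filename_from_url(url: str) -> str:
        # Use the final (post-redirect) URL, e.g. response.url from requests.
        path = urlparse(url).path
        # rstrip("/") prevents an empty basename for URLs ending in "/".
        name = posixpath.basename(path.rstrip("/"))
        return name or "index"

    filename_from_url("https://www.domain.com/es/photopassbslegales/")
    ```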

    opened by alanEG 0
  • Setup Collaboration

    :pushpin: References

    :tophat: What is the goal?

    Add proper documentation for Contribution.

    :memo: Notes

    Apart from the contributing notes, there is a new Pull Request template to fill in.

    :art: A picture is worth a thousand words

    (Screenshot: 2021-05-02 at 23:44)

    :white_check_mark: Checklist

    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] I have checked this works with manual QA
    enhancement 
    opened by lucferbux 0
  • Refactor to make the package installable via PyPI

    A small refactor has been performed to add setup.py, which now manages dependencies, package versioning, and so on. Internal imports now use the metafinder.utils.<whatever> syntax. For clarity, the launcher (cli.py) has been separated from the processing(...) method. This brings the package closer to being usable as a library, imported as import metafinder.core.

    Some minor issues have also been addressed:

    • The number of results returned by Google was always one less than requested
    • A new argument has been added to set the output folder
    • The results folder is now created if it does not exist
    • The pathlib package is now used to create the folder
    • The usage message has been clarified
    • Versioning starts at 0.1.0. This may be changed.

    Installation instructions in the README.md have been updated. Note that these instructions won't be ready until the package is uploaded to PyPI.

    opened by febrezo 0
  • [Feature request] Wayback machine and Local files extractor

    Hi,

    Thanks for your great project. I've tested it and got some incredible results in a fast and clean way.

    I've been thinking the following features would be great to be added, if possible:

    • Fetch Wayback documents (or fetching the corresponding urls docs);
    • Fetch Metadata from local files (eg: downloaded previously on a recon).
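    The first request could be sketched against the Wayback Machine's public CDX API (`wayback_document_query` is a hypothetical helper; MetaFinder does not currently implement this feature):

    ```python
    from urllib.parse import urlencode

    WAYBACK_CDX = "http://web.archive.org/cdx/search/cdx"

    def wayback_document_query(domain: str, extensions=("pdf", "doc", "docx")) -> str:
        # Build a CDX API query listing archived URLs under the domain
        # whose original URL ends with one of the document extensions.
        params = {
            "url": f"{domain}/*",
            "output": "text",
            "fl": "original",
            "collapse": "urlkey",  # deduplicate captures of the same URL
            "filter": "original:.*\\.(%s)$" % "|".join(extensions),
        }
        return f"{WAYBACK_CDX}?{urlencode(params)}"
    ```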

    Best Regards.

    opened by landaboot 1
Owner
Josué Encinar
Offensive Security Engineer