Overview

MetaFinder - Metadata search through Google

   _____               __             ___________ .__               .___                   
  /     \     ____   _/  |_  _____    \_   _____/ |__|   ____     __| _/   ____   _______  
 /  \ /  \  _/ __ \  \   __\ \__  \    |    __)   |  |  /    \   / __ |  _/ __ \  \_  __ \ 
/    Y    \ \  ___/   |  |    / __ \_  |     \    |  | |   |  \ / /_/ |  \  ___/   |  | \/ 
\____|__  /  \___  >  |__|   (____  /  \___  /    |__| |___|  / \____ |   \___  >  |__|    
        \/       \/               \/       \/               \/       \/       \/          
        
|_ Author: @JosueEncinar
|_ Description: Search for documents in a domain through Google. The objective is to extract metadata
|_ Usage: python3 metafinder.py -d domain.com -l 100 -o /tmp

Installation:

> pip3 install metafinder

Upgrades are also available using:

> pip3 install metafinder --upgrade

Usage

CLI

metafinder -d domain.com -l 20 -o folder [-t 10] [-v] 

Parameters:

  • d: Target domain to search.
  • l: Maximum number of results to retrieve.
  • o: Path where the report is saved.
  • t: Optional. Number of threads to use (4 by default).
  • v: Optional. Also display the results on screen.

In Code

import metafinder.extractor as metadata_extractor

documents_limit = 5
domain = "target_domain"
data = metadata_extractor.extract_metadata_from_google_search(domain, documents_limit)
for document, info in data.items():
    print(f"{document}:")
    print(f"|_ URL: {info['url']}")
    for metadata, value in info['metadata'].items():
        print(f"|__ {metadata}: {value}")

document_name = "test.pdf"
try:
    metadata_file = metadata_extractor.extract_metadata_from_document(document_name)
    for k, v in metadata_file.items():
        print(f"{k}: {v}")
except FileNotFoundError:
    print("File not found")
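The extracted results can also be persisted to disk. The following sketch is illustrative: the `save_report` helper and the sample dict are not part of MetaFinder, but the dict mirrors the `{document: {"url": ..., "metadata": {...}}}` structure returned by `extract_metadata_from_google_search`.

```python
import json

# Sample data mirroring the structure returned by
# extract_metadata_from_google_search: {document: {"url": ..., "metadata": {...}}}
data = {
    "report.pdf": {
        "url": "https://target_domain/report.pdf",
        "metadata": {"Author": "jdoe", "Producer": "LibreOffice"},
    }
}

def save_report(data: dict, path: str) -> None:
    # Persist the extracted metadata as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

save_report(data, "metafinder_report.json")
```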

Author

This project has been developed by Josué Encinar (@JosueEncinar).

Contributors

Disclaimer!

This Software has been developed for teaching purposes and for use with permission of a potential target. The author is not responsible for any illegitimate use.

Comments
  • Update download.py

    redirect name file

    :pushpin: References

    • Issue: https://github.com/Josue87/MetaFinder/issues/8

    :tophat: What is the goal?

    To better handle the file name

    :white_check_mark: Checklist

    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] I have checked this works with manual QA
    opened by alanEG 3
  • search engines

    These are the results I get:

    Searching in google
    [!] google error GoogleCaptcha, Google Captcha detected

    Searching in bing
    [!] bing error HTTP Error 503: Service Unavailable

    Searching in baidu
    [+] Done

    So for me only Baidu is working.

    opened by truesamurai 2
  • Error downloading (redirect case)

    Hello, here you parse the URL to cut out the file name: https://github.com/Josue87/MetaFinder/blob/ff1a16c12e86969cc167e0fa37102eada8bae343/metafinder/utils/file/download.py#L23-L24. But if the URL is https://www.domain.com/es/photopassbslegales/ there will be a problem, because there is nothing to cut: ['https:', '', 'www.domain.com', 'es', 'photopassbslegales', '']. This happens because you take the URL from https://github.com/Josue87/MetaFinder/blob/ff1a16c12e86969cc167e0fa37102eada8bae343/metafinder/utils/file/download.py#L20. That URL may no longer be valid; when you send the request, it is automatically redirected and response.url is set to the address it was redirected to, so you must take the URL for processing from response.url.
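    A sketch of the suggested fix (`filename_from_url` is an illustrative helper, not MetaFinder's actual code): parse the final, post-redirect URL (`response.url` in requests) and strip trailing slashes before taking the last path segment, so a URL ending in `/` does not yield an empty name.

    ```python
    from urllib.parse import urlparse
    import posixpath

    def filename_from_url(url: str) -> str:
        # Use the final (post-redirect) URL, e.g. response.url from requests.
        path = urlparse(url).path
        # rstrip("/") prevents an empty basename for URLs ending in "/".
        name = posixpath.basename(path.rstrip("/"))
        return name or "index"

    filename_from_url("https://www.domain.com/es/photopassbslegales/")
    ```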

    opened by alanEG 0
  • Setup Collaboration

    :pushpin: References

    :tophat: What is the goal?

    Add proper documentation for Contribution.

    :memo: Notes

    Apart from the contributing notes, there is a new Pull Request template to fill in.

    :art: A picture is worth a thousand words

    (Screenshot: 2021-05-02 at 23:44)

    :white_check_mark: Checklist

    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] I have checked this works with manual QA
    enhancement 
    opened by lucferbux 0
  • Refactor to make the package installable via PyPI

    A small refactor has been performed to add setup.py, which now manages dependencies, package versioning, and so on. Internal imports now use the metafinder.utils.<whatever> syntax. For clarity, the launcher (cli.py) has been separated from the processing(...) method. This brings the package closer to being usable as a library, imported as import metafinder.core.

    Some minor issues have also been addressed:

    • The number of results returned by Google was always one less than requested
    • A new argument has been added to set the output folder
    • The results folder is now created if it does not exist
    • The pathlib package is now used to create the folder
    • The usage message has been clarified
    • Versioning starts at 0.1.0. This may be changed.

    Installation instructions in the README.md have been updated. Note that these instructions won't be ready until the package is uploaded to PyPI.

    opened by febrezo 0
  • [Feature request] Wayback machine and Local files extractor

    Hi,

    Thanks for your great project. I've tested it and got some incredible results in a fast and clean way.

    I've been thinking the following features would be great to be added, if possible:

    • Fetch Wayback documents (or fetching the corresponding urls docs);
    • Fetch Metadata from local files (eg: downloaded previously on a recon).
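    The first request could be sketched against the Wayback Machine's public CDX API (`wayback_document_query` is a hypothetical helper; MetaFinder does not currently implement this feature):

    ```python
    from urllib.parse import urlencode

    WAYBACK_CDX = "http://web.archive.org/cdx/search/cdx"

    def wayback_document_query(domain: str, extensions=("pdf", "doc", "docx")) -> str:
        # Build a CDX API query listing archived URLs under the domain
        # whose original URL ends with one of the document extensions.
        params = {
            "url": f"{domain}/*",
            "output": "text",
            "fl": "original",
            "collapse": "urlkey",  # deduplicate captures of the same URL
            "filter": "original:.*\\.(%s)$" % "|".join(extensions),
        }
        return f"{WAYBACK_CDX}?{urlencode(params)}"
    ```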

    Best Regards.

    opened by landaboot 1
Owner
Josué Encinar
Offensive Security Engineer