UDdup - URLs Deduplication Tool

Overview

The tool takes a list of URLs and removes "duplicate" pages, meaning URLs whose patterns are probably repetitive and point to the same web template.

For example:

https://www.example.com/product/123
https://www.example.com/product/456
https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true

All of the above probably point to the same product "template". Therefore it should be enough for our various scanners to scan only some of these URLs.

The result of the above after UDdup should be:

https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true
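
To make the idea concrete, here is a minimal Python sketch of pattern-based deduplication. It is an illustration only and not UDdup's actual implementation: the helper names (path_pattern, dedupe), the "{id}" placeholder, and the rule of preferring URLs that carry query parameters are all assumptions, chosen so that the sketch reproduces the example above.

```python
from urllib.parse import parse_qs, urlsplit


def path_pattern(path):
    """Replace purely numeric path segments with a placeholder (assumed rule)."""
    return "/".join("{id}" if seg.isdigit() else seg for seg in path.split("/"))


def dedupe(urls):
    """Keep one URL per (host, path pattern, set of query parameter names)."""
    groups = {}
    for url in urls:
        parts = urlsplit(url)
        # The hostname is part of the key, so different hosts never collapse.
        key = (parts.netloc, path_pattern(parts.path))
        param_names = frozenset(parse_qs(parts.query, keep_blank_values=True))
        # The first URL seen for each parameter-name set wins.
        groups.setdefault(key, {}).setdefault(param_names, url)

    result = []
    for variants in groups.values():
        with_params = [u for names, u in variants.items() if names]
        # Prefer URLs that carry query parameters; fall back to the bare URL.
        result.extend(with_params or variants.values())
    return result


urls = [
    "https://www.example.com/product/123",
    "https://www.example.com/product/456",
    "https://www.example.com/product/123?is_prod=false",
    "https://www.example.com/product/222?is_debug=true",
]
print("\n".join(dedupe(urls)))
```

Grouping by parameter names rather than values is a design choice in this sketch: two URLs that differ only in a parameter's value are treated as the same template.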

Why do I need it?

Mostly for a better (automated) reconnaissance process, with less noise for both the tester and the target.

Examples

Take a look at demo.txt, the raw URLs file, and at demo-results.txt, the result UDdup produces from it.


Installation

With pip (Recommended)

pip install uddup

Manual (from code)

# Clone the repository.
git clone https://github.com/rotemreiss/uddup.git

# Install the Python requirements.
cd uddup
pip install -r requirements.txt

Usage

uddup -u demo.txt -o ./demo-result.txt

More Usage Options

uddup -h

Short Form    Long Form        Description
-h            --help           Show this help message and exit
-u            --urls           File with a list of urls
-o            --output         Save results to a file
-s            --silent         Print only the result URLs
-fp           --filter-path    Filter paths by a given Regex

Filter Paths by Regex

Filters out URLs whose path matches a custom pattern. For example, to filter out all paths that start with /product, run:

# Single Regex
uddup -u demo.txt -fp "^product"

Input:

https://www.example.com/
https://www.example.com/privacy-policy
https://www.example.com/product/1
https://www.example2.com/product/2
https://www.example3.com/product/4

Output:

https://www.example.com/
https://www.example.com/privacy-policy

Advanced Regex with multiple path filters

uddup -u demo.txt -fp "(^product)|(^category)"
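
The filtering behaviour shown above can be approximated with a short Python sketch. It is not UDdup's code: the function name filter_paths is hypothetical, and the sketch assumes the regex is tested against the URL path with its leading slash stripped, which is what the "^product" example suggests.

```python
import re
from urllib.parse import urlsplit


def filter_paths(urls, pattern):
    """Drop every URL whose path (without the leading slash) matches pattern."""
    regex = re.compile(pattern)
    return [u for u in urls if not regex.search(urlsplit(u).path.lstrip("/"))]


urls = [
    "https://www.example.com/",
    "https://www.example.com/privacy-policy",
    "https://www.example.com/product/1",
    "https://www.example2.com/product/2",
    "https://www.example3.com/product/4",
]

# Single regex, as in the first example.
print(filter_paths(urls, "^product"))

# Multiple path filters combined with alternation.
print(filter_paths(urls, "(^product)|(^category)"))
```

Only the path is inspected here; the hostname and query string play no part in the match.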

Contributing

Feel free to fork the repository and submit pull-requests.


Support

Create a new GitHub issue

Want to say thanks? :) Message me on LinkedIn


License

Comments
  • cant run uddup

    This tool is great and really useful, but I noticed that the uddup executable script gets placed in the /usr/local/bin directory, which sometimes doesn't work; for some reason it needs to be in /usr/bin to be executed. I copied the script to /usr/bin and it worked perfectly. I'm using the Kali Linux subsystem on Windows 11. Sorry if there's a problem with my issue report, it's my first issue report on GitHub :V Thanks.

    question 
    opened by siratsami 2
  • Multiple hostnames (domains) which shares the same patterns conflicts

    I found out that I missed a very basic case like:

    https://www.example.com/product/123
    https://www.example2.com/product/123
    

    This currently results in one URL instead of two:

    https://www.example.com/product/123
    bug 
    opened by rotemreiss 1
  • fix bug with unicode char in urls

    This fixes a problem with URLs that contain UTF-8 characters, e.g.:

    echo "http://www.shakedos.com:80/index.php/2010/05/עבודה-עם-שפות-ללא-טבלאות-מוכנות/feed/" > /tmp/urls.txt
    uddup -u /tmp/urls.txt 
    ...
    Traceback (most recent call last):
      File "/usr/local/bin/uddup", line 11, in <module>
        sys.exit(interactive())
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 269, in interactive
        main(args.urls_file, args.output, args.silent, args.filter_path)
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 184, in main
        for url in f:
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 100: ordinal not in range(128)
    
    bug good first issue 
    opened by Shaked 0
  • Support paths filtering by Regex

    Support paths filtering by Regex, as unew does.

    Requirements

    • Support custom regex to be provided by the user

    Known limitations

    • Only the path will be filtered while ignoring the hostname and parameters (may be extended in the future)
    enhancement 
    opened by rotemreiss 0
  • [request] - de-duplicate similar paths

    Hi Rotem,

    Currently, uddup is not able to de-duplicate similar paths like below.

    /users/122/edit
    /users/123/edit
    

    The project https://github.com/ameenmaali/urldedupe, which tries to solve similar problems, is able to de-duplicate them. The only issue is that, since it's written in C++, the binary has to be rebuilt for each new machine.

    -- Regards, @bugbaba

    enhancement 
    opened by bugbaba 1
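
The UnicodeDecodeError reported in the comments above is typical of reading a file with the locale's default ASCII codec. A common way to avoid this class of error is to open the URLs file with an explicit UTF-8 encoding. The sketch below illustrates that approach; it is not necessarily the change made in this project, and the function name read_urls is hypothetical.

```python
import os
import tempfile


def read_urls(path):
    """Read one URL per line, decoding the file as UTF-8 regardless of locale."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


url = "http://www.shakedos.com:80/index.php/2010/05/עבודה-עם-שפות-ללא-טבלאות-מוכנות/feed/"

# Write the URL to a temporary file, then read it back with the helper.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(url + "\n")

loaded = read_urls(tmp.name)
os.remove(tmp.name)
print(loaded == [url])
```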
Releases(v0.9.3)
Owner
Rotem Reiss