UDdup - URLs Deduplication Tool

Overview

The tool takes a list of URLs and removes "duplicate" pages, meaning URLs whose patterns are probably repetitive and point to the same web template.

For example:

https://www.example.com/product/123
https://www.example.com/product/456
https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true

All of the above probably point to the same product "template", so it should be enough to scan only some of these URLs with our various scanners.

The result of the above after UDdup should be:

https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true
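
The idea can be sketched in a few lines of Python. This is only an illustration of pattern-based deduplication, not UDdup's actual implementation: the function names (url_pattern, dedupe) are hypothetical, and the heuristic is an assumption chosen to reproduce the example above. It collapses purely numeric path segments into a placeholder, keeps one URL per distinct set of query parameter names, and drops the parameterless URL when a parameterized variant exists.

```python
from urllib.parse import urlsplit, parse_qs


def url_pattern(url):
    """Return (host, path pattern) with purely numeric path segments collapsed."""
    parts = urlsplit(url)
    segments = ["{id}" if seg.isdigit() else seg for seg in parts.path.split("/")]
    return parts.netloc, "/".join(segments)


def dedupe(urls):
    """Keep one URL per (pattern, set of query parameter names)."""
    groups = {}  # pattern -> {frozenset of parameter names: first URL seen}
    for url in urls:
        names = frozenset(parse_qs(urlsplit(url).query, keep_blank_values=True))
        groups.setdefault(url_pattern(url), {}).setdefault(names, url)

    result = []
    for variants in groups.values():
        # Assumed heuristic: if any variant carries query parameters,
        # the bare (parameterless) URL adds nothing and is dropped.
        if len(variants) > 1:
            variants.pop(frozenset(), None)
        result.extend(variants.values())
    return result


if __name__ == "__main__":
    sample = [
        "https://www.example.com/product/123",
        "https://www.example.com/product/456",
        "https://www.example.com/product/123?is_prod=false",
        "https://www.example.com/product/222?is_debug=true",
    ]
    for url in dedupe(sample):
        print(url)
```

Because the hostname is part of the grouping key in this sketch, identical paths on different hosts are treated as different patterns.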

Why do I need it?

Mostly for a better (automated) reconnaissance process, with less noise for both the tester and the target.

Examples

Take a look at demo.txt, the raw URLs file, and at demo-results.txt, the output it produces.


Installation

With pip (Recommended)

pip install uddup

Manual (from code)

# Clone the repository.
git clone https://github.com/rotemreiss/uddup.git

# Install the Python requirements.
cd uddup
pip install -r requirements.txt

Usage

uddup -u demo.txt -o ./demo-result.txt

More Usage Options

uddup -h

Short Form   Long Form       Description
-h           --help          Show this help message and exit
-u           --urls          File with a list of urls
-o           --output        Save results to a file
-s           --silent        Print only the result URLs
-fp          --filter-path   Filter paths by a given Regex

Filter Paths by Regex

Allows filtering out custom path patterns. For example, to filter out all paths that start with /product, run:

# Single Regex
uddup -u demo.txt -fp "^product"

Input:

https://www.example.com/
https://www.example.com/privacy-policy
https://www.example.com/product/1
https://www.example2.com/product/2
https://www.example3.com/product/4

Output:

https://www.example.com/
https://www.example.com/privacy-policy

Advanced Regex with multiple path filters

uddup -u demo.txt -fp "(^product)|(^category)"
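
The filtering behavior shown above can be approximated in Python as follows. This is a minimal sketch, not the tool's own code: filter_paths is a hypothetical helper, and it assumes the regex is searched against the URL path with the leading slash stripped (which is what makes "^product" match /product/1), ignoring the hostname and query string.

```python
import re
from urllib.parse import urlsplit


def filter_paths(urls, pattern):
    """Return the URLs whose path does NOT match the given regex."""
    regex = re.compile(pattern)
    kept = []
    for url in urls:
        # Assumption: only the path is tested, without its leading slash.
        path = urlsplit(url).path.lstrip("/")
        if not regex.search(path):
            kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [
        "https://www.example.com/",
        "https://www.example.com/privacy-policy",
        "https://www.example.com/product/1",
        "https://www.example2.com/product/2",
        "https://www.example3.com/product/4",
    ]
    for url in filter_paths(sample, "(^product)|(^category)"):
        print(url)
```

A single alternation such as "(^product)|(^category)" keeps the interface to one pattern while still covering several path prefixes.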

Contributing

Feel free to fork the repository and submit pull-requests.


Support

Create new GitHub issue

Want to say thanks? :) Message me on LinkedIn


License

License

Issues
  • Multiple hostnames (domains) which shares the same patterns conflicts

    I found out that I missed a very basic case like:

    https://www.example.com/product/123
    https://www.example2.com/product/123
    

    This currently results in one URL instead of two:

    https://www.example.com/product/123
    bug 
    opened by rotemreiss 1
  • Update setup.py

    opened by rotemreiss 0
  • Support paths filtering by Regex

    Support paths filtering by Regex, as unew does.

    Requirements

    • Support custom regex to be provided by the user

    Known limitations

    • Only the path will be filtered while ignoring the hostname and parameters (may be extended in the future)
    enhancement 
    opened by rotemreiss 0
  • fix bug with unicode char in urls

    This fixes a problem with URLs that contain UTF-8 characters, e.g.:

    echo "http://www.shakedos.com:80/index.php/2010/05/עבודה-עם-שפות-ללא-טבלאות-מוכנות/feed/" > /tmp/urls.txt
    uddup -u /tmp/urls.txt 
    ...
    Traceback (most recent call last):
      File "/usr/local/bin/uddup", line 11, in <module>
        sys.exit(interactive())
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 269, in interactive
        main(args.urls_file, args.output, args.silent, args.filter_path)
      File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 184, in main
        for url in f:
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 100: ordinal not in range(128)
    
    bug good first issue 
    opened by Shaked 0
  • [Feature Request] Multi-threading

    Multi-threading would be great, if it's possible, seeing how it only uses one core ATM.

    opened by gprime31 5
  • [request] - de-duplicate similar paths

    Hi Rotem,

    Currently, uddup is not able to de-duplicate similar paths like the ones below.

    /users/122/edit
    /users/123/edit
    

    The project https://github.com/ameenmaali/urldedupe, which tries to solve similar problems, is able to de-duplicate them. The only issue is that, since it's written in C++, it requires rebuilding the binary for each new machine.

    -- Regards, @bugbaba

    enhancement 
    opened by bugbaba 1
Releases(v0.9.3)
Owner
Rotem Reiss