UDdup - URLs Deduplication Tool

Rotem Reiss

Last update: Dec 21, 2022

Overview

The tool takes a list of URLs and removes "duplicate" pages, meaning URLs whose patterns are probably repetitive and point to the same web template.

For example:

https://www.example.com/product/123
https://www.example.com/product/456
https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true

All of the above probably point to the same product "template". Therefore, it should be enough to scan only some of these URLs with our various scanners.

The result of the above after UDdup should be:

https://www.example.com/product/123?is_prod=false
https://www.example.com/product/222?is_debug=true
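
The selection above can be approximated in a few lines of Python. This is a minimal sketch and not UDdup's actual implementation: it assumes purely numeric path segments are IDs, groups URLs by hostname and path pattern, keeps one URL per distinct set of query parameter names, and drops parameter-less URLs when a parameterized variant of the same pattern exists. The names path_pattern and dedupe are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def path_pattern(path):
    """Replace purely numeric path segments with a placeholder (assumed to be IDs)."""
    segments = path.strip("/").split("/")
    return "/".join("{id}" if seg.isdigit() else seg for seg in segments)


def dedupe(urls):
    """Keep one URL per (hostname, path pattern, set of parameter names)."""
    groups = {}  # (hostname, pattern) -> {frozenset of parameter names: first URL seen}
    for url in urls:
        parts = urlsplit(url)
        key = (parts.netloc, path_pattern(parts.path))
        names = frozenset(
            name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
        )
        groups.setdefault(key, {}).setdefault(names, url)

    result = []
    for variants in groups.values():
        has_parameterized = any(names for names in variants)
        for names, url in variants.items():
            # Assumption: a parameter-less URL adds nothing once a
            # parameterized URL of the same pattern is kept.
            if has_parameterized and not names:
                continue
            result.append(url)
    return result


if __name__ == "__main__":
    for kept in dedupe([
        "https://www.example.com/product/123",
        "https://www.example.com/product/456",
        "https://www.example.com/product/123?is_prod=false",
        "https://www.example.com/product/222?is_debug=true",
    ]):
        print(kept)
```

Because the hostname is part of the grouping key, identical paths on different domains are kept separately in this sketch.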

Why do I need it?

Mostly for a better (automated) reconnaissance process, with less noise (for both the tester and the target).

Examples

Take a look at demo.txt, the raw URLs file, and at demo-results.txt, the output it produces.

Installation

With pip (Recommended)

pip install uddup

Manual (from code)

# Clone the repository.
git clone https://github.com/rotemreiss/uddup.git

# Install the Python requirements.
cd uddup
pip install -r requirements.txt

Usage

uddup -u demo.txt -o ./demo-result.txt

More Usage Options

uddup -h

Short Form	Long Form	Description
-h	--help	Show this help message and exit
-u	--urls	File with a list of URLs
-o	--output	Save results to a file
-s	--silent	Print only the result URLs
-fp	--filter-path	Filter paths by a given Regex

Filter Paths by Regex

Allows filtering out custom path patterns. For example, to filter out all paths that start with /product, run:

# Single Regex
uddup -u demo.txt -fp "^product"

Input:

https://www.example.com/
https://www.example.com/privacy-policy
https://www.example.com/product/1
https://www.example2.com/product/2
https://www.example3.com/product/4

Output:

https://www.example.com/
https://www.example.com/privacy-policy

Advanced Regex with multiple path filters

uddup -u demo.txt -fp "(^product)|(^category)"
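
The filtering behaviour shown above can be sketched in Python as follows. This is an illustration under assumptions, not UDdup's code: it assumes the regex is applied to the URL path without its leading slash (inferred from "^product" matching /product/1) and that the hostname and query string are ignored. The function name filter_paths is hypothetical.

```python
import re
from urllib.parse import urlsplit


def filter_paths(urls, pattern):
    """Drop every URL whose path (without the leading slash) matches the regex."""
    regex = re.compile(pattern)
    kept = []
    for url in urls:
        path = urlsplit(url).path.lstrip("/")
        if regex.search(path):
            continue  # path matches the filter, so the URL is removed
        kept.append(url)
    return kept


if __name__ == "__main__":
    urls = [
        "https://www.example.com/",
        "https://www.example.com/privacy-policy",
        "https://www.example.com/product/1",
        "https://www.example2.com/product/2",
        "https://www.example3.com/product/4",
    ]
    # Single regex, then several path filters combined with alternation.
    print(filter_paths(urls, r"^product"))
    print(filter_paths(urls, r"(^product)|(^category)"))
```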

Contributing

Feel free to fork the repository and submit pull-requests.

Support

Create a new GitHub issue

Want to say thanks? :) Message me on LinkedIn

License

MIT license

Comments

cant run uddup

This tool is great and really useful, but I noticed that the uddup executable script is installed to the /usr/local/bin directory, which sometimes doesn't work because it needs to be in /usr/bin to be executed; I don't know why. I copied the script to /usr/bin and it worked perfectly. I'm using the Kali Linux subsystem on Windows 11. Sorry if there's a problem with my issue report; it's my first issue report on GitHub :V Thanks.
question

opened by siratsami 2
Multiple hostnames (domains) which shares the same patterns conflicts

I found out that I missed a very basic case like:

https://www.example.com/product/123
https://www.example2.com/product/123

This currently results in one URL instead of two:

https://www.example.com/product/123
bug
opened by rotemreiss 1

fix bug with unicode char in urls

This fixes a problem with URLs with UTF8 chars, e.g:

echo "http://www.shakedos.com:80/index.php/2010/05/עבודה-עם-שפות-ללא-טבלאות-מוכנות/feed/" > /tmp/urls.txt
uddup -u /tmp/urls.txt 
...
Traceback (most recent call last):
  File "/usr/local/bin/uddup", line 11, in <module>
    sys.exit(interactive())
  File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 269, in interactive
    main(args.urls_file, args.output, args.silent, args.filter_path)
  File "/usr/local/lib/python3.5/dist-packages/uddup/main.py", line 184, in main
    for url in f:
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 100: ordinal not in range(128)

bug good first issue

opened by Shaked 0
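
The traceback above shows the input file being decoded with the ASCII codec while iterating over its lines. One common way to avoid this class of error is to open the file with an explicit UTF-8 encoding instead of relying on the platform default. The sketch below illustrates that approach; it is an assumption about a reasonable fix, not the actual patch, and read_urls is a hypothetical helper.

```python
def read_urls(path):
    """Read non-empty lines from a URL list, decoding the file as UTF-8."""
    # An explicit encoding avoids falling back to ASCII on systems
    # whose default locale encoding is not UTF-8.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```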

Support paths filtering by Regex
Support paths filtering by Regex, as unew does.

Requirements

Support custom regex to be provided by the user

Known limitations

Only the path will be filtered while ignoring the hostname and parameters (may be extended in the future)

enhancement
opened by rotemreiss 0
[request] - de-duplicate similar paths
Hi Rotem,

Currently, uddup is not able to de-duplicate similar paths like below.

/users/122/edit
/users/123/edit

This project, https://github.com/ameenmaali/urldedupe, tries to solve similar problems and is able to de-duplicate them. The only issue is that, since it's written in C++, it requires rebuilding the binary for each new machine.

-- Regards, @bugbaba
enhancement
opened by bugbaba 1

Releases(v0.9.3)

v0.9.3(Feb 28, 2021)
Bug Fixes:

#5 Fix a bug with Unicode char in URLs (UTF-8 support)

v0.9.2(Feb 7, 2021)
Enhancements:

#3 [feature request] Support paths filtering by Regex

Bug Fixes:

#2 Multiple hostnames (domains) which shares the same patterns conflicts

v0.9.1.1(Feb 5, 2021)

Stable release with unit-tests.
0.9.1(Feb 5, 2021)

Owner

Rotem Reiss
