Python script for finding duplicate images within a folder.

Overview

Duplicate Image Finder (DIF)


Tired of going through all images in a folder and comparing them manually to check if they are duplicates? The Duplicate Image Finder (DIF) for Python automates this task for you!


Description

The DIF searches for images in a specified target folder, compares the images it finds, and checks whether they are duplicates. It then outputs the image files classified as duplicates, together with the filenames of the images with the lowest resolution, so you know which of the duplicate images are safe to delete. You can then either delete them manually or let the DIF delete them for you.

Basic Usage

Use the following function to make difPy search for duplicates in the specified folder:

    from difPy import dif
    dif.compare_images("C:/Path/to/Folder/")
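Under the hood, difPy scores each pair of (resized) images with mean squared error and reports pairs below a threshold; a minimal, self-contained sketch of that scoring step (modeled on the `_mse` helper quoted in the issues below):

```python
import numpy as np

def mse(image_a, image_b):
    """Mean squared error between two equally sized image matrices."""
    err = np.sum((image_a.astype("float") - image_b.astype("float")) ** 2)
    return err / float(image_a.shape[0] * image_a.shape[1])

# two tiny 2x2 "images": identical pair scores 0, a differing pixel raises the score
img_a = np.array([[10, 20], [30, 40]])
img_c = np.array([[10, 20], [30, 90]])

assert mse(img_a, img_a) == 0.0
assert mse(img_a, img_c) == 625.0  # (90 - 40)^2 / 4 pixels
```

A pair counts as duplicate when this score falls under the similarity threshold (200 for the default "normal" setting).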

Issues
  • run the CLI, how?


    Hello,

call me stupid, but I'm trying to run the CLI version of this code. I can run it from a basic script:

        from difPy import dif
        search = dif("C:/Path/to/Folder/")

    and this works. But if I run it as python dif.py -A "C:/Path/to/Folder_A/" I get a "no such file or directory" error.

    And yes, I'm not very familiar with Python (yet).

    Kind Regards,

    Gerrit Kuilder

    question 
    opened by GerritKuilder 4
  • Search results' keys are just names, but sometimes in sub-folders


    Hi there! I have a folder like this:

    folder/
    | - IMG_202201.jpg
    | - IMG_202202.jpg
    | - subfolder/
    |  | - IMG_202203.jpg
    

and I use it as the first argument.

    I noticed that difPy.dif() search results give me just the file name, without the subfolder noted anywhere :neutral_face:

    This broke my script with FileNotFoundError: [Errno 2] No such file or directory

    bug 
    opened by TheLastGimbus 4
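Until the fix landed (v2.4.5 in the releases below addressed subfolder handling), one workaround was to resolve each bare filename against the folder tree before touching the file. `resolve_name` here is an illustrative helper, not part of difPy:

```python
import os

def resolve_name(root, name):
    """Walk `root` and return the full path of the first file called `name`, or None."""
    for dirpath, _dirs, files in os.walk(root):
        if name in files:
            return os.path.join(dirpath, name)
    return None  # not found anywhere under root
```

With the folder layout from the report, `resolve_name(folder, "IMG_202203.jpg")` would return the `subfolder/...` path that the bare key is missing.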
  • PNGs with transparency are mistakenly counted as duplicate and not rendered properly in GUI compare


    Great tool! I learned a lot reading the article you wrote about this as well.

I tested it on some of my files, but found that some PNGs that were just line art (black line art on a transparent background) were flagged as duplicates when they were completely different, even on high sensitivity. In fact, the listed MSE is 0.00.

    They also did not render properly during the image comparison when running -d False, with both image previews looking like black squares. Note: this does not apply to line art of a different color on a transparent background, only black.

    I am not familiar with how the PNG file format encodes black vs transparent, but I believe that the issue stems from that.

(screenshot attached, 2022-07-22 1:57 AM)

    question 
    opened by SPRCoreDump 4
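A plausible cause (an assumption, not confirmed in the thread): the loader reads with IMREAD_UNCHANGED and then slices away the alpha channel (`img[..., 0:3]` in the code quoted further down), so black line art on a transparent background collapses into an all-black square. Compositing RGBA images onto a white background before comparing would preserve the line art; a sketch:

```python
import numpy as np

def flatten_alpha(img):
    """Composite an RGBA image onto white instead of just dropping the alpha channel."""
    if img.ndim == 3 and img.shape[2] == 4:
        alpha = img[..., 3:4].astype("float") / 255.0
        rgb = img[..., :3].astype("float")
        img = (rgb * alpha + 255.0 * (1 - alpha)).astype("uint8")
    return img

# black line on transparent background: dropping alpha gives an all-black square,
# compositing keeps the black pixels distinguishable from the (white) background
rgba = np.zeros((2, 2, 4), dtype="uint8")
rgba[0, 0] = [0, 0, 0, 255]  # one opaque black pixel
flat = flatten_alpha(rgba)
assert flat[0, 0].tolist() == [0, 0, 0]
assert flat[1, 1].tolist() == [255, 255, 255]
```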
  • ValueError.


    Hi there,

I'm trying to run this code on a folder with more than 80k images and get this error:

    Traceback (most recent call last):
      File ".\difpy.py", line 3, in <module>
        dif.compare_images("PATH TO FOLDER")
      File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 35, in compare_images
        imgs_matrix = dif.create_imgs_matrix(directory, px_size)
      File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 121, in create_imgs_matrix
        imgs_matrix = np.concatenate((imgs_matrix, img))
      File "<__array_function__ internals>", line 6, in concatenate
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 2 dimension(s)
    

What am I doing wrong?

    Thanks in advance

    bug 
    opened by rqtqp 4
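The traceback comes from mixing a 2-D (grayscale) array with 3-D (color) arrays in a single `np.concatenate` call; this looks like the black-and-white decoding issue fixed in v2.4.4 (see the releases below). Expanding grayscale images to three channels before stacking, roughly, avoids it:

```python
import numpy as np

def to_rgb(img):
    """Give a 2-D grayscale array a channel axis so it stacks with color images."""
    if img.ndim == 2:
        img = np.stack([img] * 3, axis=-1)  # replicate the single channel three times
    return img

gray = np.zeros((4, 4))
color = np.zeros((4, 4, 3))
# after conversion, both arrays have the same number of dimensions and concatenate cleanly
assert to_rgb(gray).shape == color.shape
```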
  • Same duplicate in different keys


We have found that when you use dif on a folder of folders, there may be some unexpected behaviour. In our case, we have a pair of duplicates in one folder and a third duplicate in another one. This makes the result output:

(image attached)

So an element that was detected as a duplicate is later being used as a key. We do not know if this is a bug or a feature, but it may be inconsistent with the behavior of not repeating duplicates in later keys. Still, for our use we can just use a set() as a workaround to ignore "duplicates of duplicates".

    Nice work on the tool, it has helped us a lot with a nasty database. Thank you, have a nice day!


    bug 
    opened by Fenho 3
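The set() workaround the reporter mentions can be as simple as flattening every duplicates list into one set before acting on the files. The `results` shape here is a hypothetical example based on the output format shown elsewhere on this page:

```python
results = {
    "a.jpg": {"duplicates": ["folder1/b.jpg", "folder2/c.jpg"]},
    "b.jpg": {"duplicates": ["folder2/c.jpg"]},  # c.jpg listed under two keys
}

# flatten and deduplicate before deleting, so "duplicates of duplicates" are handled once
to_delete = set()
for entry in results.values():
    to_delete.update(entry["duplicates"])

assert to_delete == {"folder1/b.jpg", "folder2/c.jpg"}
```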
  • Erroneous results on particular image set


I've been testing various image sets trying to isolate a bug, and I got weird results on this one. There are no duplicates or similar images in this set, and similarity was set to high. For example, the first result detected 32 duplicates, with many of the files being listed more than once.

    difPy output.zip

The image set can be downloaded here, since it's too big to post: https://drive.google.com/file/d/1pbl7SttHF-mB35V1Q5ehj6A5wCb68o3B/view?usp=sharing

    bug 
    opened by MarcG2 3
  • Match Single Image with Read-Only Directory


    Dear Developer,

I'm a noob but I still love programming (I have just started), so excuse me if anything below is "obvious" or "incorrectly stated".

    I got the gist that this will match all files in the given directory for similarity.

First point: Is it possible to match a single image (passing its file path as a parameter) against a directory (passing the folder path as a parameter)? That is, instead of matching all images against all images, we could match just one image against all images in a folder.

Second point: Does the function write anything to the search folder (like tensor data)? I'm asking to understand whether this can work on a read-only directory. (I tried reading the code but could not figure it out.)

Third point: If we have to run / call it multiple times on a large folder, would it take a long time re-analyzing all files each time, or is it possible to pass a path to a file / folder where it can save the analysis, to save time?

Example:

        Input_file_path = "~/Downloads/image.jpg"  # any valid image file
        Target_Folder_path = "~/A_Readonly_Folder_of_Images"  # a read-only folder with, say, 56,000 (big number?) files to search from
        Working_File_or_Folder_path = "~/A_File_or_Folder_with_Read_Write_Access"  # a write-enabled file/folder to save analysis data to / read it from
        # if the passed file/folder does not exist, create it and save the analysis data;
        # if it does exist, read it and use it instead of analyzing the target folder again
        # calling: dif.compare_image(Input_file_path, Target_folder_path, Working_Folder_path)

Please excuse me if I am crossing any limits here. I just became curious about this wonderful concept, but I know nothing about GitHub and how it works.

Best regards, Ashish

    question 
    opened by ashish128 3
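On the first point: difPy had no single-image mode at the time, but the idea can be sketched with the same MSE scoring. `match_one` is a hypothetical helper for illustration, not difPy API:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally sized image matrices."""
    diff = a.astype("float") - b.astype("float")
    return float((diff ** 2).sum() / (a.shape[0] * a.shape[1]))

def match_one(target, candidates, threshold=200):
    """Return names of candidate images whose MSE to `target` is below `threshold`.

    candidates: list of (name, image_matrix) pairs, all resized to target's shape.
    """
    return [name for name, img in candidates if mse(target, img) < threshold]

target = np.full((2, 2), 100)
cands = [("dup.jpg", np.full((2, 2), 101)),    # MSE 1 -> match
         ("other.jpg", np.full((2, 2), 200))]  # MSE 10000 -> no match
assert match_one(target, cands) == ["dup.jpg"]
```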
  • [CHANGE REQUEST] replacing 'output directory' with 'move_path'


Hello. First of all, I would like to thank you for creating and maintaining this project. It has certainly helped me find a bunch of duplicate images throughout my enormous gallery.

I discovered this project 3-4 months ago. I needed a way for difPy to move my duplicate images to certain directories, but it was not possible. I edited the source code, which was really easy, having had little to no Python experience prior to this.

    As I recently wanted to make a pull request, I noticed that this repository had been updated, which meant that I had to update my version as well. Along with the updates, I noticed a new output_directory flag, which was only useful if using this program through the command line. I made my changes and would like to introduce my implementation.

    Instead of the (now present) output_directory flag, I added move, silent_move and move_path as parameters to the __init__ function. Here are the details:

    • Their default values are (of course) false
    • move and silent_move would be further passed to the _validate_parameters() function
    • After processing directory_A and directory_B, if move was set to true, the move_path would be validated - checked if it was equal to directory_A and/or directory_B, and it would be further passed to the _process_directory() function
    • An appropriate prompt for the silent_move parameter
    • In the _validate_parameters() function, move and delete can not be both true, as well as move and silent_move accepting only boolean values
• A _move_imgs() function, similar to _delete_imgs(), with appropriate behavior
    • -m, --move, -M, --silent_move, -mp and --move-path CLI flags

The currently implemented output_directory flag only works for the CLI, but not for Python scripts, as it is not passed over to the __init__ function. As a result, I have removed the output_directory flag and replaced it with my move implementation. This version takes both the command line and scripts into account.

If this idea sounds good to you, I would be happy to submit a pull request with my changes so you can take a better look at how they would be implemented.

    Looking forward to collaborating and contributing to this project as much as I can.

    new feature out of scope 
    opened by bojanmilevski 2
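The proposed `_move_imgs()` could mirror `_delete_imgs()` with `shutil.move`. This is a sketch of the behavior described above; the function name comes from the proposal, the body is an assumption:

```python
import os
import shutil

def move_imgs(lower_quality_set, move_path):
    """Move each lower-quality duplicate into move_path instead of deleting it."""
    os.makedirs(move_path, exist_ok=True)
    moved = 0
    for file in lower_quality_set:
        try:
            shutil.move(file, os.path.join(move_path, os.path.basename(file)))
            moved += 1
        except OSError:
            print("Could not move file:", file)
    print("Moved", moved, "images.")
```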
  • Near duplicate Image detection


Hello, first of all thanks for creating this package; it is a really good package for detecting duplicate images. I have tried it and found that it detects images with 100% similarity, but it was not able to detect images whose similarity is below 100%, even at 99.99% or slightly less. I tried playing with the pixel and similarity values, but it still could not detect them. So, is there a way to detect images with a similarity score of less than 100% using difPy?

I have attached a few images which it was not able to detect. Note: the percentage values I have referred to several times come from the matchTemplate method; the attached images have 99% similarity.

    TOI_Delhi_12-07-2022_4_1 TOI_Delhi_12-07-2022_4_2 TOI_Delhi_12-07-2022_4_3 TOI_Delhi_12-07-2022_4_7 TOI_Delhi_12-07-2022_4_8

    question 
    opened by dhruvbhatnagar9548 2
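For context, difPy's "similarity" setting maps to a fixed MSE threshold (values taken from the `_map_similarity` helper quoted on this page); a near-duplicate pair whose MSE lands above even the "low" threshold of 1000 is simply not reported, which matches the behaviour described above:

```python
def map_similarity(similarity):
    # MSE thresholds from difPy's _map_similarity:
    # a pair is reported as duplicate/similar only when its MSE is below the value
    return {"high": 0.1, "normal": 200, "low": 1000}[similarity]

assert map_similarity("normal") == 200
# a visually near-identical pair whose MSE works out to, say, 1500
# is missed even on the most permissive "low" setting
assert not 1500 < map_similarity("low")
```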
  • search in Sub directories


    Hi Elise!

    Thank you for existing!

My OneDrive duplicated my library about 4 years ago; that, plus countless backups from WhatsApp and Messenger, a 550 GB mess, yeah you get the point.

    I'm really new to coding and git, so I figured I'll post code instead. It's not clean, but I'm pressed for time studying applied data science and working as a product manager.

    I have a few more ideas, but the code below was necessary for me right now :)

    The code finds photos in all subdirectories (folders in a folder) in the given file paths. The code I have added is commented, from "#added by Kristofer from" to "#added by Kristofer to".

import skimage.color
    import matplotlib.pyplot as plt
    import numpy as np
    import cv2
    import os
    import imghdr
    import time
    import collections
    # added by Kristofer
    from pathlib import Path

    class dif:

    def __init__(self, directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False, silent_del=False):
        """
        directory_A (str)....folder path to search for duplicate/similar images
        directory_B (str)....second folder path to search for duplicate/similar images
        similarity (str)....."normal" = searches for duplicates, recommended setting, MSE < 200
                             "high" = searches for exact duplicates, extremely sensitive to details, MSE < 0.1
                             "low" = searches for similar images, MSE < 1000
        px_size (int)........recommended not to change default value
                             resize images to px_size height x width (in pixels) before being compared
                             the higher the pixel size, the more computational resources and time required
        sort_output (bool)...False = adds the duplicate images to the output dictionary in the order they were found
                             True = sorts the duplicate images in the output dictionary alphabetically
        show_output (bool)...False = omits the output and doesn't show found images
                             True = shows duplicate/similar images found in output
        delete (bool)........! please use with care, as this cannot be undone
                             lower resolution duplicate images that were found are automatically deleted
        silent_del (bool)....! please use with care, as this cannot be undone
                             True = skips asking for user confirmation when deleting lower resolution duplicate images
                             will only work if "delete" AND "silent_del" are both == True

        OUTPUT (set).........a dictionary with the filenames of the duplicate images
                             and a set of lower resolution images of all duplicates
        """
        start_time = time.time()
    
       
        if directory_B != None:
            # process both directories
            dif._process_directory(directory_A)
            dif._process_directory(directory_B)
        else:
            # process one directory
            dif._process_directory(directory_A)
            directory_B = directory_A
    
        all_directories_A = [directory_A]
        all_directories_B = [directory_B]
    
        #added by Kristofer from
        for path in Path(directory_A).iterdir():
            if path.is_dir():
                all_directories_A.append(path)
    
        for path in Path(directory_B).iterdir():
            if path.is_dir():
                all_directories_B.append(path)
        
        dif._validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del)
    
        for dif_A in all_directories_A:
            for dif_B in all_directories_B:
    
                directory_A = str(dif_A)
                directory_B = str(dif_B)
        #added by Kristofer to                    
                       
                if directory_B == directory_A:
                    result, lower_quality = dif._search_one_dir(directory_A, 
                                                                    similarity, px_size, sort_output, show_output, delete)
                else:
                    result, lower_quality = dif._search_two_dirs(directory_A, directory_B, 
                                                                    similarity, px_size, sort_output, show_output, delete)
                    if len(lower_quality) != len(set(lower_quality)):
                        print("DifPy found that there are duplicates within directory A.")
                        
                if sort_output == True:
                    result = collections.OrderedDict(sorted(result.items()))
                
                time_elapsed = np.round(time.time() - start_time, 4)
                
                self.result = result
                self.lower_quality = lower_quality
                self.time_elapsed = time_elapsed
                
                if len(result) == 1:
                    images = "image"
                else:
                    images = "images"
                print("Found", len(result), images, "with one or more duplicate/similar images in", time_elapsed, "seconds.")
                
                if len(result) != 0:
                    if delete:
                        if not silent_del:
                            usr = input("Are you sure you want to delete all lower resolution duplicate images? \nThis cannot be undone. (y/n)")
                            if str(usr) == "y":
                                dif._delete_imgs(set(lower_quality))
                            else:
                                print("Image deletion canceled.")
                        else:
                            dif._delete_imgs(set(lower_quality))
    
                    
            
    def _search_one_dir(directory_A, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
        
        img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
        result = {}
        lower_quality = []   
        
        ref = dif._map_similarity(similarity)
        
        # find duplicates/similar images within one folder
        for count_A, imageMatrix_A in enumerate(img_matrices_A):
            for count_B, imageMatrix_B in enumerate(img_matrices_A):
                if count_B > count_A:  # compare each unordered pair exactly once
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            imageMatrix_B = dif._rotate_img(imageMatrix_B)
    
                        err = dif._mse(imageMatrix_A, imageMatrix_B)
                        if err < ref:
                            if show_output:
                                dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                                dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                                   str("..." + directory_A[-35:]) + "/" + filenames_A[count_B])
                            if filenames_A[count_A] in result.keys():
                                result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_A + "/" + filenames_A[count_B]]
                            else:
                                result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                    "duplicates" : [directory_A + "/" + filenames_A[count_B]]
                                                                   }
                            high, low = dif._check_img_quality(directory_A, directory_A, filenames_A[count_A], filenames_A[count_B])
                            lower_quality.append(low)                         
                            break
                        else:
                            rotations += 1    
        if sort_output == True:
            result = collections.OrderedDict(sorted(result.items()))
        return result, lower_quality            
    
    def _search_two_dirs(directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
    
        img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
        img_matrices_B, filenames_B = dif._create_imgs_matrix(directory_B, px_size)
        
        result = {}
        lower_quality = []   
        
        ref = dif._map_similarity(similarity)
            
        # find duplicates/similar images between two folders
        for count_A, imageMatrix_A in enumerate(img_matrices_A):
            for count_B, imageMatrix_B in enumerate(img_matrices_B):
                rotations = 0
                #print(count_A, count_B)
                while rotations <= 3:
    
                    if rotations != 0:
                        imageMatrix_B = dif._rotate_img(imageMatrix_B)
                        
                    err = dif._mse(imageMatrix_A, imageMatrix_B)
                    #print(err)
                    if err < ref:
                        if show_output:
                            dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                            dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                               str("..." + directory_B[-35:]) + "/" + filenames_B[count_B])
                        
                        if filenames_A[count_A] in result.keys():
                            result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_B + "/" + filenames_B[count_B]]
                        else:
                            result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                "duplicates" : [directory_B + "/" + filenames_B[count_B]]
                                                               }
                        high, low = dif._check_img_quality(directory_A, directory_B, filenames_A[count_A], filenames_B[count_B])
                        lower_quality.append(low)                         
                        break
                    else:
                        rotations += 1    
                
        if sort_output == True:
            result = collections.OrderedDict(sorted(result.items()))
        return result, lower_quality
    
    def _process_directory(directory):
        # check if directories are valid
        directory += os.sep
        if not os.path.isdir(directory):
            raise FileNotFoundError("Directory: " + directory + " does not exist")
        return directory
    
    def _validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del):
        # validate the parameters of the function
        if sort_output != True and sort_output != False:
            raise ValueError('Invalid value for "sort_output" parameter.')
        if show_output != True and show_output != False:
            raise ValueError('Invalid value for "show_output" parameter.')
        if similarity not in ["low", "normal", "high"]:
            raise ValueError('Invalid value for "similarity" parameter.')
        if px_size < 10 or px_size > 5000:
            raise ValueError('Invalid value for "px_size" parameter.')
        if delete != True and delete != False:
            raise ValueError('Invalid value for "delete" parameter.')   
        if silent_del != True and silent_del != False:
            raise ValueError('Invalid value for "silent_del" parameter.')   
    
    def _create_imgs_matrix(directory, px_size):
        directory = dif._process_directory(directory)
        img_filenames = []
        # create list of all files in directory     
        folder_files = [filename for filename in os.listdir(directory)]
    
        # create images matrix   
        imgs_matrix = []
        for filename in folder_files:
            path = os.path.join(directory, filename)
            # check if the file is not a folder
            if not os.path.isdir(path):
                try:
                    img = cv2.imdecode(np.fromfile(path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
                    if type(img) == np.ndarray:
                        img = img[..., 0:3]
                        img = cv2.resize(img, dsize=(px_size, px_size), interpolation=cv2.INTER_CUBIC)
                        
                        if len(img.shape) == 2:
                            img = skimage.color.gray2rgb(img)
                        imgs_matrix.append(img)
                        img_filenames.append(filename)
                except:
                    pass
        return imgs_matrix, img_filenames
    
    def _map_similarity(similarity):
        if similarity == "low":
            ref = 1000
        # search for exact duplicate images, extremely sensitive, MSE < 0.1
        elif similarity == "high":
            ref = 0.1
        # normal, search for duplicates, recommended, MSE < 200
        else:
            ref = 200
        return ref
    
    # Function that calculates the mean squared error (MSE) between two image matrices
    def _mse(imageA, imageB):
        err = np.sum((imageA.astype("float") - imageB.astype("float")) ** 2)
        err /= float(imageA.shape[0] * imageA.shape[1])
        return err
    
    # Function that plots two compared image files and their mse
    def _show_img_figs(imageA, imageB, err):
        fig = plt.figure()
        plt.suptitle("MSE: %.2f" % (err))
        # plot first image
        ax = fig.add_subplot(1, 2, 1)
        plt.imshow(imageA, cmap=plt.cm.gray)
        plt.axis("off")
        # plot second image
        ax = fig.add_subplot(1, 2, 2)
        plt.imshow(imageB, cmap=plt.cm.gray)
        plt.axis("off")
        # show the images
        plt.show()
        
    # Function for printing filename info of plotted image files
    def _show_file_info(imageA, imageB):
        print("""Duplicate files:\n{} and \n{}
        
        """.format(imageA, imageB))
        
    # Function for rotating an image matrix by a 90 degree angle
    def _rotate_img(image):
        image = np.rot90(image, k=1, axes=(0, 1))
        return image
    
    # Function for checking the quality of compared images, appends the lower quality image to the list
    def _check_img_quality(directoryA, directoryB, imageA, imageB):
        dirA = dif._process_directory(directoryA)
        dirB = dif._process_directory(directoryB)
        size_imgA = os.stat(dirA + imageA).st_size
        size_imgB = os.stat(dirB + imageB).st_size
        if size_imgA >= size_imgB:
            return directoryA + "/" + imageA, directoryB + "/" + imageB
        else:
            return directoryB + "/" + imageB, directoryA + "/" + imageA
        
    # Function for deleting the lower quality images that were found after the search    
    def _delete_imgs(lower_quality_set):
        deleted = 0
        for file in lower_quality_set:
            print("\nDeletion in progress...", end = "\r")
            try:
                os.remove(file)
                print("Deleted file:", file, end = "\r")
                deleted += 1
            except:
                print("Could not delete file:", file, end = "\r")
        print("\n***\nDeleted", deleted, "images.")
    


    new feature 
    opened by DeyoSwed 2
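Note that the patch above collects only one level of subfolders (`iterdir`). A fully recursive variant using `pathlib.Path.rglob` could look like this (a sketch, not the code difPy later shipped):

```python
from pathlib import Path

def collect_dirs(root):
    """Return root plus every directory nested anywhere beneath it."""
    return [str(root)] + [str(p) for p in Path(root).rglob("*") if p.is_dir()]
```

Each returned directory can then be fed to the same pairwise search as in the patch.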
  • Local variable 'imgs_matrix' referenced before assignment


    Hello,

    I get this error while trying to run this simple line from your package (the import works). Some help would be very welcome.

    UnboundLocalError: local variable 'imgs_matrix' referenced before assignment

(image attached)

    bug 
    opened by Tesax123 2
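That UnboundLocalError typically means no image in the folder could be decoded, so the matrix variable was never assigned inside the loop (an assumption, based on the v1-era `create_imgs_matrix` shown in the ValueError traceback above). Initializing the accumulator before the loop, as the newer code on this page does, avoids the crash:

```python
def create_imgs_matrix(decoded_images):
    """Accumulate decoded images; an empty or unreadable folder yields [] instead of an error."""
    imgs_matrix = []  # assigned before the loop, so the name always exists
    for img in decoded_images:
        if img is not None:  # skip files that failed to decode
            imgs_matrix.append(img)
    return imgs_matrix

assert create_imgs_matrix([]) == []
assert create_imgs_matrix([None, "img"]) == ["img"]
```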
  • Refactoring - Optional Merge


    Hi Elise :wave:

first of all, cool idea! I recently needed to compare large chunks of images, and your approach for comparing them worked pretty well :+1:

That being said, the current implementation is rather slow: comparing larger chunks of images (15,000+) takes a while. Moreover, you use a lot of different dependencies, some of which are quite large (e.g. opencv). This makes it difficult to install the tool in specific environments like a Docker container.

Since I will probably need to compare images again in the future, I thought of addressing these issues; this pull request provides the results. Before talking about the changes, let me apologize for the huge pull request. I actually do not like large pull requests for my own repos and avoid submitting them to other people as well. However, the dependency changes, and especially the multiprocessing, required a larger restructuring of your tool. Therefore, I totally understand if you do not want to merge the changes. In that case, I'm fine with maintaining a fork of your repository that provides an alternative implementation. Just decide as you like :)

    Here is a brief summary of the changes I made:

    1. Make a clearer cut between CLI and library. The CLI script is now contained in /bin/difpy, while the code in /difPy/difPy.py only contains the library implementation.
2. Reduce dependencies. The whole technique you describe can be implemented using numpy and Pillow. This makes it possible to create a Docker container running difPy that is only 161 MB. Before, with opencv, we were around 1.2 GB.
    3. Add multiprocessing. Work can now be distributed between different cores, which should speed up the operation quite a bit for larger image sets.
4. Add a fast compare option. When image A is similar to image B, one probably does not want to compare B to other images, but is fine with only comparing A with the others from here on. Sure, this may miss some edge-case duplicates, but in most situations it should be fine, and it provides a huge speedup.
    5. Change the command line layout. Feels now more intuitive (at least to me :D)
6. Change the output format. The output format is still JSON based, but does not include much statistical information now. The regular end user is probably not that interested in when a comparison took place, but more in the actual comparison result. The new reduced output format should be easier to read / parse.
    7. Add a Dockerfile for building a container running difpy.

    As I said, many changes. Just think about whether you want to merge or whether we keep these changes in a separate fork. I'm fine with both approaches :wink:

    Best Tobias

    new feature 
    opened by qtc-de 1
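The dependency cut in point 2 hinges on replacing opencv's decode/resize with Pillow; a minimal sketch of such a loader (the function name `load_resized` is illustrative, not the PR's actual API):

```python
import numpy as np
from PIL import Image

def load_resized(path, px_size=50):
    """Decode an image and shrink it to px_size x px_size using Pillow only."""
    with Image.open(path) as img:
        img = img.convert("RGB").resize((px_size, px_size))
        return np.asarray(img, dtype="float64")
```

The resulting array plugs into the same MSE comparison as before, with no opencv in the dependency tree.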
  • Multi-processing


I am currently working on making this project multithreaded, as I have many folders with tens of thousands of images (perhaps 100k+) and want a slightly faster option.

    Opening this as a means of communication. If you have a Discord account/email, that would work better, as I will likely see it before a GitHub issue comment.
    My discord account is thecodingchicken#4835 if you would prefer to reach out there.

    new feature 
    opened by thecodingchicken 3
  • Multi-threading


Hi! I have a nice AMD CPU with 8 cores, and when I'm searching through 2 big folders it takes a lot of time because only one core is being used.

    Dividing the work into multiple threads seems like an obvious task for this library; it would be awesome if you implemented it! (or suggested how it could be done for someone to pull request)

    new feature 
    opened by TheLastGimbus 1
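A thread-pool sketch of the idea, distributing per-pair comparisons across workers (note: plain-Python pixel math holds the GIL, so real speedups need the per-pair work to run in numpy, which releases it, or in separate processes):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def mse(a, b):
    """MSE between two flat pixel lists of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def compare_all(images, workers=8):
    """Score every unordered pair of named images in a thread pool."""
    pairs = list(combinations(images.items(), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        errs = pool.map(lambda pair: mse(pair[0][1], pair[1][1]), pairs)
    return {(a_name, b_name): err
            for ((a_name, _), (b_name, _)), err in zip(pairs, errs)}

imgs = {"a.jpg": [0, 0], "b.jpg": [0, 0], "c.jpg": [10, 10]}
scores = compare_all(imgs)
assert scores[("a.jpg", "b.jpg")] == 0.0
assert scores[("a.jpg", "c.jpg")] == 100.0
```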
  • feature request: chunking of source folder


Thank you for your library! Just giving a heads up that I edited one of your previous versions by adding a parameter that allows the source folder to be split into n chunks for processing. Scenario: I have image folders that contain over 50,000 images collected sequentially over time.

For me, it is most likely that an image file is a duplicate of other image files added around a similar time. Comparing against the entire 50,000+ set for each image took an enormous amount of time, so I made it so that I could split the folder into chunks of 5,000 (for example) and evaluate in sections. It also allowed me to restart from a position if I had to stop evaluation for some reason. There's a little more that I added to make it more robust (for example, the (n+1)th chunk also includes some files from the previous chunk so that there is some degree of overlap). Anyway, this worked out well for me, and if you are still adding to this library, I found it to be very useful.

The route I took is not going to be as robust as going through EVERY image each time, but in my personal tests the performance was close enough and the time savings were significant! Cheers,

    new feature 
    opened by ALCarter2 1
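The chunking-with-overlap scheme described above can be sketched like this (`chunk_files` is illustrative, not part of difPy):

```python
def chunk_files(files, size, overlap=0):
    """Split a sorted file list into chunks of `size`, each also carrying
    `overlap` files from the end of the previous chunk so near-boundary
    duplicates are still compared."""
    chunks = []
    start = 0
    while start < len(files):
        lo = max(0, start - overlap)
        chunks.append(files[lo:start + size])
        start += size
    return chunks

files = list(range(10))
assert chunk_files(files, 4, overlap=1) == [[0, 1, 2, 3], [3, 4, 5, 6, 7], [7, 8, 9]]
```

Each chunk can then be handed to difPy independently, trading some cross-chunk coverage for a much smaller number of comparisons per run.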
Releases (v2.4.5)
  • v2.4.5(Jan 1, 2023)

    Major updates and bug fixes:

    • Fixed issue #42 where duplicate files in subfolders would be added twice to the search.result output dictionary
    • @stberg-os implemented the feature to disable recursive search: search within subfolders can now be turned off
    • Various other minor code updates

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.4...v2.4.5

    Source code(tar.gz)
    Source code(zip)
  • v2.4.4(Aug 25, 2022)

    Major code improvements & fixes

    • Fixed issue #37 where black and white images would not be correctly decoded.
    • Fixed issue where command line parameter -s / -similarity would not accept integers as input
    • Various other fixes in the code

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.3...v2.4.4

    Source code(tar.gz)
    Source code(zip)
  • v2.4.3(Aug 24, 2022)

    Please update to a higher version as a major issue was found in v2.4.3.

    Major bug fix

    • Fixed issue #37 which caused difPy's output to be inaccurate.

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.2...v2.4.3

    Source code(tar.gz)
    Source code(zip)
  • v2.4.2(Aug 21, 2022)

    Please update to a higher version as a major issue was found in v2.4.2.

    Bug fixes & minor code improvements

    • Fixed issue #33 where files with the same filename but in different folders would be put under the same key in the output results dictionary
    • Removed the sort_output parameter as it became obsolete with the above fix
    • Added support for setting the MSE threshold for comparison directly from the similarity parameter
    • Implemented handling for issue #32 where CTRL-C would not abort the difPy process when running in a terminal
    • Various other code improvements
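
    For context on the similarity parameter: the threshold applies to the mean squared error (MSE) between two images. A toy version over flat pixel lists (not difPy's actual code, which operates on decoded image arrays) could look like:

```python
# Hedged illustration of mean squared error (MSE): 0 for identical
# images, growing as pixel values diverge. A similarity setting maps
# to a threshold on this value.
def mse(pixels_a, pixels_b):
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(pixels_a, pixels_b)) / len(pixels_a)

print(mse([0, 10, 20], [0, 10, 20]))  # 0.0
print(mse([0, 0], [0, 2]))            # 2.0
```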

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.1...v2.4.2

    Source code(tar.gz)
    Source code(zip)
  • v2.4.1(Jul 10, 2022)

    Minor code updates and bug fixes

    • Changed the show progress parameter to default to True: difPy's progress bar is now shown by default
    • Added -Z / -output_directory parameter to the CLI interface: allows setting the output folder for the result files
    • More detailed progress tracking: a progress bar is shown while difPy is preparing the files in the target folder(s) and while it is comparing the images
    • Fixed an issue where search in subfolders was imprecise
    • @ethanmann fixed issue #25
    • Minor other code adjustments and bug fixes

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4...v2.4.1

    Source code(tar.gz)
    Source code(zip)
  • v2.4(Jun 30, 2022)

    Major new features and code improvements:

    • Enhancement #12 and #18: added support for search within subfolders
    • Enhancement #11: added support for usage through CLI interface
    • Improved path handling of files to be OS-independent
    • Various minor code updates

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.3...v2.4

    Source code(tar.gz)
    Source code(zip)
  • v2.3(Jun 29, 2022)

    New features and code improvements:

    • Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/19: added support for a progress bar to track the process of difPy
    • Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/20: added support for generation of statistics on the difPy process
    • Fixed bug #17 which caused a FileNotFoundError when files were moved/deleted while difPy was running
    • Various updates & improvements to the code

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.2...v2.3

    Source code(tar.gz)
    Source code(zip)
  • v2.2(Mar 6, 2022)

  • v2.0(Dec 26, 2021)

    Major code updates and various new features added:

    • difPy v2.0 runs 6x faster than previous versions
    • Support for search within two different folders
    • Support for sorting of output by filename alphabetically
    • Optimization and implementation of error handling
    • Various other code improvements
    Source code(tar.gz)
    Source code(zip)
  • v1.2(Nov 10, 2021)

  • v1.0.0(Oct 30, 2021)

    Various updates to the code.

    New features:

    • Automatically delete the lower resolution duplicate files that were found
    • Addition of a new similarity-level at which images are compared: now 3 levels can be chosen ("low", "normal" and "high")

    Upload as package to PyPI.org

    Source code(tar.gz)
    Source code(zip)
  • v0.0(Oct 30, 2021)

Owner
Technical Solutions Specialist @ Cisco Systems
Google Project: Search and auto-complete sentences within given input text files, manipulating data with complex data-structures.

Auto-Complete Google Project In this project there is an implementation for one feature of Google's search engines - AutoComplete. Autocomplete, or wo

Hadassah Engel 10 Jun 20, 2022
Wagtail CLIP allows you to search your Wagtail images using natural language queries.

Wagtail CLIP allows you to search your Wagtail images using natural language queries.

Matt Segal 10 Dec 21, 2022
GitScanner is a script to make it easy to search for Exposed Git through an advanced Google search.

GitScanner Legal disclaimer Usage of GitScanner for attacking targets without prior mutual consent is illegal. It is the end user's responsibility to

Kaio Gomes 3 Oct 28, 2022
High level Python client for Elasticsearch

Elasticsearch DSL Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch. It is built o

elastic 3.6k Dec 30, 2022
Pysolr — Python Solr client

pysolr pysolr is a lightweight Python client for Apache Solr. It provides an interface that queries the server and returns results based on the query.

Haystack Search 626 Dec 1, 2022
Whoosh indexing capabilities for Flask-SQLAlchemy, Python 3 compatibility fork.

Flask-WhooshAlchemy3 Whoosh indexing capabilities for Flask-SQLAlchemy, Python 3 compatibility fork. Performance improvements and suggestions are read

Blake VandeMerwe 27 Mar 10, 2022
Senginta is an all-in-one search engine scraper for use via API or as a Python module. It's free!

Senginta is an all-in-one search engine scraper. With traditional scraping, Senginta can be powerful for getting results from any search engine and converting them to JSON. It currently supports only the Google Product Search Engine (GShop, GVideo and more) and the Baidu Search Engine.

null 33 Nov 21, 2022
esguard provides a Python decorator that waits for processing while monitoring the load of Elasticsearch.

esguard esguard provides a Python decorator that waits for processing while monitoring the load of Elasticsearch. Quick Start You need to launch elast

po3rin 5 Dec 8, 2021
A real-time tech course finder, created using Elasticsearch, Python, React+Redux, Docker, and Kubernetes.

A real-time tech course finder, created using Elasticsearch, Python, React+Redux, Docker, and Kubernetes.

Dinesh Sonachalam 130 Dec 20, 2022
a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot

Drive Search Bot This is a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot How to deploy? Clone this repo: git clone

Hafitz Setya 25 Dec 9, 2022
A simple algorithmic search engine like Google, written in Python using functions

Mini-Search-Engine-Like-Google I have created a simple algorithmic search engine like Google in Python using functions. I am matching every word with w

Sachin Vinayak Dabhade 5 Sep 24, 2021
User-friendly, tiny source code searcher written in pure Python.

User-friendly, tiny source code searcher written in pure Python. Example Usages Cat is equivalent in the regular expression as '^Cat$' bor class Cat

Furkan Onder 106 Nov 2, 2022
Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

elastic 463 Dec 30, 2022
This is a Telegram Bot written in Python for searching data on Google Drive.

This is a Telegram Bot written in Python for searching data on Google Drive. Supports multiple Shared Drives (TDs). Manual Guide for deploying the bot

Levi 158 Dec 27, 2022
Pythonic Lucene - A simplified Python implementation of Apache Lucene

A simplified Python implementation of Apache Lucene; maybe it helps to understand how an enterprise search engine really works.

Mahdi Sadeghzadeh Ghamsary 2 Sep 12, 2022
A Python web searcher library with different search engines

Robert A simple Python web searcher library with different search engines. Install pip install roberthelper Usage from robert import GoogleSearcher

null 1 Dec 23, 2021
A fast, efficiency python package for searching and getting search results with many different search engines

search A fast, efficiency python package for searching and getting search results with many different search engines. Installation To install the pack

Neurs 0 Oct 6, 2022
Back up a folder to another folder using the mirror update method.

Mirror Update Backup Back up a folder to another folder using the mirror update method. How to use Install requirement pip install -r requirements.tx

null 1 Nov 21, 2022
An image base contains 490 images for learning (400 cars and 90 boats), and another 21 images for testing

SVM Data An image base contains 490 images for training (400 cars and 90 boats), and another 21 images for testing. Preprocess

Achraf Rahouti 3 Nov 30, 2021
Simple python code to fix your combo list by removing any text after a separator or removing duplicate combos

Combo List Fixer A simple python code to fix your combo list by removing any text after a separator or removing duplicate combos Removing any text aft

Hamidreza Dehghan 3 Dec 5, 2022