cleanX

License: GPL-3

CleanX is an open-source Python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images. JPEG files can be extracted from DICOM files or used directly.

primary author: Candace Makeda H. Moore

other authors + contributors: Oleg Sivokon, Andrew Murphy

Requirements

  • a Python installation (3.7, 3.8 or 3.9)
  • the ability to create virtual environments (recommended, not strictly necessary)
  • tesserocr, matplotlib, pandas, pillow and opencv
  • optionally, SimpleITK or pydicom for DICOM/dcm to JPG conversion
  • Anaconda is supported, but not required
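
Because SimpleITK and pydicom are optional, it can help to check which of them is importable before attempting DICOM conversion. The following is a minimal sketch using only the standard library; the helper name available_dicom_backends is hypothetical and is not part of cleanX.

```python
from importlib.util import find_spec


def available_dicom_backends(candidates=("SimpleITK", "pydicom")):
    """Return the names of the candidate modules that can be imported.

    Hypothetical helper for illustration; cleanX performs its own checks.
    """
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_dicom_backends()
    if found:
        print("DICOM backends available:", ", ".join(found))
    else:
        print("Neither SimpleITK nor pydicom is installed.")
```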

Developer's Guide

Please refer to the Developer's Guide for a more detailed explanation.

Developing Using Anaconda's Python

Use Git to check out the project's source, then, in the source directory run:

conda create -n cleanx
conda activate cleanx
python ./setup.py install_dev

You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if you need to check that your changes will work in all supported versions.

Developing Using python.org's Python

Use Git to check out the project's source, then in the source directory run:

python -m venv .venv
. ./.venv/bin/activate
python ./setup.py install_dev

As with the conda-based setup, you may have to create three separate environments, for Python 3.7, 3.8 and 3.9, to recreate our CI process.

Supported Platforms

The cleanX package is pure Python, but it depends on many native libraries. We try to test it on as many platforms as we can to see whether the dependencies can be installed there. Below is the list of platforms that will potentially work.

Wherever python.org Python or Anaconda Python is listed as supported, this means that versions 3.7, 3.8 and 3.9 are supported. We know for certain that 3.6 is not supported, and there will be no support for it in the future.
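
A script that depends on cleanX could guard against running on an unsupported interpreter. This is a small sketch based only on the version list stated above; the names SUPPORTED_VERSIONS and is_supported are hypothetical.

```python
import sys

# Versions listed as supported in this README (assumption: major.minor only).
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in the supported set."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    if not is_supported():
        print("Warning: this Python version is not in the supported list.")
```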

32-bit Intel and ARM

We don't know if either one of these is supported. There's a good chance that 32-bit Intel will work. There's a good chance that ARM won't.

It's unlikely that the support will be added in the future.

AMD64 (x86)

                  Linux      Win        OSX
python.org (p)    Supported  Unknown    Unknown
Anaconda (a)      Supported  Supported  Supported

ARM64

Seems to be unsupported at the moment on both Linux and OSX, but it's likely that support will be added in the future.

Documentation

Online documentation at https://drcandacemakedamoore.github.io/cleanX/

You can also build up-to-date documentation yourself with the following commands:

python setup.py apidoc
python setup.py build_sphinx

The documentation will be generated in ./build/sphinx/html directory. Documentation is generated automatically as new functions are added.

Special additional documentation for medical professionals with limited programming ability is available on the wiki (https://github.com/drcandacemakedamoore/cleanX/wiki/Medical-professional-documentation).

To get a high level overview of some of the functionality of the program you can look at the Jupyter notebooks inside workflow_demo.

Installation

  • setting up a virtual environment is desirable, but not absolutely necessary

  • activate the environment

Anaconda Installation

  • use command for conda as below
conda install -c doctormakeda -c conda-forge cleanx

You need to specify both channels because cleanX's dependencies are spread across the Anaconda main channel and conda-forge.

pip installation

  • use pip as below
pip install cleanX

Getting Started

We will imagine a very simple scenario, where we need to automate normalization of the images we have. We stored the images in directory /images/to/clean/ and they all have jpg extension. We want the cleaned images to be saved in the cleaned directory.

Normalization here means ensuring that the lowest pixel value (the darkest part of the image) is as dark as possible and that the lightest part of the image is as light as possible.
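
To make this definition concrete, here is a minimal sketch of such a stretch using NumPy. It illustrates the idea only and is not necessarily how cleanX's Normalize step is implemented; the function name stretch_to_full_range and the 0 to 255 output range are assumptions.

```python
import numpy as np


def stretch_to_full_range(image):
    """Linearly rescale pixels so the darkest becomes 0 and the lightest 255."""
    pixels = np.asarray(image, dtype=np.float64)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # A flat image has no contrast to stretch; return all zeros.
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return np.rint(scaled).astype(np.uint8)


if __name__ == "__main__":
    sample = [[50, 100], [150, 200]]
    print(stretch_to_full_range(sample))
```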

CLI Example

The problem above doesn't require writing any new Python code. We can accomplish our task by calling the cleanX command like this:

mkdir cleaned

python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"

Let's look at the command's options and arguments:

  • python -m cleanX is Python's command-line option for loading the cleanX package. All command-line arguments that follow this part are interpreted by cleanX.
  • The images sub-command is used for processing images.
  • The run-pipeline sub-command starts a Pipeline to process the images.
  • The -s option (repeatable) specifies a Pipeline Step. Steps map to their class names as found in the cleanX.image_work.steps module. If the __init__ function of a step doesn't take any arguments, only the class name is necessary. If it does take arguments, they must be given as Python literals, using Python's named-argument syntax.
  • The -j option creates a journaling pipeline. Journaling pipelines can be restarted from the point where they failed or were interrupted.
  • The -r option specifies the source for the pipeline. Normally we start with the Acquire step, but if the pipeline was interrupted, we need to tell it where to look for the initial sources.
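
To illustrate the step syntax accepted by -s, the sketch below splits a step specification into a class name and its named arguments using the standard ast module. This is an assumption about how such a string could be parsed, not cleanX's actual parser; parse_step is a hypothetical name.

```python
import ast


def parse_step(spec):
    """Split a spec such as "Save(target='cleaned')" into (name, kwargs)."""
    node = ast.parse(spec, mode="eval").body
    if isinstance(node, ast.Name):
        # Bare class name, e.g. "Normalize": no arguments.
        return node.id, {}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only named arguments are accepted")
        # literal_eval accepts only Python literals, never arbitrary code.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return node.func.id, kwargs
    raise ValueError("cannot parse step: %r" % (spec,))


if __name__ == "__main__":
    for spec in ("Acquire", "Normalize", "Save(target='cleaned')"):
        print(parse_step(spec))
```

Restricting argument values to literals keeps the command line from executing arbitrary expressions.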

Once the command finishes, we should see the cleaned directory filled with images with the same names they had in the source directory.
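
The idea behind the journaling that -j enables can be sketched in a few lines: record each finished item, and skip recorded items on the next run. This is a simplified illustration under our own assumptions (a JSON file as the journal, the hypothetical name run_with_journal), not cleanX's journaling implementation.

```python
import json
import os


def run_with_journal(items, work, journal_path):
    """Apply work() to each item, skipping items an earlier run finished."""
    done = []
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = json.load(f)
    for item in items:
        if item in done:
            continue  # already processed before the interruption
        work(item)
        done.append(item)
        # Persist progress after every item so a crash loses at most one.
        with open(journal_path, "w") as f:
            json.dump(done, f)
    return done
```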

Let's consider another simple task: batch-extraction of images from DICOM files:


mkdir extracted

python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ \
    -o extracted

This calls the cleanX CLI in a way similar to the example above; however, it calls the dicom sub-command with its extract sub-command.

  • -i tells cleanX to look for directory named /path/to/dicoms
  • -o tells cleanX to save extracted JPGs in extracted directory.

If you have any problems with this, check #40 and add issues or discussions.

Coding Example

Below is the equivalent code in Python:

import os

from cleanX.image_work import (
    Acquire,
    Save,
    GlobSource,
    Normalize,
    create_pipeline,
)

dst = 'cleaned'
os.mkdir(dst)

src = GlobSource('/images/to/clean/*.jpg')
p = create_pipeline(
    steps=(
        Acquire(),
        Normalize(),
        Save(dst),
    ),
    journal=True,
)

p.process(src)

Let's look at what's going on here. As before, we've created a pipeline using create_pipeline with three steps: Acquire, Normalize and Save. There are several kinds of sources available for pipelines. We'll use the GlobSource to match our CLI example. We'll specify journal=True to match the -j flag in our CLI example.
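
Note that os.mkdir raises FileExistsError if the cleaned directory is left over from a previous run. A small sketch of a more forgiving setup, using only the standard library, could look like this; prepare_output_dir is a hypothetical helper, not a cleanX function.

```python
import os


def prepare_output_dir(path):
    """Create the output directory if needed and return its path."""
    # exist_ok=True makes the call safe to repeat across runs.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    dst = prepare_output_dir("cleaned")
    print("Output directory ready:", dst)
```

Existing files in the directory are left untouched, so stale results from an earlier run may need to be cleared separately.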


And for the DICOM extraction we might use similar code:

import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'
os.mkdir(dst)

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)

This will look for the files with dcm extension in /path/to/dicoms/ and try to extract images found in those files, saving them in extracted directory.
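
Before running the extraction, it can be useful to preview which files would be picked up. The sketch below lists files with a dcm extension in a directory using pathlib; it mirrors the description above but is an assumption about the matching rule (case-insensitive, non-recursive), and list_dicom_files is a hypothetical name.

```python
from pathlib import Path


def list_dicom_files(directory, extension=".dcm"):
    """Return the sorted files in directory whose suffix matches extension."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == extension
    )


if __name__ == "__main__":
    # Lists matching files in the current directory as a quick demonstration.
    for path in list_dicom_files("."):
        print(path)
```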

About using this library

If you use the library, please credit me and my collaborators. You are only free to use this library according to license. We hope that if you use the library you will open source your entire code base, and send us modifications. You can get in touch with me by starting a discussion (https://github.com/drcandacemakedamoore/cleanX/discussions/37) if you have a legitimate reason to use my library without open-sourcing your code base, or following other conditions, and I can make you specifically a different license.

We are adding new functions and classes all the time. Many unit tests are available in the test folder. Test coverage is currently partial. Some newly added functions allow for rapid automated data augmentation (in ways that are realistic for radiological data). Some other classes and functions are for cleaning datasets including ones that:

  • Get image and metadata out of dcm (DICOM) files into jpeg and csv files
  • Process datasets from csv or json or other formats to generate reports
  • Run on dataframes to make sure there is no image leakage
  • Run on a dataframe to look for demographic or other biases in patients
  • Crop off excessive black frames (run this on single images) one at a time
  • Run on a list to make a prototype tiny Xray others can be compared to
  • Run on image files which are inside a folder to check if they are "clean"
  • Take a dataframe with image names and return plotted (visualized) images
  • Run to make a dataframe of pics in a folder (assuming they all have the same 'label'/diagnosis)
  • Normalize images in terms of pixel values (multiple methods)
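
As an illustration of the leakage check mentioned in the list above, the following sketch groups files that are byte-for-byte identical by hashing their contents. It is a simplified stand-in, not cleanX's own comparison code, and it will not catch images that differ only in encoding or size; find_identical_files is a hypothetical name.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_files(paths):
    """Return groups (sorted lists) of paths whose file contents are identical."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    # Keep only the hashes shared by more than one file.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Any group that contains files from both the training and the test set would indicate leakage.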

All important functions are documented in the online documentation for programmers.

Comments
  • Joss issues

    This is a list of some improvements/suggestions or issues that may need clarifications.

    • [x] Is this file needed GNU GENERAL PUBLIC LICENSE.txt?

    • [x] Include Conda badges https://anaconda.org/doctormakeda/cleanx/badges

    • [x] Make sure that the test badges link to the test builds. Currently, they link to the image of the badge.

    • [x] Create a paper folder for the paper files and include a copy of the LICENSE file.

    • [x] Include some examples on how to get started in the readme file. The same applies to the documentation. I would expect at least some sort of getting started guide.

    • [x] Since version v0.1.9 was released, I would expect the current changes to have v0.2.0.dev as the version for these changes in development. Later to be released as v0.2.0. But if you desire to have the current pattern, that's fine.

    • [x] Move all document files to a docs folder. I think readthedocs could also enable the docs to have two versions, the stable and the latest.

    • [x] In the Jupyter notebooks we have paths like 'D:/projects/cleanX'. It would be nice to start by getting the current project's directory and then use relative paths with join. For example:

    dicomfile_directory1 = 'D:/projects/cleanX/test/dicom_example_folder'
    example = pd.read_csv("D:/projects/cleanX/workflow_demo/martians_2051.csv")
    # To
    working_dir = "Path to project home"
    example_path = os.path.normpath(os.path.join(working_dir, "workflow_demo/martians_2051.csv"))
    example = pd.read_csv(example_path)
    

    It would be nice to normalize the paths. This will help Windows users who have a hard time with / and \ characters

    opened by henrykironde 20
  • Examples and workflow_demo

    @drcandacemakedamoore 👍🏿 for getting this to finally install smoothly. Some issues that I have are detailed below.

    README.md Example:

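    • [ ] Add a check to see whether the path cleaned exists, or always delete it first and then make a new one.
    import os
    import shutil

    # Option 1: create the directory only if it is missing
    dst = 'cleaned'
    if not os.path.exists(dst):
        os.mkdir(dst)

    # Option 2: always start from an empty directory
    dst = 'cleaned'
    shutil.rmtree(dst, ignore_errors=True)  # os.rmdir fails on non-empty directories
    os.mkdir(dst)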
    

    Improve this README.md example, I had to install SimpleITK and PyDICOM. You could add this to required dependencies.

    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $ python examplecleanX.py 
    WARNING:root:Don't know how to find Tesseract library version
    /Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py:37: UserWarning: 
    Neither SimpleITK nor PyDICOM are installed.
    
    Will not be able to extract information from DICOM files.
    
      warnings.warn(
    Traceback (most recent call last):
      File "examplecleanX.py", line 36, in <module>
        from cleanX.dicom_processing import DicomReader
    ImportError: cannot import name 'DicomReader' from 'cleanX.dicom_processing' (/Users/henrysenyondo/Downloads/cleanX/cleanX/dicom_processing/__init__.py)
    (cleanx) henrysenyondo ~/Downloads/cleanX [main] $
    
    

    Use a path that actually exists in the repo instead of: src = GlobSource('/images/to/clean/*.jpg')

    workflow_demo examples:

    • [ ] Use paths that do exist in the repo, or add a comment to point to the data to be used in that given example. Assume that the user is going to run the examples in the root directory, so all paths could be relative to that directory. In the example from cleanX/workflow_demo/classes_workflow.ipynb
    • [ ] Refactor the workflow_demo files: rename them appropriately and remove files that are not needed.
    opened by henrykironde 17
  • pip install cleanx, on mac errors

    Describe the bug No package 'tesseract' found

    Screenshots

    Using legacy 'setup.py install' for tesserocr, since package 'wheel' is not installed.
    Installing collected packages: tesserocr, opencv-python, matplotlib, cleanX
        Running setup.py install for tesserocr ... error
        ERROR: Command errored out with exit status 1:
         command: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr
             cwd: /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/
        Complete output (20 lines):
        pkg-config failed to find tesseract/leptonica libraries: Package tesseract was not found in the pkg-config search path.
        Perhaps you should add the directory containing `tesseract.pc'
        to the PKG_CONFIG_PATH environment variable
        No package 'tesseract' found
        
        Failed to extract tesseract version from executable: [Errno 2] No such file or directory: 'tesseract'
        Supporting tesseract v3.04.00
        Tesseract major version 3
        Building with configs: {'libraries': ['tesseract', 'lept'], 'compile_time_env': {'TESSERACT_MAJOR_VERSION': 3, 'TESSERACT_VERSION': 50593792}}
        WARNING: The wheel package is not available.
        running install
        running build
        running build_ext
        Detected compiler: unix
        building 'tesserocr' extension
        creating build
        creating build/temp.macosx-11-x86_64-3.9
        clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/Users/henry/Documents/GitHub/testenv/include -I/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c tesserocr.cpp -o build/temp.macosx-11-x86_64-3.9/tesserocr.o
        clang: error: invalid version number in 'MACOSX_DEPLOYMENT_TARGET=11'
        error: command '/usr/bin/clang' failed with exit code 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /Users/henrykironde/Documents/GitHub/testenv/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"'; __file__='"'"'/private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-install-d2ki3n8m/tesserocr_9008e8f2373142109467f5269239caa5/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/d5/vm7jfw2x7q550xrltd8ss5440000gn/T/pip-record-zx09fpub/install-record.txt --single-version-externally-managed --compile --install-headers /Users/henrykironde/Documents/GitHub/testenv/include/site/python3.9/tesserocr Check the logs for full command output.
    WARNING: You are using pip version 21.1.3; however, version 21.2.4 is available.
    
    (testenv) ➜  cleanX git:(docs) ✗ 
    

    Your computer environment info: (please complete the following information):

    OS: [MacOSX]
    Python V=version [3.9]
    
    opened by henrykironde 7
  • Language cleanup and typos

    CleanX uses some sensitive language that may offend some users. I would recommend that you remove words like idiots, since they are against the JOSS code of conduct.

    There are typos in the doc strings, like """This class allows normalization by throwing off exxtreme values on" It would be nice to look through the doc strings and try to remove the typos.

    Note: I am still failing to install CleanX, but I think it is some complications with my Conda setup. I will keep you updated. My target is to finish with the review and final decision in 14 days.

    Ref: openjournals/joss-reviews#3632

    opened by henrykironde 5
  • wrong version of zero_to_twofivefive_simplest_norming()

    We seem to have put in an older (with a small bug) version of the zero_to_twofivefive_simplest_norming(). All image normalization functions should be tested and updated tonight (24/1/2022)

    opened by drcandacemakedamoore 3
  • Suggestions

    Can you add documentation in the following files?

    • journaling_pipeline.py, starting from line 110
    • steps.py, starting from line 112
    • Many functions in the files dataframes.py, pydicom_adapter.py, and simpleitk_adapter.py
    opened by sbonaretti 3
  • Dependency

    Create a report to help us improve

    Describe the bug tesserocr

    To Reproduce Steps to reproduce the behavior:

    pip install cleanx
    

    Expected behavior A clear and concise description of what you expected to happen. ERROR: Failed building wheel for tesserocr Running setup.py clean for tesserocr

    Screenshots If applicable, add screenshots to help explain your problem.

    Your computer environment info: (please complete the following information): Ubuntu 16.

    OS: [e.g. Linux]
    Python V=version [e.g. 3.7]
    

    I think you should add minimum requirement in the readme file

    opened by delwende 3
  • Testing builds on Windows and Mac

    It would be nice if the builds were tested on Windows and Mac. One can do that using GitHub Actions: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-adding-configurations https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idruns-on

    opened by fdiblen 2
  • fix image comparison, probably with numpy allclose() function

    Image comparison for copies function is too slow and memory intensive at present. Maybe we can implement something with the numpy library that is faster.

    opened by drcandacemakedamoore 2
  • [Security] Workflow on-tag.yml is using vulnerable action s-weigand/setup-conda

    The workflow on-tag.yml references the action s-weigand/setup-conda using the reference v1. However, this reference is missing the commit a30654e576ab9e21a25825bf7a5d5f2a9b95b202, which may contain a fix for a vulnerability. The missing fix could be related to: (1) a CVE fix, (2) an upgrade of a vulnerable dependency, (3) a fix for a secret leak, or others. Please consider updating the reference to the action.

    opened by fockboi-lgtm 2
  • Clutter in documentation

    In retrospect, using https://www.sphinx-doc.org/en/master/man/sphinx-apidoc.html was a bad idea. The code it generates is awful and impossible to control. In particular, there's no way to disable or enable special methods on per-class basis. Similarly for inheritance etc.

    Apparently, we need to replace this with something else that would generate sensible documentation pages. There's no hope that sphinx-apidoc will ever improve.

    opened by wvxvw 2
  • color normalizer- after JOSS review finishes

    Some of our users are applying this to color images, i.e. endoscopic images. This is by chance, and it could just as well have been pathology images. We should add functions explicitly for this, starting with finding color outliers. I will attack this once the JOSS review completes.

    opened by drcandacemakedamoore 0
Releases(v0.1.14)
Owner
Candace Makeda Moore, MD
Python, SQL, Javascript, and HTML. I love imaging informatics.