Data Loader Plugin - Python

Table of Content (ToC)

Overview
References
- Python module
- Python virtual environments
Installation
- Clone this Git repository
- Python environment
Usage
- Install the data-loader-plugin module
  - Install in the Python user space
  - Installation in a dedicated Python virtual environment
- Use data-loader-plugin as a module from another Python program
Development / Contribution
- Test the data loader plugin Python module

Overview

The data loader plugin supports running programs (e.g., API service backends) that need to download data from cloud services such as AWS S3. It provides a base Python library, namely data-loader-plugin, offering a few methods to download data files from AWS S3.

References

Python module

GitHub: https://github.com/cloud-helpers/python-plugin-data-loader/tree/master/data_loader_plugin
PyPi: https://pypi.org/project/data-loader-plugin/
Read the Docs (RTD): https://readthedocs.org/projects/data-loader-plugin/

Python virtual environments

Pyenv and pipenv: http://github.com/machine-learning-helpers/induction-python/tree/master/installation/virtual-env

Installation

Clone this Git repository

$ mkdir -p ~/dev/infra && \
  git clone git@github.com:cloud-helpers/python-plugin-data-loader.git ~/dev/infra/python-plugin-data-loader
$ cd ~/dev/infra/python-plugin-data-loader

Python environment

If not already done, install pyenv, Python 3.9, pip and pipenv.

pyenv:

$ git clone https://github.com/pyenv/pyenv.git ${HOME}/.pyenv
$ cat >> ~/.profile2 << _EOF

# Python
eval "\$(pyenv init --path)"

_EOF
$ cat >> ~/.bashrc << _EOF

# Python
export PYENV_ROOT="\${HOME}/.pyenv"
export PATH="\${PYENV_ROOT}/bin:\${PATH}"
. ~/.profile2
if command -v pyenv 1>/dev/null 2>&1
then
        eval "\$(pyenv init -)"
fi
if command -v pipenv 1>/dev/null 2>&1
then
        eval "\$(pipenv --completion)"
fi

_EOF
$ . ~/.bashrc

Python 3.9:

$ pyenv install 3.9.8 && pyenv local 3.9.8

pip:

$ python -mpip install -U pip

pipenv:

$ python -mpip install -U pipenv
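
To confirm that the interpreter picked up by the shell is the one installed through pyenv, a small check along the following lines may help. It is only a sketch: the function name `meets_minimum` is made up for illustration, and the 3.9 threshold simply mirrors the version used in this guide.

```python
import sys
from typing import Sequence, Tuple


def meets_minimum(version: Sequence[int], minimum: Tuple[int, int] = (3, 9)) -> bool:
    """Return True when the (major, minor) part of `version` is at least `minimum`."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    # sys.version_info behaves like a tuple: (major, minor, micro, ...)
    status = "OK" if meets_minimum(sys.version_info) else "too old"
    print(f"Python {sys.version.split()[0]}: {status}")
```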

Usage

Install the `data-loader-plugin` module

There are at least two ways to install the data-loader-plugin module: in the Python user space with pip, or in a dedicated virtual environment with pipenv.
- Both options may be installed in parallel
- The Python user space (typically, /usr/local/opt/python@3.9 on macOS or ~/.pyenv/versions/3.9.8 on Linux) may already have many other modules installed, which prevents fine-grained control over the version of every Python dependency. If all the versions are compatible, that option is convenient, as the module is then available from the whole user space, not just from this sub-directory

In the remainder of this Usage section, it is assumed that the data-loader-plugin module has been installed and is readily available from the environment, whether that environment is virtual or not. In other words, to adapt the documentation to the case where pipenv is used, just add pipenv run in front of every Python-related command.
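
Whether the module is actually reachable from the current environment can be checked before going further. The sketch below relies only on the standard library; the helper name `module_available` is an assumption made for this example, and `data_loader_plugin` is the import name used in the examples of this README. Run it with `python`, or with `pipenv run python` for the virtual environment.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True when the top-level module `module_name` can be imported."""
    # find_spec() returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "data_loader_plugin"
    print(f"{name}: {'available' if module_available(name) else 'not installed'}")
```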

Install in the Python user space

Install and use the data-loader-plugin module in the user space (with pip):

$ python -mpip uninstall data-loader-plugin
$ python -mpip install -U data-loader-plugin

Installation in a dedicated Python virtual environment

Install and use the data-loader-plugin module in a virtual environment:

$ pipenv shell
(python-...-JwpAHotb) ✔ python -mpip install -U data-loader-plugin
(python-...-JwpAHotb) ✔ exit

Use `data-loader-plugin` as a module from another Python program

Check the data file with the AWS command-line (CLI):

$ aws s3 ls --human s3://nyc-tlc/trip\ data/yellow_tripdata_2021-07.csv --no-sign-request
2021-10-29 20:44:34  249.3 MiB yellow_tripdata_2021-07.csv
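
In the command above, the backslash only escapes the space for the shell; the bucket is `nyc-tlc` and the object key contains a plain space. As an illustration, the two parts can be separated with the standard library. The helper name `split_s3_url` is an assumption made for this sketch and is not part of data-loader-plugin.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URL into its (bucket, key) parts."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url!r}")
    # The key is the path without its leading slash; spaces are kept as-is
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://nyc-tlc/trip data/yellow_tripdata_2021-07.csv")
print(bucket)  # nyc-tlc
print(key)     # trip data/yellow_tripdata_2021-07.csv
```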

Module import statements:

>>> import importlib
>>> from types import ModuleType
>>> from data_loader_plugin.base import DataLoaderBase

Create an instance of the DataLoaderBase Python class:

>>> plugin: ModuleType = importlib.import_module("data_loader_plugin.copyfile")
>>> data_loader: DataLoaderBase = plugin.DataLoader(
...     local_path='/tmp/yellow_tripdata_2021-07.csv',
...     # The space in the key needs no backslash inside a Python string
...     external_url='s3://nyc-tlc/trip data/yellow_tripdata_2021-07.csv',
... )
>>> data_load_success, message = data_loader.load()
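
The same steps can be wrapped in a function that also reports a missing plugin module instead of raising. This is a minimal sketch, assuming only what the example above shows: the plugin module exposes a `DataLoader` class taking `local_path` and `external_url`, and `load()` returns a `(success, message)` pair. The `load_data` helper and the `fake_plugin` stand-in are hypothetical and exist only so that the sketch runs without data-loader-plugin installed and without network access.

```python
import importlib
import sys
import types
from typing import Tuple


def load_data(plugin_module: str, local_path: str, external_url: str) -> Tuple[bool, str]:
    """Import `plugin_module`, build its DataLoader and return the result of load()."""
    try:
        plugin = importlib.import_module(plugin_module)
    except ImportError as error:
        return False, f"Cannot import '{plugin_module}': {error}"
    loader = plugin.DataLoader(local_path=local_path, external_url=external_url)
    return loader.load()


class _FakeLoader:
    """Hypothetical stand-in mimicking the load() contract shown above."""

    def __init__(self, local_path: str, external_url: str) -> None:
        self.local_path = local_path
        self.external_url = external_url

    def load(self) -> Tuple[bool, str]:
        return True, f"{self.external_url} -> {self.local_path}"


# Register the stand-in under a made-up module name
fake_plugin = types.ModuleType("fake_plugin")
fake_plugin.DataLoader = _FakeLoader
sys.modules["fake_plugin"] = fake_plugin

print(load_data("fake_plugin", "/tmp/data.csv", "s3://bucket/data.csv"))
print(load_data("no_such_plugin_module", "/tmp/data.csv", "s3://bucket/data.csv"))
```

With the real package installed, the first argument would be `"data_loader_plugin.copyfile"`, as in the session above.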

Development / Contribution

Build the source distribution and Python artifacts (wheels):

$ rm -rf _skbuild/ build/ dist/ .tox/ __pycache__/ .pytest_cache/ MANIFEST *.egg-info/
$ pipenv run python setup.py sdist bdist_wheel
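
Before uploading, it can be worth checking that the build produced both a wheel and a source archive. The following sketch lists what sits in the `dist/` directory; the helper name `list_artifacts` is an assumption made for this example, and the file patterns (`*.whl`, `*.tar.gz`) match the artifact names shown in the upload logs of this section.

```python
from pathlib import Path
from typing import Dict, List


def list_artifacts(dist_dir: str = "dist") -> Dict[str, List[str]]:
    """Group the files found in `dist_dir` into wheels and source archives."""
    root = Path(dist_dir)
    if not root.is_dir():
        # Nothing has been built yet
        return {"wheels": [], "sdists": []}
    return {
        "wheels": sorted(p.name for p in root.glob("*.whl")),
        "sdists": sorted(p.name for p in root.glob("*.tar.gz")),
    }


if __name__ == "__main__":
    for kind, names in list_artifacts().items():
        print(kind, names or "(none)")
```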

Upload to Test PyPi (no Linux binary wheel can be uploaded on PyPi):

$ PYPIURL="https://test.pypi.org"
$ pipenv run twine upload -u __token__ --repository-url ${PYPIURL}/legacy/ dist/*
Uploading distributions to https://test.pypi.org/legacy/
Uploading data_loader_plugin-0.0.1-py3-none-any.whl
100%|███████████████████████████████████████| 23.1k/23.1k [00:02<00:00, 5.84kB/s]
Uploading data-loader-plugin-0.0.1.tar.gz
100%|███████████████████████████████████████| 23.0k/23.0k [00:01<00:00, 15.8kB/s]

View at:
https://test.pypi.org/project/data-loader-plugin/0.0.1/

Upload/release the Python packages onto the PyPi repository:
- Register the authentication token for access to PyPi:

$ PYPIURL="https://upload.pypi.org"
$ pipenv run keyring set ${PYPIURL}/ __token__
Password for '__token__' in '${PYPIURL}/':

$ pipenv run twine upload -u __token__ --repository-url ${PYPIURL}/legacy/ dist/*
Uploading distributions to https://upload.pypi.org/legacy/
Uploading data_loader_plugin-0.0.1-py3-none-any.whl
100%|███████████████████████████████████████| 23.1k/23.1k [00:02<00:00, 5.84kB/s]
Uploading data-loader-plugin-0.0.1.tar.gz
100%|███████████████████████████████████████| 23.0k/23.0k [00:01<00:00, 15.8kB/s]

View at:
https://pypi.org/project/data-loader-plugin/0.0.1/

Note that the documentation is built automatically by ReadTheDocs (RTD):
- The documentation is available from https://data-loader-plugin.readthedocs.io/en/latest/
- The RTD project is set up on https://readthedocs.org/projects/data-loader-plugin/

Build the documentation manually (with Sphinx):

$ pipenv run python setup.py build_sphinx
running build_sphinx
Running Sphinx v4.3.0
[autosummary] generating autosummary for: README.md
myst v0.15.2: ..., words_per_minute=200)
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 1 source files that are out of date
updating environment: [new config] 1 added, 0 changed, 0 removed
reading sources... [100%] README
...
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] README
...
build succeeded.

The HTML pages are in build/sphinx/html.

Re-generate the Python dependency files (requirements.txt) for the CI/CD pipeline (currently Travis CI):

$ pipenv --rm; rm -f Pipfile.lock; pipenv install; pipenv install --dev
$ git add Pipfile.lock
$ pipenv lock -r > ci/requirements.txt
$ pipenv lock --dev -r > ci/requirements-dev.txt
$ git add ci/requirements.txt ci/requirements-dev.txt
$ git commit -m "[CI] Upgraded the Python dependencies for the Travis CI pipeline"

Test the data loader plugin Python module

Enter the pipenv shell:

$ pipenv shell
(python-...-iVzKEypY) ✔ python -V
Python 3.9.8

Uninstall any previously installed data-loader-plugin module/library:

(python-...-iVzKEypY) ✔ python -mpip uninstall data-loader-plugin

Launch a simple test with pytest:

(python-...-iVzKEypY) ✔ python -mpytest tests
=================== test session starts ==================
platform darwin -- Python 3.9.8, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: ~/dev/infra/python-plugin-data-loader
plugins: cov-3.0.0
collected 3 items

tests/test_copyfile.py .                             [ 33%]
tests/test_s3.py ..                                  [100%]
====================== 3 passed in 1.22s ==================
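
For contributors adding a loader, a test can follow the `(success, message)` contract returned by `load()`. The sketch below is not taken from `tests/test_copyfile.py`: `CopyFileLoader` is a hypothetical stand-in built on `shutil.copyfile`, used only to show the shape such a test might take. `tmp_path` is the temporary-directory fixture provided by pytest.

```python
import shutil
from pathlib import Path
from typing import Tuple


class CopyFileLoader:
    """Hypothetical stand-in exposing the same load() contract as the usage example."""

    def __init__(self, local_path: str, external_url: str) -> None:
        self.local_path = local_path
        self.external_url = external_url

    def load(self) -> Tuple[bool, str]:
        try:
            shutil.copyfile(self.external_url, self.local_path)
        except OSError as error:
            return False, str(error)
        return True, "copied"


def test_copy_existing_file(tmp_path: Path) -> None:
    source = tmp_path / "source.csv"
    source.write_text("a,b\n1,2\n")
    target = tmp_path / "target.csv"
    success, _ = CopyFileLoader(str(target), str(source)).load()
    assert success
    assert target.read_text() == "a,b\n1,2\n"


def test_copy_missing_file(tmp_path: Path) -> None:
    loader = CopyFileLoader(str(tmp_path / "target.csv"), str(tmp_path / "missing.csv"))
    success, message = loader.load()
    assert not success
    assert message
```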

Exit the pipenv shell:

(python-...-iVzKEypY) ✔ exit
