# PdpCLI
## Introduction
PdpCLI is a pandas DataFrame processing CLI tool that enables you to build a pandas pipeline, powered by pdpipe, from a configuration file. You can also extend the pipeline stages and data readers / writers with your own Python scripts.
## Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame
- Import / export data with multiple protocols: S3 / Database (MySQL, Postgres, SQLite, ...) / HTTP(S)
- Extensible pipeline and data readers / writers
## Installation
Installing the library is simple using pip:

```
$ pip install "pdpcli[all]"
```
## Tutorial
### Basic Usage
- Write a pipeline config file `config.yml` like below. The `type` fields under `pipeline` correspond to the snake-cased class names of the `PdpipelineStage`s. Other fields such as `stages` and `columns` are the parameters of the `__init__` methods of the corresponding classes. Internally, this configuration file is converted to Python objects by `colt`.
```yaml
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job
    encode:
      type: one_hot_encode
      columns: sex
    tokenize:
      type: tokenize_text
      columns: content
    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
```
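For reference, the config above corresponds roughly to building the same pipeline directly with pdpipe. The following is a sketch under the assumption that the snake-cased `type` names resolve to the pdpipe classes `ColDrop`, `OneHotEncode`, `TokenizeText`, and `TfidfVectorizeTokenLists`:

```python
# A sketch of the pipeline that config.yml describes, assembled directly
# with pdpipe (assumed equivalent; PdpCLI/colt builds it from the config).
import pdpipe as pdp

pipeline = pdp.PdPipeline([
    pdp.ColDrop(["name", "job"]),                              # drop_columns
    pdp.OneHotEncode("sex"),                                   # encode
    pdp.TokenizeText("content"),                               # tokenize
    pdp.TfidfVectorizeTokenLists("content", max_features=10),  # vectorize
])
```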
- Build a pipeline by training it on `train.csv`. The following command generates a pickled pipeline file `pipeline.pkl` after training. If you specify a URL instead of a local file path, the file will be automatically downloaded and cached.
```
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
```
- Apply the fitted pipeline to `test.csv` and write the processed result to `processed_test.jsonl` with the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported in the JSON-Lines format.
```
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
```
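Since `pipeline.pkl` is a pickled pipeline, you can also reuse the fitted pipeline from Python. A minimal sketch, assuming the file unpickles to a pdpipe pipeline object:

```python
# Reuse the fitted pipeline outside the CLI (assumes pipeline.pkl
# unpickles to a pdpipe pipeline, which is callable on DataFrames).
import pickle

import pandas as pd

with open("pipeline.pkl", "rb") as f:
    pipeline = pickle.load(f)

df = pd.read_csv("test.csv")
processed = pipeline(df)
print(processed.head())
```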
- You can also run the pipeline directly from a config file, without fitting the pipeline first.
```
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
```
- It is possible to override or add parameters by adding command line arguments:

```
$ pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name
```
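The dotted key addresses the nested entry in the config: here it targets the `drop_columns` stage under `pipeline.stages`, as laid out in `config.yml` above.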
### Data Reader / Writer
PdpCLI automatically detects a suitable data reader / writer based on a given file name. If you need to use a different data reader / writer, add a `reader` or `writer` config to `config.yml`. The following config is an example of using the SQL data reader, which fetches records from the specified database and converts them into a pandas DataFrame.
```yaml
reader:
  type: sql
  dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database
```
Config files are interpreted by OmegaConf, so `${env:...}` is interpolated from environment variables.
Prepare your SQL file `query.sql` to fetch data from the database:
```sql
select * from your_table limit 1000
```
You can execute the pipeline with the SQL data reader via:
```
$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql
```
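Under the hood, a SQL reader like this amounts to running the query and wrapping the result in a DataFrame. The following sketch illustrates the idea with pandas and SQLAlchemy; it is not PdpCLI's actual implementation:

```python
# Illustrative sketch only, not PdpCLI's code: fetch records with the
# DSN from the config above and build a pandas DataFrame.
import os

import pandas as pd
from sqlalchemy import create_engine

dsn = (
    f"postgresql://{os.environ['POSTGRES_USER']}:"
    f"{os.environ['POSTGRES_PASSWORD']}@your.postgres.server/your_database"
)
engine = create_engine(dsn)
df = pd.read_sql("select * from your_table limit 1000", engine)
```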
### Plugins
By using plugins, you can extend PdpCLI with your own pipeline stages, data readers / writers, and commands.
#### Add a new stage
- Write your plugin script `mypdp.py` like below. `Stage.register(...)` registers your pipeline stages, and you can specify these stages by writing the registered name in the `type` field of your config file.
```python
import pdpcli


@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        # Precondition check: this stage can always be applied.
        return True

    def _transform(self, df, verbose):
        # Print the DataFrame and pass it through unchanged.
        print(df.to_string(index=False))
        return df
```
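The `_prec` / `_transform` pair mirrors pdpipe's `PdPipelineStage` interface: `_prec` is the stage's precondition check and `_transform` does the actual processing. Assuming `pdpcli.Stage` keeps that interface, you can smoke-test the stage on its own:

```python
# Quick local check (assumes PrintStage can be constructed with no
# arguments and applied like a pdpipe stage).
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [1, 2]})
PrintStage().apply(df)  # prints the frame and returns it unchanged
```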
- Update `config.yml` to use your plugin.
```yaml
pipeline:
  type: pipeline
  stages:
    drop_columns:
      ...
    print:
      type: print
    encode:
      ...
```
- Execute the command with `--module mypdp`, and you can see the processed DataFrame after `drop_columns` has run.
```
$ pdp apply config.yml test.csv --module mypdp
```
#### Add a new command
You can add not only new stages but also new commands.
- Add the following script to `mypdp.py`. This `greet` command prints out a greeting message with your name.
```python
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        print(f"Hello, {args.name}!")
```
- To register this command, you need to create a `.pdpcli_plugins` file that lists one module name per line. Due to the module importing order, the `--module` option is unavailable for command registration.
```
$ echo "mypdp" > .pdpcli_plugins
```
- Run the following command and get a message like below. With the `.pdpcli_plugins` file, you do not need to add the `--module` option to the command line for each execution.
```
$ pdp greet --name altescy
Hello, altescy!
```