Overview

redun

yet another redundant workflow engine

redun aims to be a more expressive and efficient workflow framework, built on top of the popular Python programming language. It takes the somewhat contrarian view that writing dataflows directly is unnecessarily restrictive, and that by doing so we lose abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, higher-order functions, etc.). redun's key insight is that workflows can be expressed as lazy expressions that are then evaluated by a scheduler, which performs automatic parallelization, caching, and data provenance logging.

redun's key features are:

  • Workflows are defined by lazy expressions that, when evaluated, emit dynamic directed acyclic graphs (DAGs), enabling complex data flows.
  • Incremental computation that is reactive to both data changes and code changes.
  • Workflow tasks can be executed on a variety of compute backends (threads, processes, AWS Batch jobs, Spark jobs, etc.).
  • Data changes are detected for in-memory values as well as external data sources, such as files and object stores, using file hashing.
  • Code changes are detected by hashing individual Python functions and comparing against historical call graph recordings.
  • Past intermediate results are cached centrally and reused across workflows.
  • Past call graphs can be used as a data lineage record and can be queried for debugging and auditing.

See the docs, tutorial, and influences for more.

About the name: The name "redun" is self-deprecating (there are A LOT of workflow engines), but it is also a reference to its original inspiration, the redo build system.

Install

pip install redun

See developing for more information on working with the code.

Postgres backend

To use Postgres as a recording backend, use

pip install redun[postgres]

The above assumes the following dependencies are installed:

  • pg_config (in the postgresql-devel package; on Ubuntu: apt-get install libpq-dev)
  • gcc (on Ubuntu or similar: sudo apt-get install gcc)
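
Once installed, the recording backend can be pointed at Postgres through the db_uri setting in .redun/redun.ini. Here is a minimal sketch with placeholder host, port, and database name; credentials can be supplied through the db_username_env and db_password_env settings:

# .redun/redun.ini
[backend]
db_uri = postgresql://your-db-host:5432/redun
db_username_env = REDUN_DB_USERNAME
db_password_env = REDUN_DB_PASSWORD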

Use cases

redun's general approach to defining workflows makes it a good choice for implementing workflows across a wide variety of use cases.

Small taste

Here is a quick example of using redun for a familiar workflow, compiling a C program (full example). In general, any kind of data processing could be done within each task (e.g. reading and writing CSVs, DataFrames, databases, or APIs).

# make.py

import os
from typing import Dict, List

from redun import task, File


redun_namespace = "redun.examples.compile"


@task()
def compile(c_file: File) -> File:
    """
    Compile one C file into an object file.
    """
    os.system(f"gcc -c {c_file.path}")
    return File(c_file.path.replace(".c", ".o"))


@task()
def link(prog_path: str, o_files: List[File]) -> File:
    """
    Link several object files together into one program.
    """
    o_files = " ".join(o_file.path for o_file in o_files)
    os.system(f"gcc -o {prog_path} {o_files}")
    return File(prog_path)


@task()
def make_prog(prog_path: str, c_files: List[File]) -> File:
    """
    Compile one program from its source C files.
    """
    o_files = [
        compile(c_file)
        for c_file in c_files
    ]
    prog_file = link(prog_path, o_files)
    return prog_file


# Definition of programs and their source C files.
files = {
    "prog": [
        File("prog.c"),
        File("lib.c"),
    ],
    "prog2": [
        File("prog2.c"),
        File("lib.c"),
    ],
}


@task()
def make(files: Dict[str, List[File]] = files) -> List[File]:
    """
    Top-level task for compiling all the programs in the project.
    """
    progs = [
        make_prog(prog_path, c_files)
        for prog_path, c_files in files.items()
    ]
    return progs

Notice that, aside from the @task decorator, the code follows typical Python conventions and is organized like a sequential program.

We can run the workflow using the redun run command:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Upgrading db from version -1.0 to 2.0...
[redun] Start Execution 69c40fe5-c081-4ca6-b232-e56a0a679d42:  redun run make.py make
[redun] Run    Job 72bdb973:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]}) on default
[redun] Run    Job 096be12b:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job 32ed5cf8:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)]) on default
[redun] Run    Job dfdd2ee2:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) on default
[redun] Run    Job 225f924d:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=a2e6cbd9)) on default
[redun] Run    Job 3f9ea7ae:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) on default
[redun] Run    Job a8b21ec0:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun] Run    Job 5707a358:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 10:34:29
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       0       8       8
[redun] | redun.examples.compile.compile         0       0       0       0       3       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=a8d14a5e), File(path=prog2, hash=04bfff2f)]

This should have taken three C source files (lib.c, prog.c, and prog2.c), compiled them into three object files (lib.o, prog.o, prog2.o), and then linked them into two binaries (prog and prog2). Specifically, redun automatically determined the dataflow DAG between these steps and performed the compile and link steps in separate threads.

Using the redun log command, we can see the full job tree of the most recent execution (denoted -):

redun log -

Exec 69c40fe5-c081-4ca6-b232-e56a0a679d42 [ DONE ] 2021-06-18 10:34:28:  run make.py make
Duration: 0:00:01.47

Jobs: 8 (DONE: 8, CACHED: 0, FAILED: 0)
--------------------------------------------------------------------------------
Job 72bdb973 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)], 'prog2': [File(path=prog2.c, hash=c748e4c7), Fil
  Job 096be12b [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog', [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=a2e6cbd9)])
    Job dfdd2ee2 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog.c, hash=dfa3aba7))
    Job 225f924d [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=lib.c, hash=a2e6cbd9))
    Job a8b21ec0 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=7caa7f9c)])
  Job 32ed5cf8 [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.make_prog('prog2', [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=a2e6cbd9)])
    Job 3f9ea7ae [ DONE ] 2021-06-18 10:34:28:  redun.examples.compile.compile(File(path=prog2.c, hash=c748e4c7))
    Job 5707a358 [ DONE ] 2021-06-18 10:34:29:  redun.examples.compile.link('prog2', [File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=7caa7f9c)])

Notice that redun automatically detected that lib.c only needed to be compiled once and that its result could be reused (a form of common subexpression elimination).

Using the --file option, we can see all files (or URLs) that were read (r) or written (w) by the workflow:

redun log --file

File 2b6a7ce0 2021-06-18 11:41:42 r  lib.c
File d90885ad 2021-06-18 11:41:42 rw lib.o
File 2f43c23c 2021-06-18 11:41:42 w  prog
File dfa3aba7 2021-06-18 10:34:28 r  prog.c
File 4934098e 2021-06-18 10:34:28 rw prog.o
File b4537ad7 2021-06-18 11:41:42 w  prog2
File c748e4c7 2021-06-18 10:34:28 r  prog2.c
File cd0b6b7e 2021-06-18 10:34:28 rw prog2.o

We can also look at the provenance of a single file, such as the binary prog:

redun log prog

File 2f43c23c 2021-06-18 11:41:42 w  prog
Produced by Job a8b21ec0

  Job a8b21ec0-e60b-4486-bcf4-4422be265608 [ DONE ] 2021-06-18 11:41:42:  redun.examples.compile.link('prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)])
  Traceback: Exec 4a2b624d > (1 Job) > Job 2f8b4b5f make_prog > Job a8b21ec0 link
  Duration: 0:00:00.24

    CallNode 6c56c8d472dc1d07cfd2634893043130b401dc84 redun.examples.compile.link
      Args:   'prog', [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]
      Result: File(path=prog, hash=2f43c23c)

    Task a20ef6dc2ab4ed89869514707f94fe18c15f8f66 redun.examples.compile.link

      def link(prog_path: str, o_files: List[File]) -> File:
          """
          Link several object files together into one program.
          """
          o_files = " ".join(o_file.path for o_file in o_files)
          os.system(f"gcc -o {prog_path} {o_files}")
          return File(prog_path)


    Upstream dataflow:

      result = File(path=prog, hash=2f43c23c)

      result <-- <6c56c8d4> link(prog_path, o_files)
        prog_path = 'prog'
        o_files   = [File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]

      prog_path <-- argument of make_prog(prog_path, c_files)
                <-- origin

      o_files <-- derives from
        compile_result   = File(path=lib.o, hash=d90885ad)
        compile_result_2 = <4934098e> File(path=prog.o, hash=4934098e)

      compile_result <-- <45054a8f> compile(c_file)
        c_file = <2b6a7ce0> File(path=lib.c, hash=2b6a7ce0)

      c_file <-- argument of make_prog(prog_path, c_files)
             <-- argument of make(files)
             <-- origin

      compile_result_2 <-- <8d85cebc> compile(c_file_2)
        c_file_2 = File(path=prog.c, hash=dfa3aba7)

      c_file_2 <-- argument of <74cceb4e> make_prog(prog_path, c_files)
               <-- argument of <45400ab5> make(files)
               <-- origin

This output shows the original link task source code responsible for creating the program prog, as well as the full derivation, denoted "upstream dataflow". See the full example for a deeper explanation of this output. To understand more about the data structure that powers these kinds of queries, see call graphs.

We can change one of the input files, such as lib.c, and rerun the workflow. Thanks to redun's automatic incremental computation, only the minimal set of tasks is rerun:

redun run make.py make

[redun] redun :: version 0.4.15
[redun] config dir: /Users/rasmus/projects/redun/examples/compile/.redun
[redun] Start Execution 4a2b624d-b6c7-41cb-acca-ec440c2434db:  redun run make.py make
[redun] Run    Job 84d14769:  redun.examples.compile.make(files={'prog': [File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)], 'prog2': [File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]}) on default
[redun] Run    Job 2f8b4b5f:  redun.examples.compile.make_prog(prog_path='prog', c_files=[File(path=prog.c, hash=dfa3aba7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Run    Job 4ae4eaf6:  redun.examples.compile.make_prog(prog_path='prog2', c_files=[File(path=prog2.c, hash=c748e4c7), File(path=lib.c, hash=2b6a7ce0)]) on default
[redun] Cached Job 049a0006:  redun.examples.compile.compile(c_file=File(path=prog.c, hash=dfa3aba7)) (eval_hash=434cbbfe)
[redun] Run    Job 0f8df953:  redun.examples.compile.compile(c_file=File(path=lib.c, hash=2b6a7ce0)) on default
[redun] Cached Job 98d24081:  redun.examples.compile.compile(c_file=File(path=prog2.c, hash=c748e4c7)) (eval_hash=96ab0a2b)
[redun] Run    Job 8c95f048:  redun.examples.compile.link(prog_path='prog', o_files=[File(path=prog.o, hash=4934098e), File(path=lib.o, hash=d90885ad)]) on default
[redun] Run    Job 9006bd19:  redun.examples.compile.link(prog_path='prog2', o_files=[File(path=prog2.o, hash=cd0b6b7e), File(path=lib.o, hash=d90885ad)]) on default
[redun]
[redun] | JOB STATUS 2021/06/18 11:41:43
[redun] | TASK                             PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                    0       0       0       2       6       8
[redun] | redun.examples.compile.compile         0       0       0       2       1       3
[redun] | redun.examples.compile.link            0       0       0       0       2       2
[redun] | redun.examples.compile.make            0       0       0       0       1       1
[redun] | redun.examples.compile.make_prog       0       0       0       0       2       2
[redun]
[File(path=prog, hash=2f43c23c), File(path=prog2, hash=b4537ad7)]

Notice that two of the compile jobs are cached (prog.c and prog2.c), while compiling the changed library lib.c and the downstream link steps correctly rerun.

Check out the examples for more example workflows and features of redun. Also, see the design notes for more information on redun's design.

Mixed compute backends

In the above example, each task ran in its own thread. However, more generally each task can run in its own process, Docker container, AWS Batch job, or Spark job. With minimal configuration, users can lightly annotate where they would like each task to run. redun will automatically handle the data and code movement as well as backend scheduling:

@task(executor="process")
def a_process_task(a):
    # This task runs in its own process.
    b = a_batch_task(a)
    c = a_spark_task(b)
    return c

@task(executor="batch", memory=4, vcpus=5)
def a_batch_task(a):
    # This task runs in its own AWS Batch job.
    ...

@task(executor="spark")
def a_spark_task(b):
    # This task runs in its own Spark job.
    sc = get_spark_context()
    ...

See the executor documentation for more.
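
As a rough sketch, each executor name corresponds to a section in .redun/redun.ini. For example, a local default executor and an AWS Batch executor might be configured as below (the image, queue, scratch, and role values are placeholders; other executor types get analogous sections):

# .redun/redun.ini
[executors.default]
type = local
max_workers = 20

[executors.batch]
type = aws_batch
image = YOUR_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/your-image
queue = your-batch-queue
s3_scratch = s3://your-bucket/redun/
role = arn:aws:iam::YOUR_ACCOUNT:role/your-batch-job-role
job_name_prefix = redun-example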

What's the trick?

How did redun automatically perform parallel compute, caching, and data provenance in the example above? The trick is that redun builds up an expression graph representing the workflow and evaluates the expressions using graph reduction: the workflow above is evaluated by repeatedly reducing its expression graph until only concrete values remain.
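
To make the lazy evaluation concrete, here is a minimal, self-contained sketch; the add task and its values are made up for illustration and are not part of the compile example:

# sketch.py
from redun import Scheduler, task

redun_namespace = "redun.examples.sketch"


@task()
def add(x: int, y: int) -> int:
    return x + y


# Calling a task does not run it; it returns a lazy expression node.
expr = add(1, add(2, 3))
print(type(expr))  # an Expression, not an int

# The scheduler evaluates the expression graph by graph reduction,
# running independent subexpressions in parallel and caching results.
scheduler = Scheduler()
print(scheduler.run(expr))  # 5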

For a more in-depth walk-through, see the scheduler tutorial.

Why not another workflow engine?

redun focuses on making multi-domain scientific pipelines easy to develop and deploy. Its automatic parallelism, caching, code and data reactivity, and data provenance features make it a great fit for such work. However, redun does not attempt to solve all possible workflow problems, so it's perfectly reasonable to supplement it with other tools. For example, while redun provides a very expressive way to define task parallelism, it does not attempt to perform the kind of fine-grained data parallelism more commonly provided by Spark or Dask. Fortunately, redun does not perform any "dirty tricks" (e.g. complex static analysis or call stack manipulation), so we have found it possible to safely combine redun with other frameworks (e.g. pyspark, pytorch, Dask) to achieve the benefits of each tool.

Lastly, redun does not provide its own compute cluster, but instead builds upon other systems that do, such as cloud provider services for batch Docker jobs or Spark jobs.

For more details on how redun compares to other related ideas, see the influences section.

Comments
  • Add Docker image as task option

    Hi,

    For dockerized tasks, would it be possible to make the Docker image a task option, rather than an executor option?

    e.g.

    @task(executor="batch", image="{ECR_URL}/{IMAGE_NAME}")
    def some_task():
        # ...
        pass
    

    As far as I can tell, running multiple tasks on Batch in different containers currently requires specifying a separate executor for each task. This feels redundant, since the executor configuration (queue, etc) is typically the same except for the image.
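
    For illustration, the current per-image workaround looks roughly like this, with two executor sections that differ only in image (all values are placeholders):

    [executors.batch_tool_a]
    type = aws_batch
    image = YOUR_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/tool-a
    queue = your-batch-queue
    s3_scratch = s3://your-bucket/redun/

    [executors.batch_tool_b]
    type = aws_batch
    image = YOUR_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/tool-b
    queue = your-batch-queue
    s3_scratch = s3://your-bucket/redun/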

    Thanks!

    opened by mstone-modulus 9
  • Running redun pipelines locally with Docker

    Being able to run workflows locally with Docker is a huge advantage of Nextflow over other workflow tools in my opinion. I saw there is a mode (debug=True) to run local Docker containers in redun as well but it relies on the S3 scratch space.

    I think we can easily add a Docker executor which would allow folks to run pipelines fully cloud agnostic, using local Docker containers for tasks. There are two options:

    1. Add a Docker executor and use the volume mount to mount local folders (I'm assuming we can just process files without staging them).
    2. Fewer changes: using the AWS Batch executor, one can use the debug=True flag to use local Docker containers to run the pipeline. To overcome the S3 dependency, one can use a locally hosted minio. The only changes we'd need to make is to add the endpoint_url parameter to the boto S3 client in two places in file.py:

    Change https://github.com/insitro/redun/blob/main/redun/file.py#L462 to:

    [...]
    client = _local.s3_raw = boto3.client("s3", endpoint_url=endpoint_url)
    

    Change https://github.com/insitro/redun/blob/main/redun/file.py#L448 to:

    [...]
    client = _local.s3 = s3fs.S3FileSystem(anon=False, client_kwargs={"endpoint_url": endpoint_url})
    

    Let me know what you think. I'm happy to draft a quick PR to make it happen.

    opened by ricomnl 7
  • AWS Batch executor enhancements: shared memory requirements, and more faithful local dev experience

    Greetings @mattrasmus and team!

    This PR adds new capabilities to the AWS Batch Executor.

    1. Tasks can specify shared_memory as a task option. This will set the "Shared memory size" option in the Batch Job Definition. Some background on this- by default, docker containers are run with a very low 64MB shared memory setting (size of /dev/shm). Certain resource-heavy tasks require much more shared memory. In my case, it's training deep neural nets using Pytorch. See here for background.

    2. When running Batch tasks as local docker containers using debug=True, task options for compute resource requirements will now be enforced locally. This applies to vcpus, memory, gpus, and shared_memory. This is done using the docker flags --cpus, --memory, --gpus and --shm. My original goal here was just to make it possible to run in local containers with high shared memory requirements, but effectively this will make batch debug mode that much more faithful to the behavior in Batch.

    If this is a change you'd be interested in merging into redun, great! Happy to update the PR in response to feedback.

    Cheers! Dan Spitz

    opened by spitz-dan-l 7
  • redun does not execute any code for a simple example

    I'm trying a simple example with redun == 0.8.7:

    from redun import task, File
    import pandas as pd
    
    PATH = File("input.csv")
    
    @task
    def load_data(path):
        return pd.read_csv(path)
    
    
    @task
    def main(path: File = PATH) -> File:
        data = load_data(path)
        data.to_csv("data.csv")
        return File("./data.csv")
    

    where input.csv is simply

    $ cat input.csv 
    x,
    1,
    2,
    3,
    

    running redun then gives:

    $ redun run cli.py main
    [redun] redun :: version 0.8.7
    [redun] config dir: .redun
    [redun] Upgrading db from version -1.0 to 3.1...
    [redun] Tasks will require namespace soon. Either set namespace in the `@task` decorator or with the module-level variable `redun_namespace`.
    tasks needing namespace: cli.py:load_data, cli.py:main
    [redun] Start Execution f6e622ee-907c-4500-8673-464f8e5a12b6:  redun run cli.py main
    [redun] Run    Job c3c581e5:  main(path=File(path=input.csv, hash=98c594bd)) on default
    [redun] 
    [redun] | JOB STATUS 2022/04/28 00:18:12
    [redun] | TASK    PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
    [redun] | 
    [redun] | ALL           0       0       0       0       1       1
    [redun] | main          0       0       0       0       1       1
    [redun] 
    [redun] Execution duration: 0.14 seconds
    File(path=./data.csv, hash=65b8e975)
    

    But no file is produced and the load_data() task never runs.

    Is this a bug, or am I doing something wrong?

    opened by elanmart 4
  • Quality of life improvements for kubernetes executor

    • Changed s3_scratch_prefix to scratch_prefix
    • Enabled usage of gsutil instead of the aws cli in the submit_command() function
    • Added incluster loading of kubeconfig (when running head node)
    • Added service_account_name and annotations field to k8s_utils.create_job_object() function and therefore PodSpec
    • Fetched upstream changes
    opened by ricomnl 3
  • Docker executor: No such container

    To reproduce: I cloned the latest state of the redun repository and installed it via pip install -e. Then I executed the following:

    cd examples/docker
    cp ../05_aws_batch/data.tsv .
    cd docker
    make setup
    make build
    cd ..
    

    I also added a docker executor to the .redun/redun.ini file in the docker example folder:

    # redun configuration.
    
    [backend]
    db_uri = sqlite:///redun.db
    
    [executors.default]
    type = local
    max_workers = 20
    
    [executors.docker]
    type = docker
    image = redun_example
    scratch = scratch
    

    Upon running redun run workflow.py main I encountered the following error:

    [redun] Executor[docker]: submit redun job b4be464d-aa88-499b-9946-4404a3481bc8 as Docker container 89c03051ce785094202c556bfc9f619e2c005eaf9f16a8355bdbff684e955cc4:
    [redun]   container_id = 89c03051ce785094202c556bfc9f619e2c005eaf9f16a8355bdbff684e955cc4
    [redun]   scratch_path = /Users/ricomeinl/Desktop/retro/redun/examples/docker/.redun/scratch/jobs/9b358c2bab1d7db8c92811b1c7ef53fac23209fe
    [redun] 
    [redun] *** Workflow error
    [redun] 
    [redun] | JOB STATUS 2022/05/28 15:56:51
    [redun] | TASK                                         PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
    [redun] | 
    [redun] | ALL                                                1       5       0       0       0       6
    [redun] | redun.examples.docker.count_colors_by_script       0       1       0       0       0       1
    [redun] | redun.examples.docker.main                         0       1       0       0       0       1
    [redun] | redun.examples.docker.task_on_docker               0       1       0       0       0       1
    [redun] | redun.postprocess_script                           1       0       0       0       0       1
    [redun] | redun.script                                       0       1       0       0       0       1
    [redun] | redun.script_task                                  0       1       0       0       0       1
    [redun] 
    [redun] Execution duration: 2.11 seconds
    Error: No such container: 5e83c5b505f4fbd4a53850ff44b556b409337675d4a165a88ad89614568e8125
    [redun] *** Execution failed. Traceback (most recent task last):
    [redun]   File "/Users/ricomeinl/Desktop/retro/redun/redun/executors/docker.py", line 347, in _monitor
    [redun]     for job in jobs:
    [redun]   File "/Users/ricomeinl/Desktop/retro/redun/redun/executors/docker.py", line 237, in iter_job_status
    [redun]     logs = subprocess.check_output(["docker", "logs", job_id]).decode("utf8")
    [redun]   File "/Users/ricomeinl/.pyenv/versions/3.8.13/lib/python3.8/subprocess.py", line 415, in check_output
    [redun]     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    [redun]   File "/Users/ricomeinl/.pyenv/versions/3.8.13/lib/python3.8/subprocess.py", line 516, in run
    [redun]     raise CalledProcessError(retcode, process.args,
    [redun] CalledProcessError: Command '['docker', 'logs', '5e83c5b505f4fbd4a53850ff44b556b409337675d4a165a88ad89614568e8125']' returned non-zero exit status 1.
    
    opened by ricomnl 3
  • Simplify set_arg_defaults.

    @mattrasmus This PR uses the signature.bind method to simplify the logic for applying default kwargs. This is much simpler, and should continue to work on future versions of python, regardless of changes.

    Testing

    I added a few test cases to the relevant test in test_scheduler.py. They pass on both the old and new code.

    opened by lucaswiman 3
  • Setting up the environment prior to task execution

    Hey there! I have been continuing to play around with redun and I continue to be impressed.

    While I am a big fan of redun's simple model of a task as a pure input-output function, I find that in practice it is sometimes useful to set up the environment right before a task executes.

    A simple example is- say I want the environment variable MY_VAR to be set to the value MY_VALUE in all tasks that run.

    Currently, it seems like the only ways to do this are:

    1. (Local executor) - Set MY_VAR prior to starting redun.
    2. (Batch executor) - Build docker images with MY_VAR set and reference those images.
    3. (works everywhere) Modify all tasks to explicitly set MY_VAR themselves before doing anything else.

    All of the above become more difficult when the desired value of MY_VAR becomes dependent on something that can change across runs- e.g. a --setup option.

    I recognize that any generic facility for running code prior to task execution can be misused, and may weaken redun's ability to perfectly track provenance, as well as complicate its execution model. So I understand if it is not in the cards. Just wanted to mention it. I also wonder if there's a functional-programming-inspired approach that could work, a la monads.

    opened by spitz-dan-l 3
  • SQLalchemy caching warning

    I get the following warning in stdout with SQLAlchemy version 1.4.28:

    /opt/miniconda3/envs/key-test/lib/python3.7/site-packages/redun/backends/db/__init__.py:1937: SAWarning: Class Column will not make use of SQL compilation caching as it does not set the 'inherit_cache' attribute to ``True``. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this object can make use of the cache key generated by the superclass. Alternatively, this attribute may be set to False which will disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)

    Setting inherit_cache = True as a Column class variable fixes the error as suggested above. Is this correct in regards to redun functionality?

    opened by bbremer 3
  • Nesting tasks with different executors: confusing results

    I want to run most of my tasks on a local executor, then pass inputs to a large batch function. I've tried expressing this in what I think is the obvious way (following https://github.com/insitro/redun/blob/main/examples/05_aws_batch/workflow.py#L41):

    import os
    from redun import task, Dir, script
    
    redun_namespace = "test"
    
    @task(executor='batch', memory=1, cpus=1)
    def bioformats2raw():
        outfile = f"1907.zarr"
        cmd = f"mkdir {outfile}; touch {outfile}/a {outfile}/b"
        return script(cmd,
                      outputs=[Dir(f"s3://projectid-davidek-zarr-conversion/{outfile}").stage(outfile)])
    
    @task(executor="default")
    def main():
        return bioformats2raw()
    

    and my redun.ini defines two executor environments:

    # redun configuration.
    
    [backend]
    db_uri = sqlite:///redun.db
    
    [executors.default]
    type = local
    max_workers = 20
    
    [executors.batch]
    type = aws_batch
    image = awsaccountid.dkr.ecr.us-west-2.amazonaws.com/bioformats2raw
    queue = davidek-test-queue-2
    s3_scratch = s3://projectid-davidek-redun-test/redun/
    role = arn:aws:iam::awsaccountid:role/batch-ecs-job-role
    job_name_prefix = redun-example
    

    When I run this, it does say each of the tasks will run in the expected locations:

    [redun] Cached Job c3871573:  test.main() (eval_hash=3fd24539)
    [redun] Run    Job 8bb5452f:  test.bioformats2raw() on batch
    [redun] Executor[batch]: submit redun job 8bb5452f-3338-4b9a-85ca-5288f177f7a8 as AWS Batch job 77b7faf1-413a-4d54-8dc5-fa89f524d924:
    [redun]   job_id          = 77b7faf1-413a-4d54-8dc5-fa89f524d924
    [redun]   job_name        = redun-example-1ba283a7dc057903b17a2a1997df629a34c5839c
    [redun]   s3_scratch_path = s3://projectid-davidek-redun-test/redun/jobs/1ba283a7dc057903b17a2a1997df629a34c5839c
    [redun]   retry_attempts  = 0
    [redun]   debug           = False
    

    however, while the batch job runs, it looks like the script steps actually ran locally:

    [redun] Run    Job 7514841a:  redun.script(command='(\n# Save command to temp file.\nCOMMAND_FILE="$(mktemp)"\ncat > "$COMMAND_FILE" <<"EOF"\n#!/usr/bin/env bash\nset -exo pipefail\nmkdir 1907.zarr; touch 1907.zarr/a 1907.zarr/b\nEOF\n\n# Execute t..., inputs=[], outputs=[StagingDir(local=Dir(path=1907.zarr, hash=907a6eabb8205d543fb669976617797b5a78f289), remote=Dir(path=s3://projectid-davidek-zarr-conversion/1907.zarr, hash=3de668659261e26cd39c31e6b9452f701b2784c3))], task_options={}, temp_path=None) on default
    [redun] Run    Job 29f40cf1:  redun.script_task(command='(\n# Save command to temp file.\nCOMMAND_FILE="$(mktemp)"\ncat > "$COMMAND_FILE" <<"EOF"\n#!/usr/bin/env bash\nset -exo pipefail\nmkdir 1907.zarr; touch 1907.zarr/a 1907.zarr/b\nEOF\n\n# Execute t...) on default
    [redun] Run    Job 5392c430:  redun.postprocess_script(result=b'upload: 1907.zarr/b to s3://projectid-davidek-zarr-conversion/1907.zarr/b\nupload: 1907.zarr/a to s3://projectid-davidek-zarr-conversion/1907.zarr/a\n', outputs=[StagingDir(local=Dir(path=1907.zarr, hash=907a6eabb8205d543fb669976617797b5a78f289), remote=Dir(path=s3://projectid-davidek-zarr-conversion/1907.zarr, hash=3de668659261e26cd39c31e6b9452f701b2784c3))], temp_path=None) on default
    

    I verified this as well by seeing the artifacts of the batch task in my local redun dir. I ensured I'm not running with batch debug=True or --pdb.

    I was able to force the batch task to run on batch by making the default executor be batch:

    [executors.default]
    type = awsbatch
    

    so it looks to me like a "batch" task that calls script actually runs script() in a default executor, possibly because script invokes _script, which is an @task that doesn't specify an executor: https://github.com/insitro/redun/blob/main/redun/scripting.py#L158

    Either way, how do I get the functionality of script() to run in an AWS batch task?

    opened by dakoner 3
  • Config option `db_aws_secret_name` only works if the secret is in `us-west-2`

    I'm unable to use an aws secret for my redun db credentials, unless the secret is stored in the us-west-2 region.

    In my .redun file:

    [backend]
    db_aws_secret_name = my_redun_secret
    

    Then, on the commandline:

    $ redun db info
    Traceback (most recent call last):
      File "***/bin/redun", line 11, in <module>
        client.execute()
      File "***/lib/python3.7/site-packages/redun/cli.py", line 1021, in execute
        return args.func(args, extra_args, argv)
      File "***/lib/python3.7/site-packages/redun/cli.py", line 2602, in db_info_command
        backend = setup_backend_db(args.config, args.repo)
      File "***/lib/python3.7/site-packages/redun/cli.py", line 705, in setup_backend_db
        return RedunBackendDb(config=backend_config)
      File "***/lib/python3.7/site-packages/redun/backends/db/__init__.py", line 1047, in __init__
        self.db_uri: str = RedunBackendDb._get_uri(db_uri, config)
      File "***/lib/python3.7/site-packages/redun/backends/db/__init__.py", line 2324, in _get_uri
        return RedunBackendDb._get_uri_from_secret(db_aws_secret_name)
      File "***/lib/python3.7/site-packages/redun/backends/db/__init__.py", line 2354, in _get_uri_from_secret
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
      File "***/lib/python3.7/site-packages/botocore/client.py", line 391, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "***/lib/python3.7/site-packages/botocore/client.py", line 719, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the GetSecretValue operation: User: *** is not authorized to perform: secretsmanager:GetSecretValue on resource: my_redun_secret because no identity-based policy allows the secretsmanager:GetSecretValue action
    

    I think the fix will involve checking for a region config variable before creating the aws client to lookup the secret.

    I am able to work around this by using db_uri, db_username_env and db_password_env instead.

    Cheers!

    opened by spitz-dan-l 2
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Overflow Error with Pickle Protocol 3

    I am building a data pipeline with redun where large numpy.ndarrays are passed through different task functions. However, I can very easily hit this overflow exception:

    OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

    The cause of this issue seems straightforward: redun limits the pickle protocol to version 3 (https://github.com/insitro/redun/blob/0cd06c8147700f67b777b5a43a6d3e3925274bff/redun/utils.py#L40), which cannot serialize objects beyond 4 GiB in size. Whenever redun tries to serialize a decent sized numpy array, either the pickle_dumps or pickle_dump function cannot handle the object. It seems like this serialization is used in value hashing, so any use of a large ndarray breaks redun even without caching the arrays. I can provide code to re-create this, but it's pretty easy to do.
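
    For reference, the limit is easy to reproduce with the standard library alone (note that this sketch allocates a bit over 4 GiB of memory):

    import pickle

    # Protocol 3 uses 4-byte length framing for bytes objects, so anything
    # over 4 GiB cannot be serialized.
    big = b"\x00" * (4 * 1024**3 + 1)
    pickle.dumps(big, protocol=3)  # OverflowError: requires pickle protocol 4 or higher
    pickle.dumps(big, protocol=4)  # works; protocol 4 uses 8-byte length framing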

    Are there any plans to fix this constraint? Passing large arrays/tensors beyond 4 GiB in size is very common, and this seems like a blindspot in redun that would be hit frequently. It's certainly the largest blocker for us being able to use redun, which we would love to do given how elegant and straightforward the library is so far (and the lovely script tasks!).

    Some ideas for solutions:

    • A quick solution would probably be to just bump up the protocol version from 3 to 4. Version 4 was released in Python 3.4, which is well within the currently supported Python versions. However, I think this would also break all previous cache values produced by redun, which is probably unacceptable.
    • Customizing Pickler/Unpickler instances to use a higher protocol in certain situations
    • Switching to a more robust serialization tool like dill or cloudpickle (same problems as before with cache breakage)
    • Perhaps redun can wrap large objects such that only their hashes are included in the upstream expressions (which would require Value-level serialization beyond pickle 3).

    I'm not really an expert in this, so perhaps these ideas are not workable in redun.

    Another easy(er) workaround would be to use file-based objects, but this would lead to tons of file I/O and all arrays would have to be stored on the hard drive, even ones that don't need to be cached. Not to mention that the constant "read file -> perform operation -> save file" loop is very clunky and makes code unreadable and inflexible. I've also tried making a custom ProxyValue class that did custom serialization, but I just kept running in circles between the type registry, the argument preprocessing, the differences between serialization, deserialization, __setstate__, __getstate__, etc.


    I don't think my software version matters, but just to be thorough:

    Python - 3.10.5
    redun - 0.8.16
    OS - Ubuntu Linux 20.04.1 x86_64 kernel 5.15.0
    
    opened by TylerSpears 2
  • Database connection error after completing very long task

    (I'm on redun 0.8.6, so apologies if this is fixed in a more recent version.)

    After running a task that took 27 hours to complete, redun errors with psycopg2.OperationalError: SSL connection has been closed unexpectedly.

    Presumably the connection has expired while waiting this long. It may be good to attempt to reconnect every so often. There may also be arguments that could be passed to the sqlalchemy engine that accomplish this.

    Full traceback:

    Traceback (most recent call last):                                                                                                                                                               
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context                                                                       
        cursor, statement, parameters, context                                                                                                                                                       
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute                                                                           
        cursor.execute(statement, parameters)                                                                                                                                                        
    psycopg2.OperationalError: SSL connection has been closed unexpectedly                                                                                                                            
                                                                                                                                                                                                     
                                                                                                                                                                                                      
    The above exception was the direct cause of the following exception:                                                                                                                             
                                                                                                                                                                                                      
    Traceback (most recent call last):                                                                                                                                                               
      File "***/bin/redun", line 11, in <module>                                                                                                                             
        client.execute()                                                                                                                                                                             
      File "***/lib/python3.7/site-packages/redun/cli.py", line 1021, in execute                                                                                             
        return args.func(args, extra_args, argv)                                                                                                                                                     
      File "***/lib/python3.7/site-packages/redun/cli.py", line 1558, in run_command                                                                                         
        tags=tags,                                                                                                                                                                                   
      File "***/lib/python3.7/site-packages/redun/scheduler.py", line 811, in run                                                                                            
        self.process_events(result)                                                                                                                                                                  
      File "***/lib/python3.7/site-packages/redun/scheduler.py", line 855, in process_events                                                                                 
        event_func()                                                                                                                                                                                 
      File "***/lib/python3.7/site-packages/redun/scheduler.py", line 1215, in <lambda>                                                                                      
        self.events_queue.put(lambda: self._done_job(job, result, job_tags=job_tags))                                                                                                                
      File "***/lib/python3.7/site-packages/redun/scheduler.py", line 1244, in _done_job                                                                                     
        self.set_cache(job.eval_hash, job.task.hash, job.args_hash, result)                                                                                                                          
      File "***/lib/python3.7/site-packages/redun/scheduler.py", line 1630, in set_cache                                                                                     
        self.backend.set_eval_cache(eval_hash, task_hash, args_hash, value, value_hash=None)                                                                                                         
      File "***/lib/python3.7/site-packages/redun/backends/db/__init__.py", line 1696, in set_eval_cache                                                                     
        value_hash = self.record_value(value)                                                                                                                                                        
      File "***/lib/python3.7/site-packages/redun/backends/db/__init__.py", line 1475, in record_value                                                                       
        value_row = self.session.query(Value).filter_by(value_hash=value_hash).first()                                                                                                               
      File "***/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2810, in first                                                                                    
        return self.limit(1)._iter().first()                                                                                                                                                          
      File "***/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2897, in _iter                                                                                     
        execution_options={"_sa_orm_load_options": self.load_options},                                                                                                                                 
      File "***/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1692, in execute                                                                                  
        result = conn._execute_20(statement, params or {}, execution_options)                                                                                                                            
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1614, in _execute_20                                                                                 
        return meth(self, args_10style, kwargs_10style, execution_options)                                                                                                                                 
      File "***/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 326, in _execute_on_connection                                                                      
        self, multiparams, params, execution_options                                                                                                                                                         
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1491, in _execute_clauseelement                                                                       
        cache_hit=cache_hit,                                                                                                                                                                                 
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context                                                                              
        e, statement, parameters, cursor, context                                                                                                                                                             
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2027, in _handle_dbapi_exception                                                                        
        sqlalchemy_exception, with_traceback=exc_info[2], from_=e                                                                                                                                             
      File "***/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_                                                                                           
        raise exception                                                                                                                                                                                         
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1803, in _execute_context                                                                                  
        cursor, statement, parameters, context                                                                                                                                                                   
      File "***/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute                                                                                       
        cursor.execute(statement, parameters)                                                                                                                                                                       
    sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL connection has been closed unexpectedly                                                                                                            
                                                                                                                                                                                                                           
    [SQL: SELECT value.value_hash AS value_value_hash, value.type AS value_type, value.format AS value_format, value.value AS value_value                                                                                   
    FROM value                                                                                             
    WHERE value.value_hash = %(value_hash_1)s                                                               
     LIMIT %(param_1)s]                                                                                       
    [parameters: {'value_hash_1': '8ca852f0d79ba0485e9ac2c9c175853adf296425', 'param_1': 1}]                   
    (Background on this error at: https://sqlalche.me/e/14/e3q8)                                                
    
    opened by spitz-dan-l 0
  • please document required setup for scheduling tasks from python code

    The documentation helpfully provides examples of how to instantiate a Scheduler and run tasks, e.g.

    scheduler = Scheduler()
    result = scheduler.run(main())
    

    However, that does not appear to take advantage of caching -- tasks run every time -- so it's not quite analogous with running something like

    client = RedunClient()
    client.execute(["redun", "run", "tasks.py", "main"])
    

    I'm guessing that's because the scheduler object isn't making use of the database. And I expect it's relevant that I'm seeing this message when calling scheduler.run().

    INFO     redun:__init__.py:1199 Upgrading db from version -1.0 to 3.1...
    

    How do you set up a scheduler object so that it behaves more like calling redun on the command line? Or do you recommend using RedunClient instead? And could you please add that to the docs?

    opened by hamr 2
  • Code packaging for script tasks

    Currently script tasks cannot use code packaging. This would be useful for a variety of purposes, including running non-python code snippets that may import user-authored source code.

    To do it, I think you'd need to render the code package extraction code as bash in the script wrapper. Any future changes to code packaging that affected extraction would have to be maintained in two places- the python version and bash version.

    Is this something you'd be interested in accepting as a PR?

    Cheers, Dan Spitz

    opened by spitz-dan-l 0
  • Task-specific code packaging

    Hey there redun team!

    This one is a bit of a corner case. I would like to disable code packaging for a particular task in a pipeline, but keep it enabled for the rest. I think having a task option to disable code packaging would be the most natural way to do this. As it stands, I have to use two separate executors, one without code packaging and one with.

    The use case- I'm using redun to orchestrate some end-to-end tests. In the tasks where I actually run the tests, I want it to use the exact code built into the image (so code packaging disabled). For the rest of the tasks that set up and process the test results, code packaging should still be fine and handy (so enabled).

    Cheers, Dan Spitz

    opened by spitz-dan-l 1