Reproducible Data Science at Scale!

Overview


Pachyderm: The Data Foundation for Machine Learning

Pachyderm provides the data layer that allows machine learning teams to productionize and scale their machine learning lifecycle. With Pachyderm's industry-leading data versioning, pipelines, and lineage, teams gain data-driven automation, petabyte scalability, and end-to-end reproducibility. Teams using Pachyderm get their ML projects to market faster, lower data processing and storage costs, and can more easily meet regulatory compliance requirements.

Features

  • Automated Data Versioning: Pachyderm’s Data Versioning gives teams an automated and performant way to keep track of all data changes.
  • Data-Driven Pipelines: Pachyderm’s Containerized Pipelines speed data processing while lowering compute costs.
  • Immutable Data Lineage: Pachyderm’s data lineage provides an immutable record for all activities and assets in the ML lifecycle.
  • Console: The Pachyderm Console provides an intuitive visualization of your DAG (directed acyclic graph), and aids in reproducibility.
  • Notebooks: Pachyderm Notebooks provide an easy way to interact with Pachyderm data versioning and pipelines via Jupyter notebooks.

Getting Started

To start deploying your end-to-end version-controlled data pipelines, try us for free on Hub with little to no setup, or run Pachyderm locally. You can also deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.


Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Twitter: follow us on Twitter.
  • Slack: join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up-to-date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go, and distributed systems? Learn more about our open positions.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.
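In manifest terms, disabling metrics is just an environment entry on the pachd container. A minimal sketch using standard Kubernetes pod-spec fields; the surrounding manifest layout depends on how you deployed:

```yaml
# Excerpt of the pachd container definition; only the env entry is shown.
env:
  - name: METRICS
    value: "false"
```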

License Information

Pachyderm has moved some components of Pachyderm Platform to a source-available limited license.

We remain committed to the culture of open source, developing our product transparently and collaboratively with our community, and giving our community and customers source code access and the ability to study and change the software to suit their needs.

Under the Pachyderm Community License, you can access the source code and modify or redistribute it; there is only one thing you cannot do, and that is use it to make a competing offering.

Check out our License FAQ Page for more information.

Comments
  • pachd fails: panic: failed to initialize pach client: context deadline exceeded


    What happened?:

    Ran pachctl deploy to create an on-premises pachyderm cluster:

    pachctl deploy custom --object-store s3 any-string 10 <bucket> <accesskey> <secretkey> rook-ceph-rgw-my-store.rook-ceph:80 --etcd-storage-class nfs-client --image-pull-secret boss-6000 --namespace pachyderm --dynamic-etcd-nodes 1
    

    pachd is failing to start up and is reporting the following in the logs:

    2019-12-16T18:46:13Z INFO no Jaeger collector found (JAEGER_COLLECTOR_SERVICE_HOST not set) 
    2019-12-16T18:46:19Z WARNING TLS disabled: could not stat public cert at /pachd-tls-cert/tls.crt: stat /pachd-tls-cert/tls.crt: no such file or directory 
    2019-12-16T18:46:19Z WARNING s3gateway TLS disabled: could not stat public cert at /pachd-tls-cert/tls.crt: stat /pachd-tls-cert/tls.crt: no such file or directory 
    2019-12-16T18:46:20Z INFO validating kubernetes access returned no errors 
    2019-12-16T18:46:49Z INFO error starting githook server context deadline exceeded 
     
    panic: failed to initialize pach client: context deadline exceeded 
     
    goroutine 492 [running]: 
    github.com/pachyderm/pachyderm/src/server/pkg/serviceenv.(*ServiceEnv).GetPachClient(0xc00021f450, 0x2ad81a0, 0xc00053a2c0, 0xc00053a2c0) 
    	src/github.com/pachyderm/pachyderm/src/server/pkg/serviceenv/service_env.go:171 +0x11a 
    github.com/pachyderm/pachyderm/src/server/pps/server.(*apiServer).master.func1(0x0, 0x0) 
    	src/github.com/pachyderm/pachyderm/src/server/pps/server/master.go:58 +0xe5 
    github.com/pachyderm/pachyderm/src/server/pkg/backoff.RetryNotify(0xc00088c220, 0x2a99520, 0xc00061d6e0, 0xc0009fbfb8, 0x2a, 0xc00113f4c0) 
    	src/github.com/pachyderm/pachyderm/src/server/pkg/backoff/retry.go:35 +0x4a 
    github.com/pachyderm/pachyderm/src/server/pps/server.(*apiServer).master(0xc0002bcfc0) 
    	src/github.com/pachyderm/pachyderm/src/server/pps/server/master.go:52 +0x20a 
    created by github.com/pachyderm/pachyderm/src/server/pps/server.NewAPIServer 
    	src/github.com/pachyderm/pachyderm/src/server/pps/server/server.go:67 +0x3d4 
    panic: failed to initialize pach client: context deadline exceeded 
     
    goroutine 513 [running]: 
    github.com/pachyderm/pachyderm/src/server/pkg/serviceenv.(*ServiceEnv).GetPachClient(0xc00021f450, 0x2ad81e0, 0xc0000560d0, 0x7f01cf21a008) 
    	src/github.com/pachyderm/pachyderm/src/server/pkg/serviceenv/service_env.go:171 +0x11a 
    github.com/pachyderm/pachyderm/src/server/transaction/server.newAPIServer.func1(0xc001107260) 
    	src/github.com/pachyderm/pachyderm/src/server/transaction/server/api_server.go:43 +0x48 
    created by github.com/pachyderm/pachyderm/src/server/transaction/server.newAPIServer 
    	src/github.com/pachyderm/pachyderm/src/server/transaction/server/api_server.go:43 +0x103 
    
    

    What you expected to happen?:

    pachd should load successfully

    How to reproduce it (as minimally and precisely as possible)?:

    Anything else we need to know?:

    Environment?:

    • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:23:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:13:49Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    
    • Pachyderm CLI and pachd server version (use pachctl version):
    COMPONENT           VERSION
    pachctl             1.9.9
    pachd               1.9.9
    
    • Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s): on-premises Rancher 2.3.3 with 7 nodes
    • OS (e.g. from /etc/os-release): Ubuntu 18.04
    • Others:
    opened by benwbooth 27
  • Stuck in "Pulling State" after delete-all

    Hi. I have been working with Pachyderm for several days. Since there are many repos, I used delete-all to delete all of the files and start new experiments. Then I ran into a new problem.

    I used the 'fruit-stand' example to check whether it still works. Although I performed the same operations as I did several days before, the jobs have been stuck in the Pulling state for several hours:

    pachctl list-job
    ID                                 OUTPUT                                    STARTED              DURATION            STATE
    48521d8b6ff23c6ba66665ef6807d0e0   filter/665d9d51d3114a1e98c5bb7f432f8374   About a minute ago   -                   pulling
    649ddf50f993eafc56c5d205b7dd153f   filter/fbc0accb6d4f48cc9caec8de282d8b5a   About a minute ago   -                   pulling
    

    I searched previous issues and found that this happens when Pachyderm cannot find the required image. So I speculate that when I ran `delete-all`, the Docker image was deleted as well.

    opened by ShengjieLuo 27
  • Explore micro k8s as minikube alternative


    Mostly a suggestion for documentation purposes. A user reported an easier installation path and much better performance in dev cluster mode.

    Not sure if there are other "gotchas" or limitations, but we've seen a few users asking about it so probably worth documenting.

    feature request docs 
    opened by JoeyZwicker 25
  • start-kube-docker not working in Vagrant image


    Trying to run Pachyderm in Vagrant using the Vagrantfile/init.sh in the Github documentation QUICKSTART.md. gcr.io/google_containers/hyperkube:v1.1.2 container does not start.

    Steps to reproduce:

    vagrant destroy # or download per README.md
    vagrant up
    vagrant ssh
    
    go get github.com/pachyderm/pachyderm/...
    cd ~/go/src/github.com/pachyderm/pachyderm
    etc/kube/start-kube-docker.sh
    
    ~/pachyderm_vagrant$ vagrant version
    Installed Version: 1.7.4
    Latest Version: 1.7.4
    
    You're running an up-to-date version of Vagrant!
    

    Console log: kubeNotStarting.txt

    opened by brinman2002 23
  • Error 'the server has asked for the client to provide credentials'


    Hi. I deployed Pachyderm on another server. The installation was successful with:

    COMPONENT           VERSION
    pachctl             1.1.0
    pachd               1.1.0
    

    However, when I run the fruit-stand pipeline, every job fails:

    ID                                 OUTPUT                                     STARTED             DURATION             STATE
    d6ddca2dfd72e6a0da7053ba5151b4cb   filter3/1dd4428dcc4d40359e7bcc3cdb594f3b   8 seconds ago       Less than a second   failure
    2610bae2936923f0ce850c04f2cedad3   filter2/d58ac2a1231d4db4bb2487554bf36273   25 minutes ago      Less than a second   failure
    dfe6acbbcd241d55a394a95077df5d1e   filter/a4934ebe280c4e2cae2e6cfb4b1c4c04    2 hours ago         Less than a second   failure
    e9a26e0594f6bd00bacefa33c1b9850a   filter/78a4cea8361c41f3a9a2c8e2f9679bb0    2 hours ago         Less than a second   failure
    

    I tried it several times, but every run failed.

    Here is the pipeline information:

    NAME                INPUT               OUTPUT              STATE
    filter              data                filter              running
    sum3                filter2             sum3                running
    filter2             data                filter2             running
    sum2                filter2             sum2                running
    filter3             data                filter3             running
    sum                 filter              sum                 running
    

    I checked the logs for the problem:

    pachctl get-logs d6ddca2dfd72e6a0da7053ba5151b4cb
    the server has asked for the client to provide credentials (get pods)
    

    I can delete the repo in this pipeline, but I can't delete the pipeline itself:

    pachctl delete-pipeline filter3
    error from DeletePipeline: the server has asked for the client to provide credentials (get pods)
    

    I found where the message comes from in the source code:

    src/server/vendor/k8s.io/kubernetes/pkg/api/errors/errors.go 
    case http.StatusUnauthorized:
            reason = unversioned.StatusReasonUnauthorized
            message = "the server has asked for the client to provide credentials"
    

    Unfortunately, I have hit so many problems these days that I have had to ask for help every day...

    opened by ShengjieLuo 21
  • Get Pachyderm Working with OpenShift


    Per our discussions, it looks like it may be a privilege issue or something similar. I have attached the steps to install the OpenShift vagrant image and then how to deploy pachyderm. You can troubleshoot through normal Kube commands as well. Everything is created and running, except the pachd pods do not start.

    https://gist.github.com/munchee13/8cf64f2c1797d1d60891b28a193767f6

    opened by munchee13 21
  • Improve Custom/On-Premise Docs


    There's been a lot of interest in on-prem clusters lately, and our docs on the subject aren't very good. The main issues that people have been hitting are:

    • When to use the custom deploy
    • Confusion around how to use the custom deploy for cloud vs. on prem deploys
    • How to modify the manifest for on prem deploys
    • What needs to be changed in the deploy process for OpenShift, OpenStack, and other systems.
    • NotExist error with CEPH S3 interface deploy

    I think the following will probably make the process better for on prem users:

    • Use the Helm chart as default for on prem deploys. Our users seem to indicate in Slack that this is much easier.
    • Update the custom/on-prem docs to emphasize custom object stores in the cloud vs. completely on-prem solutions.
    • Test the custom deploy commands and the Helm chart to see if updates are needed for the latest Pachyderm versions.
    docs openshift size: XL priority: high 
    opened by dwhitena 20
  • Add support for minio deploy


    This adds support for Minio and all other S3-compatible servers. The patch also uses minio-go, which has the added benefit that it works transparently with S3 as well.

    opened by harshavardhana 20
  • pipelines stalling


    I have a Pachyderm repo with a few commits in it (each on a separate branch). Each commit is about 100MB in size. The processing step begins and starts outputting data into the next pipeline, but stops working after about 1MB of output. If I force-finish the commit and inspect the output, I see that only some of the expected output was generated, and some of the files list the Unix epoch as their creation date some of the time. Moreover, the pipeline takes an exceptionally long time to run compared to running the same code outside of Pachyderm.

    opened by JonathanFraser 20
  • Contexts


    This introduces a new, backwards-incompatible version of configs, a migration to update old configs, and the related behavioral changes to pachctl from the new config.

    Closes #3774 Fixes #3538 Closes #3036 Fixes #3419

    Contexts

    The largest addition is contexts, which are akin to kubectl contexts. Instead of building off of config V1 and having a different config file for each context (as proposed in #3774), this stores all contexts in a single config file. The reasons for this:

    1. The hope with multiple files was that we could save effort by just building off of config V1. I no longer think that is the case, because bolting contexts on top of that design appears to be just as much work, if not more.
    2. This design avoids ambiguity when a user sets PACH_CONFIG that the multiple config file design has.
    3. It more closely follows k8s' approach.

    Contexts have the same fields as config V1, plus a context source field, which specifies where contexts came from.

    Active context

    Config V2 contains a reference to the currently active context. The active context can be overridden via the env var PACH_CONTEXT.
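    As a sketch, a V2 config with two contexts might look like this. The field names here are illustrative, not the authoritative schema, and the addresses and context names are placeholders:

```json
{
  "v2": {
    "active_context": "local",
    "contexts": {
      "local":      { "pachd_address": "localhost:30650" },
      "production": { "pachd_address": "pachd.example.com:30650" }
    }
  }
}
```

    With a config shaped like this, running `PACH_CONTEXT=production pachctl ...` would override the active context for a single invocation.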

    Metrics

    Rather than having a global flag --no-metrics that needs to be passed in for each call to pachctl, the ability to disable metrics is now specified in the config.

    Config implementation changes

    Configs are now read only once per run of pachctl. Before, the config was read multiple times, which allowed for subtle bugs (e.g. if the user ID wasn't yet set, it would be reset multiple times). I did not add any further locking to ensure changes can't overwrite each other. Given the current uses of configs (which are predominantly read-only), I think this is safe enough for now, but it is certainly not bulletproof!

    Migrations

    A migration is run the first time the config is read. The V1 config is turned into a context.

    Deployments

    When a (non-dry-run) deployment succeeds, a new pach context is created, and the user is automatically switched to the new context.

    New commands

    • pachctl config get metrics - gets whether metrics are enabled
    • pachctl config set metrics (true|false) - sets whether metrics are enabled
    • pachctl config get active-context - gets the active context
    • pachctl config set active-context [name] - sets the active context
    • pachctl config get context [name] - gets a context config JSON by name
    • pachctl config set context [name] [--overwrite] - sets a context config from JSON stdin
    • pachctl config update context [name] --pachd-address=[address] - updates the pachd address of an existing context (this is the only field that is updatable without completely overwriting a context, at the moment.)
    • pachctl config delete context [name] - removes a context
    • pachctl config list context - lists all contexts

    Removals

    The global --no-metrics and --no-port-forwarding flags were removed in favor of config values.

    opened by ysimonson 19
  • Better documentation for cluster API ingress


    It seems a lot of user questions center around how API ingress works. Didn't find docs about this, so just wrote this up for a customer. Should probably be migrated into our docs if we don't have something like this already

    To clarify about ingress/NodePorts: pachyderm doesn't really care how a user gets access to it, so long as their local pachctl client can talk to the pachd pod. pachctl has built-in support for two different methods: setting the PACHD_ADDRESS, and using pachctl port-forward

    1. Setting the PACHD_ADDRESS env var to point at a host:port that directs traffic to the pachd pod tells pachctl to just talk to that endpoint directly. This is the flow NodePort supports -- it makes the internal pachd pod's API port accessible on the cluster's external address, so that users can set PACHD_ADDRESS=cluster-address:30650 (30650 is the default), and the k8s/OC cluster will send that traffic to the pachd pod. Because this port is a global resource at the k8s/OC cluster level, it needs to be unique per Pachyderm cluster, but it can be changed to whatever you want for a given Pachyderm deployment, and shouldn't affect Pachyderm's internal operation (so long as the user of pachctl has the right value in their PACHD_ADDRESS variable).

    2. ~~pachctl port-forward piggy-backs on kubectl to fetch the name of the pachd pod within the k8s cluster/namespace the user is currently connected to, and then runs kubectl port-forward to direct traffic from the user's local machine to pachd via the k8s API. In this case, setting the PACHD_ADDRESS variable isn't needed, but the user needs to have k8s access set up, pointing at the namespace for their pachyderm cluster~~

    Edited, based on @ysimonson 's suggestion:

    1. pachctl port-forward piggy-backs off kubectl's config file and client API -- it reads kubectl's config file to fetch the name of the pachd pod within the k8s cluster/namespace the user is currently connected to, then uses the kubernetes API to effectively run kubectl port-forward. This directs traffic from the user's local machine to pachd via the k8s API. If using pachctl port-forward, then setting the PACHD_ADDRESS variable isn't needed. Instead, the user needs to have k8s access set up, pointing at the namespace for their pachyderm cluster

    As of 1.8.3, pachctl port-forward happens automatically when running any pachctl command that tries to access pachd, but in order to open a persistent tunnel to a number of other ports Pachyderm uses (the dashboard, git and auth hooks, the built-in HTTP file API, etc.), users will still need to run pachctl port-forward explicitly.
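    Flow (1) above can be sketched in two lines (the address is a placeholder; 30650 is the default NodePort mentioned above):

```shell
# Point pachctl directly at the NodePort on the cluster's external address.
# 192.0.2.10 is a placeholder address; 30650 is the default pachd NodePort.
export PACHD_ADDRESS=192.0.2.10:30650

# Any subsequent pachctl command now talks to that endpoint directly, e.g.:
# pachctl version
```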

    Also, it seems pachctl port-forward isn't working with OpenShift, but the following oc port-forward command does effectively the same thing:

    PACHD_POD_NAME=`oc get pod --output=json | jq -r '.items[] | select(.metadata.name|startswith("pachd")).metadata.name'`  #  -r flag is needed to not get quotes in the output
    
    oc port-forward pod/$PACHD_POD_NAME 30650:1650
    
    docs openshift size: L priority: high solutions-architecture 
    opened by gabrielgrant 19
  • Can't run pachctl on WSL2


    What happened?:

    Following the local pachyderm instructions (running on WSL2 / 20.04):

    • Install homebrew and run the Next steps
      • tested everything works via brew install hello
    • Install pachctl via brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.4

    Trying to run pachctl, gives the following message:

    pachctl
    zsh: permission denied: pachctl
    
    # same when running it via bash:
    /bin/bash pachctl
    /home/linuxbrew/.linuxbrew/bin/pachctl: /home/linuxbrew/.linuxbrew/bin/pachctl: cannot execute binary file
    

    What you expected to happen?:

    Not get the permission denied: pachctl message.

    How to reproduce it (as minimally and precisely as possible)?:

    # run the next steps as recommended from the following command too
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    
    brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.4
    pachctl
    

    Anything else we need to know?: Installing other packages via brew seems to work, so I don't think it's a Homebrew issue. (E.g. I can finish the local deploy guide, including the helm install via brew.)

    Environment?:

    • Kubernetes version (use kubectl version):
    • Pachyderm CLI and pachd server version (use pachctl version):
    • Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s):
    • If you deployed with helm, the values you used (helm get values pachyderm):
    • OS (e.g. from /etc/os-release):
    • Others:

    This is on WSL2.

    cat /etc/os-release
    NAME="Ubuntu"
    VERSION="20.04.5 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.5 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal
    

    The permissions for linuxbrew look all right:

    ls -lhA /home/linuxbrew/.linuxbrew/bin/pachctl
    lrwxrwxrwx 1 mheiser mheiser 40 Dec 27 09:14 /home/linuxbrew/.linuxbrew/bin/pachctl -> ../Cellar/pachctl@2.4/v2.4.2/bin/pachctl
    
    bug 
    opened by Persedes 6
  • [2.4.x backport][Jupyter] Fix datums-related error message when notebooks starts up


    Make the datums request at notebook startup only when connected to a cluster and, if auth is enabled, logged in (i.e. when the mount server response is 200).

    Ticket: https://linear.app/pachyderm/issue/INT-760/error-message-when-starting-up-notebooks

    JIRA: INT-782

    opened by smalyala 1
  • Warn about outdated pachctl


    This implements some compatibility checking between pachctl (actually any Go client) and pachd.

    Compatibility is defined as:

    • Either the client or the server is 0.0.0, as developer builds are.
    • If either the client or the server are a pre-release (nightly/alpha/beta/rc), then the server and client versions have to be an exact match.
    • Otherwise, the major and minor number have to be the same.

    See the test cases in version_test.go.
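    The three rules above can be sketched as a small check. This is a sketch only; the real implementation is in Go, and the `compatible` helper here is hypothetical:

```shell
# Sketch of the compatibility rule: developer builds (0.0.0) always match,
# pre-releases need an exact match, otherwise major.minor must agree.
compatible() {
  client=$1 server=$2
  # Developer builds are always compatible.
  [ "$client" = "0.0.0" ] && return 0
  [ "$server" = "0.0.0" ] && return 0
  # Pre-releases (nightly/alpha/beta/rc carry a "-suffix") need an exact match.
  case "$client$server" in
    *-*) [ "$client" = "$server" ]; return ;;
  esac
  # Otherwise only major.minor must be the same.
  [ "${client%.*}" = "${server%.*}" ]
}

compatible 2.4.1 2.4.7 && echo "compatible"    # same major.minor
compatible 2.4.1 2.5.0 || echo "incompatible"  # minor differs
```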

    The client always calls InspectCluster before connecting, so this PR modifies InspectCluster to take the client's version as a parameter. The server then checks that version against its own version and returns warnings in the response. There is also a version_warnings_ok flag. Old servers won't set that (since it's not in the message version they have), so the client can detect a way-too-old server. If there are any warnings set, the client will log them at level error. Technically this would be intrusive to users of the Go client, but since the pctx.TODO() logger points to a no-op logger until someone calls InitPachctlLogger(), and that's a symbol they can't import, only pachctl users will ever see this.

    It is a little weird to change InspectCluster from taking Empty to taking a message type, but it seems perfectly safe to me. The Go client API doesn't change; only people that directly generate stubs and call methods on it (like some of our tests) are affected. Users of the Go client that want to send a version have the option to do so with a new function in the client, InspectClusterWithVersion.

    The server logs at INFO level whenever an incompatible client is detected, so even if users miss the warnings, administrators can know.

    Here's an example of what it looks like in pachctl (runs for every command, can't be turned off). In this particular case, a "released" client is talking to a nightly build, which requires an exact version match between client and server:

    (screenshot omitted)

    And the server logs: (screenshot omitted)

    The server log is the same for every case (modulo the error field), but the client message varies based on the constants in admin/api_server.go. Feel free to wordsmith them.

    Annoyingly, we don't seem to send an auth token with InspectCluster, so the server can't report the user name that is using an out of date client. We should probably do something about that.

    opened by jrockway 1
  • Increase reliability of debug dumps


    Fixes CORE-1193 and CORE-1294.

    This PR does a bunch of stuff to make debug dumps more reliable, at least without burning the whole thing down and starting over.

    pachctl debug dump can now specify a timeout; it defaults to 30m.

    The timeout is adjusted down on the server side to about 90% of the client timeout. That means the debug dumper has some time to handle context deadline exceeded and start producing output before the RPC is totally aborted. I've had good results with timeouts as low as 100ms; you don't get everything, but you get some files. At 30m it should be Really Good (tm).
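    The adjustment is simple arithmetic; as a sketch (the 90% figure is from the description above; the exact rounding in the real code may differ):

```shell
# Reserve ~10% of the client's deadline so the dumper can flush partial
# output before the RPC itself is aborted.
client_timeout_ms=1800000                          # 30m, the default
server_timeout_ms=$((client_timeout_ms * 9 / 10))  # => 1620000 ms (27m)
echo "$server_timeout_ms"
```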

    Every multi-step operation that the dumper does now continues in the face of errors, if the error doesn't affect the next thing. Every for loop or function that does two+ things now uses multierr.Append to collect all the errors. That means if we hit an issue where we try to do something silly like InspectPipeline an input repo, we just continue doing the whole debug dump anyway. At the very end, an error will be returned, but we can still write all the other files.

    I fixed the thing where we did InspectPipeline on an input repo; there was a missing continue statement. I also tried to fix PPS's error message for a pipeline not being found, but it's actually not relevant to this PR. (I don't think the code can ever hit the case I "fixed", but in case it does, hey now the error type is correct. We still don't return grpc.status = NotFound from PPS under any of these circumstances though.)

    I added some arbitrary timeouts around things I don't think will be too slow, like we did for Loki.

    I noticed that the Pod Describer from the k8s library can't take a context. That means it could run forever, so I put it in a background goroutine; the foreground goroutine tries to get its output until the context expires, and then it just abandons it and moves on. This will leak memory if it runs forever, but hey, after we review the debug dump we'll probably tell you to restart pachd anyway. In the future we'll have to just collect pod YAMLs instead of "describe" output. Or fork k8s.io/client-go to make the silly thing take a context.

    As an example, here's what a run with an aggressive timeout looks like now:

    $ rm dump.tgz; /usr/bin/time pachctl debug dump dump.tgz  --timeout=1s ; tar tzvf dump.tgz; du -h dump.tgz
    rpc error: code = Unknown desc = listPipelines: context deadline exceeded; appLogs: context deadline exceeded; collectDatabaseDump: collectDatabaseTables: list tables: context deadline exceeded
    Command exited with non-zero status 1
    0.09user 0.02system 0:01.04elapsed 11%CPU (0avgtext+0avgdata 66060maxresident)k
    0inputs+2176outputs (0major+2005minor)pagefaults 0swaps
    -rwxrwxrwx 0/0            6214 1969-12-31 19:00 source-repos/default/benchmark-upload/commits.json
    -rwxrwxrwx 0/0           52020 1969-12-31 19:00 source-repos/default/benchmark-upload/commits-chart.png
    -rwxrwxrwx 0/0            8083 1969-12-31 19:00 source-repos/default/images/commits.json
    -rwxrwxrwx 0/0           45961 1969-12-31 19:00 source-repos/default/images/commits-chart.png
    -rwxrwxrwx 0/0              17 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/version.txt
    -rwxrwxrwx 0/0            7612 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/describe.txt
    -rwxrwxrwx 0/0         8690350 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs.txt
    -rwxrwxrwx 0/0              80 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-previous/error.txt
    -rwxrwxrwx 0/0         8640042 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-loki.txt
    -rwxrwxrwx 0/0           22422 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/go_info.txt
    -rwxrwxrwx 0/0           11559 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/goroutine
    -rwxrwxrwx 0/0           84444 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/heap
    -rwxrwxrwx 0/0              26 1969-12-31 19:00 database/activities/error.txt
    -rwxrwxrwx 0/0              26 1969-12-31 19:00 database/row-counts/error.txt
    -rwxrwxrwx 0/0              26 1969-12-31 19:00 database/table-sizes/error.txt
    1.1M    dump.tgz
    

    We end up with data (and a long chain of error messages) even if we hit timeouts.

    opened by jrockway 1
  • dlock: add logging around lock acquisition and release


    It's often interesting to have information about when locks are acquired or lost, so this adds it around all uses of DLock. The actual calls to Lock/TryLock/Unlock are wrapped in a span, reporting how long it took to acquire or release the lock, and any errors that might have occurred. The time spent waiting for the lock is reported as the spanDuration on the DLock.Lock (etc.) span, and all messages that are logged using the returned context have a withLock and locked field, to make it clear where the context came from. (The lock timing spans also have a withLock field, but locked isn't set until the lock is actually acquired.)

    Here's what the chunk GC looks like starting up:

    (screenshot omitted)

    From this, we can see that we waited 21.86 seconds to take the lock, and that several GC runs have occurred while holding that lock. (If there was an error, that would also be logged.)

    The span only tracks time spent actually interacting with the locking machinery; the total time the lock was held is reported at the end though.

    When unlocking, we identify the lock by the prefix field instead of withLock. That's so that you can compare the two and see which context is being used to gate the unlocking operation vs. which lock is being unlocked.

    opened by jrockway 1
Releases: v2.5.0-alpha.2

Owner: Pachyderm (Containerized Data Analytics)
Related projects:

  • A website for the courses of the Computer Science major at NKU. (Sakura, 0 stars, Oct 6, 2022)
  • solarSystemOrbitalSimulation: a graphical orbital simulation of the solar-system planets with real values and physics, producing elliptical orbits; the timestep and scale can be changed in the source code. (Mega, 3 stars, Mar 3, 2022)
  • Data-Scrapping SEO: uses data-scraping tools and the Google Autocomplete API to gather relevant data points for different keywords so that search-engine results can be optimized; the marketing team can then target the top keywords to rank a company's website higher on a results page. (Vibhav Kumar Dixit, 2 stars, Jul 18, 2022)
  • Synthetic-Data-Replica-for-Healthcare: a tailored hands-on tutorial showing how to use Python to create synthetic data replicas from source healthcare datasets. (11 stars, Mar 22, 2022)
  • Advanced Python series: data classes, OOP, and working with Pydantic for data processing. (Phung Hưng Binh, 1 star, Nov 8, 2021)
  • A Python library for setting up projects using tabular data: it can create project folders, standardize delimiters, and convert files to CSV from individual files or a directory. (0 stars, Dec 13, 2022)
  • foamTEX: an open-source utility for creating publication-quality LaTeX figures from OpenFOAM data files. (1 star, Dec 19, 2021)
  • nfl_data_py: a Python library for working with NFL play-by-play data sourced from nflfastR, nfldata, dynastyprocess, and Draft Scout. (82 stars, Jan 5, 2023)
  • when-data: timezone mapping information for when, preprocessed from the GeoNames data and kept in a separate repository. (Armin Ronacher, 2 stars, Dec 7, 2021)
  • Deep AutoViML Pipeline for orchest.io: a quick tutorial showing how to build multiple deep learning models on your data with a single line of code. (Ram Seshadri, 6 stars, Oct 2, 2022)
  • Generates, filters, parses, and cleans data on the financial disclosures of judges in the American judicial system, using the Court Listener API. (Ali Rastegar, 2 stars, Aug 6, 2022)
  • SoccerData: a collection of wrappers that efficiently scrape soccer data from various sources, including Club Elo and ESPN. (Pieter Robberechts, 195 stars, Jan 4, 2023)
  • DataAnalysis: data analysis projects by charles_pikachu. (9 stars, Nov 4, 2022)
  • time-series-kafka-demo: a fully reproducible, Dockerized, step-by-step tutorial on mocking a "real-time" Kafka data stream from a timestamped CSV file, with a detailed blog post on Towards Data Science. (Maria Patterson, 26 stars, Nov 15, 2022)
  • Kedro: an open-source Python framework for creating reproducible, maintainable, and modular data science code. (QuantumBlack Labs, 7.9k stars, Jan 1, 2023)
  • ReproZip: a tool that simplifies creating reproducible experiments from command-line executions, a frequently used common denominator in computational science. (267 stars, Jan 1, 2023)
  • The RAP community of practice: all analysts and data scientists interested in adopting the working practices of reproducible analytical pipelines (RAP) at NHS Digital. (NHS Digital, 50 stars, Dec 22, 2022)
  • ckan. (3.6k stars, Dec 27, 2022)
  • Steppy: a lightweight, open-source Python 3 library for fast and reproducible experimentation. (minerva.ml, 134 stars, Jul 10, 2022)
  • garage: a toolkit for developing and evaluating reinforcement learning algorithms, with an accompanying library of state-of-the-art implementations. (Reinforcement Learning Working Group, 1.6k stars, Jan 9, 2023)