Reproducible Data Science at Scale!

Pachyderm

Last update: Dec 29, 2022

Related tags

Documentation go docker kubernetes distributed-systems data-science big-data analytics containers data-analysis pachyderm

Overview

Pachyderm: The Data Foundation for Machine Learning

Pachyderm provides the data layer that allows machine learning teams to productionize and scale their machine learning lifecycle. With Pachyderm’s industry-leading data versioning, pipelines, and lineage, teams gain data-driven automation, petabyte scalability, and end-to-end reproducibility. Teams using Pachyderm get their ML projects to market faster, lower data processing and storage costs, and can more easily meet regulatory compliance requirements.

Features

Automated Data Versioning: Pachyderm’s Data Versioning gives teams an automated and performant way to keep track of all data changes.
Data-Driven Pipelines: Pachyderm’s Containerized Pipelines speed data processing while lowering compute costs.
Immutable Data Lineage: Pachyderm’s data lineage provides an immutable record for all activities and assets in the ML lifecycle.
Console: The Pachyderm Console provides an intuitive visualization of your DAG (directed acyclic graph), and aids in reproducibility.
Notebooks: Pachyderm Notebooks provide an easy way to interact with Pachyderm data versioning and pipelines via Jupyter notebooks.

Getting Started

To start deploying your end-to-end version-controlled data pipelines, try us for free on Hub with little to no setup or run Pachyderm locally. You can also deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

Follow us on Twitter.
Join our community Slack Channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs, we would love to see what you do! You can also check our GH issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up-to-date, so if you don't see any, just let us know.

Join Us

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our open positions

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the env variable METRICS to false in the pachd container.

License Information

Pachyderm has moved some components of Pachyderm Platform to a source-available limited license.

We remain committed to the culture of open source, developing our product transparently and collaboratively with our community, and giving our community and customers source code access and the ability to study and change the software to suit their needs.

Under the Pachyderm Community License, you can access the source code and modify or redistribute it; there is only one thing you cannot do, and that is use it to make a competing offering.

Check out our License FAQ Page for more information.

Comments

pachd fails: panic: failed to initialize pach client: context deadline exceeded

What happened?:

Ran pachctl deploy to create an on-premises pachyderm cluster:

pachctl deploy custom --object-store s3 any-string 10 <bucket> <accesskey> <secretkey> rook-ceph-rgw-my-store.rook-ceph:80 --etcd-storage-class nfs-client --image-pull-secret boss-6000 --namespace pachyderm --dynamic-etcd-nodes 1

pachd is failing to start up and is reporting the following in the logs:

2019-12-16T18:46:13Z INFO no Jaeger collector found (JAEGER_COLLECTOR_SERVICE_HOST not set) 
2019-12-16T18:46:19Z WARNING TLS disabled: could not stat public cert at /pachd-tls-cert/tls.crt: stat /pachd-tls-cert/tls.crt: no such file or directory 
2019-12-16T18:46:19Z WARNING s3gateway TLS disabled: could not stat public cert at /pachd-tls-cert/tls.crt: stat /pachd-tls-cert/tls.crt: no such file or directory 
2019-12-16T18:46:20Z INFO validating kubernetes access returned no errors 
2019-12-16T18:46:49Z INFO error starting githook server context deadline exceeded 
 
panic: failed to initialize pach client: context deadline exceeded 
 
goroutine 492 [running]: 
github.com/pachyderm/pachyderm/src/server/pkg/serviceenv.(*ServiceEnv).GetPachClient(0xc00021f450, 0x2ad81a0, 0xc00053a2c0, 0xc00053a2c0) 
	src/github.com/pachyderm/pachyderm/src/server/pkg/serviceenv/service_env.go:171 +0x11a 
github.com/pachyderm/pachyderm/src/server/pps/server.(*apiServer).master.func1(0x0, 0x0) 
	src/github.com/pachyderm/pachyderm/src/server/pps/server/master.go:58 +0xe5 
github.com/pachyderm/pachyderm/src/server/pkg/backoff.RetryNotify(0xc00088c220, 0x2a99520, 0xc00061d6e0, 0xc0009fbfb8, 0x2a, 0xc00113f4c0) 
	src/github.com/pachyderm/pachyderm/src/server/pkg/backoff/retry.go:35 +0x4a 
github.com/pachyderm/pachyderm/src/server/pps/server.(*apiServer).master(0xc0002bcfc0) 
	src/github.com/pachyderm/pachyderm/src/server/pps/server/master.go:52 +0x20a 
created by github.com/pachyderm/pachyderm/src/server/pps/server.NewAPIServer 
	src/github.com/pachyderm/pachyderm/src/server/pps/server/server.go:67 +0x3d4 
panic: failed to initialize pach client: context deadline exceeded 
 
goroutine 513 [running]: 
github.com/pachyderm/pachyderm/src/server/pkg/serviceenv.(*ServiceEnv).GetPachClient(0xc00021f450, 0x2ad81e0, 0xc0000560d0, 0x7f01cf21a008) 
	src/github.com/pachyderm/pachyderm/src/server/pkg/serviceenv/service_env.go:171 +0x11a 
github.com/pachyderm/pachyderm/src/server/transaction/server.newAPIServer.func1(0xc001107260) 
	src/github.com/pachyderm/pachyderm/src/server/transaction/server/api_server.go:43 +0x48 
created by github.com/pachyderm/pachyderm/src/server/transaction/server.newAPIServer 
	src/github.com/pachyderm/pachyderm/src/server/transaction/server/api_server.go:43 +0x103

What you expected to happen?:

pachd should load successfully

How to reproduce it (as minimally and precisely as possible)?:

Anything else we need to know?:

Environment?:

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:23:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:13:49Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Pachyderm CLI and pachd server version (use pachctl version):

COMPONENT           VERSION
pachctl             1.9.9
pachd               1.9.9

Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s): on-premises Rancher 2.3.3 with 7 nodes
OS (e.g. from /etc/os-release): Ubuntu 18.04
Others:

opened by benwbooth 27

Stuck in "Pulling State" after delete-all
Hi. I have been working with Pachyderm for several days. Because there were many repos, I used delete-all to delete all of the files and run new experiments. Then I ran into new problems.

I used the 'fruit-stand' example to check whether it works. Although I performed the same operations as I did several days ago, the jobs have been stuck in the pulling state, as shown below, for several hours.

pachctl list-job
ID                                 OUTPUT                                    STARTED              DURATION   STATE
48521d8b6ff23c6ba66665ef6807d0e0   filter/665d9d51d3114a1e98c5bb7f432f8374   About a minute ago   -          pulling
649ddf50f993eafc56c5d205b7dd153f   filter/fbc0accb6d4f48cc9caec8de282d8b5a   About a minute ago   -          pulling

I searched the previous issues and found that this happens when Pachyderm cannot find the required image. So I suspect that when I ran `delete-all`, the Docker image was deleted as well.
opened by ShengjieLuo 27
Explore micro k8s as minikube alternative

Mostly a suggestion for documentation purposes. User reported easier installation path and MUCH better performance in dev cluster mode.

Not sure if there are other "gotchas" or limitations, but we've seen a few users asking about it so probably worth documenting.
feature request docs

opened by JoeyZwicker 25

start-kube-docker not working in Vagrant image

Trying to run Pachyderm in Vagrant using the Vagrantfile/init.sh in the Github documentation QUICKSTART.md. gcr.io/google_containers/hyperkube:v1.1.2 container does not start.

Steps to reproduce:

vagrant destroy # or download per README.md
vagrant up
vagrant ssh

go get github.com/pachyderm/pachyderm/...
cd ~/go/src/github.com/pachyderm/pachyderm
etc/kube/start-kube-docker.sh

~/pachyderm_vagrant$ vagrant version
Installed Version: 1.7.4
Latest Version: 1.7.4

You're running an up-to-date version of Vagrant!

Console log: kubeNotStarting.txt

opened by brinman2002 23

Error 'the server has asked for the client to provide credentials'

Hi. I deployed Pachyderm on another server. The installation was successful:

COMPONENT           VERSION
pachctl             1.1.0
pachd               1.1.0

However, when I start the fruit-stand pipeline, the jobs fail:

ID                                 OUTPUT                                     STARTED             DURATION             STATE
d6ddca2dfd72e6a0da7053ba5151b4cb   filter3/1dd4428dcc4d40359e7bcc3cdb594f3b   8 seconds ago       Less than a second   failure
2610bae2936923f0ce850c04f2cedad3   filter2/d58ac2a1231d4db4bb2487554bf36273   25 minutes ago      Less than a second   failure
dfe6acbbcd241d55a394a95077df5d1e   filter/a4934ebe280c4e2cae2e6cfb4b1c4c04    2 hours ago         Less than a second   failure
e9a26e0594f6bd00bacefa33c1b9850a   filter/78a4cea8361c41f3a9a2c8e2f9679bb0    2 hours ago         Less than a second   failure

I tried it several times, but every run failed.

Here is the pipeline information:

NAME                INPUT               OUTPUT              STATE
filter              data                filter              running
sum3                filter2             sum3                running
filter2             data                filter2             running
sum2                filter2             sum2                running
filter3             data                filter3             running
sum                 filter              sum                 running

I checked the log for the problem:

pachctl get-logs d6ddca2dfd72e6a0da7053ba5151b4cb
the server has asked for the client to provide credentials (get pods)

I can delete the repo in this pipeline, but I can't delete the pipeline itself:

pachctl delete-pipeline filter3
error from DeletePipeline: the server has asked for the client to provide credentials (get pods)

I searched the source code for this message and found it here:

src/server/vendor/k8s.io/kubernetes/pkg/api/errors/errors.go 
case http.StatusUnauthorized:
        reason = unversioned.StatusReasonUnauthorized
        message = "the server has asked for the client to provide credentials"

Unfortunately, I have hit so many problems these days... I have to ask about problems every day...

opened by ShengjieLuo 21

Get Pachyderm Working with OpenShift

Per our discussions, it looks like it may be a privilege issue or something similar. I have attached the steps to install the OpenShift vagrant image and then how to deploy pachyderm. You can troubleshoot through normal Kube commands as well. Everything is created and running, except the pachd pods do not start.

https://gist.github.com/munchee13/8cf64f2c1797d1d60891b28a193767f6

opened by munchee13 21
Improve Custom/On-Premise Docs
There's been a lot of interest in on-prem clusters lately, and our docs on the subject aren't very good. The main issues that people have been hitting are:

When to use the custom deploy

Confusion around how to use the custom deploy for cloud vs. on prem deploys

How to modify the manifest for on prem deploys

What needs to be changed in the deploy process for OpenShift, OpenStack, and other systems.

NotExist error with CEPH S3 interface deploy

I think the following will probably make the process better for on prem users:

Use the Helm chart as default for on prem deploys. Our users seem to indicate in Slack that this is much easier.

Update the custom/on-prem docs to emphasize custom object stores in the cloud vs. completely on-prem solutions.

Test the custom deploy commands and the Helm chart to see if updates are needed for the latest Pachyderm versions.

docs openshift size: XL priority: high
opened by dwhitena 20
Add support for minio deploy

This adds support for Minio and all other S3-compatible servers. This patch also uses minio-go, which has an added benefit: the same client can be used with S3 transparently.

opened by harshavardhana 20
pipelines stalling

I have a pachyderm repo with a few commits in it (each on a separate branch). Each commit is about 100MB in size. The processing step begins and starts outputting data into the next pipeline but stops working after about 1MB of data output. If I force finish the commit and inspect the output, I see that only some of the expected output was generated and that some of the files list the Unix epoch as their creation date some of the time. Moreover, the pipeline takes an exceptionally long time to run compared to running it outside of Pachyderm.

opened by JonathanFraser 20
Contexts
This introduces a new, backwards-incompatible version of configs, a migration to update old configs, and the related behavioral changes to pachctl from the new config.

Closes #3774 Fixes #3538 Closes #3036 Fixes #3419

Contexts

The largest addition is contexts, which are akin to kubectl contexts. Instead of building off of config V1 and having a different config file for each context (as proposed in #3774), this stores all contexts in a single config file. The reasons for this:

The hope with multiple files was that we could save effort by just building off of config V1. I no longer think that is the case, because bolting contexts on top of that design appears to be just as much work, if not more.

This design avoids ambiguity when a user sets PACH_CONFIG that the multiple config file design has.

It more closely follows k8s' approach.

Contexts have the same fields as config V1, plus a context source field, which specifies where contexts came from.

Active context

Config V2 contains a reference to the currently active context. The active context can be overridden via the env var PACH_CONTEXT.
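The override rule can be sketched in Go as follows. This is a minimal illustration and not pachctl's actual code: the `Config` struct, the `ResolveActiveContext` function, and the example addresses are all hypothetical. Only the `PACH_CONTEXT` variable name comes from the description above.

```go
package main

import (
	"fmt"
	"os"
)

// Config is a hypothetical, simplified stand-in for a V2 config file.
type Config struct {
	ActiveContext string
	Contexts      map[string]string // context name -> pachd address (illustrative)
}

// ResolveActiveContext returns the name of the context to use. A non-empty
// PACH_CONTEXT value (read through getenv) takes precedence over the
// active context stored in the config.
func ResolveActiveContext(cfg Config, getenv func(string) string) (string, error) {
	name := cfg.ActiveContext
	if override := getenv("PACH_CONTEXT"); override != "" {
		name = override
	}
	if _, ok := cfg.Contexts[name]; !ok {
		return "", fmt.Errorf("context %q does not exist", name)
	}
	return name, nil
}

func main() {
	cfg := Config{
		ActiveContext: "local",
		Contexts: map[string]string{
			"local":   "localhost:30650",
			"staging": "staging.example.com:30650",
		},
	}
	name, err := ResolveActiveContext(cfg, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("active context:", name)
}
```

Passing `getenv` as a parameter keeps the lookup easy to test without touching the process environment.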

Metrics

Rather than having a global flag --no-metrics that needs to be passed in for each call to pachctl, the ability to disable metrics is now specified in the config.

Config implementation changes

Configs are now read only once per run of pachctl. Before, it was read multiple times, which allowed for subtle bugs (e.g. if the user ID wasn't yet set, it would be reset multiple times.) I did not add any further locking to ensure changes can't overwrite each other. Given the current uses of configs (which are predominantly read-only), I think this is safe enough for now, but it is certainly not bulletproof!
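One common way to get read-once behaviour in Go is `sync.Once`. The sketch below is an assumption about how such caching could look, not the code from this PR; `Config`, `loadFromDisk`, and `Read` are hypothetical names, and the loader returns a fixed value instead of parsing a file.

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a hypothetical, simplified config value.
type Config struct {
	UserID string
}

var (
	once      sync.Once
	cached    *Config
	cachedErr error
	loads     int // counts how often the loader really ran
)

// loadFromDisk stands in for parsing the config file.
func loadFromDisk() (*Config, error) {
	loads++
	return &Config{UserID: "example-user"}, nil
}

// Read returns the config, loading it at most once per process.
// Later calls get the cached value, so a field such as the user ID
// cannot be regenerated by a second read.
func Read() (*Config, error) {
	once.Do(func() {
		cached, cachedErr = loadFromDisk()
	})
	return cached, cachedErr
}

func main() {
	a, _ := Read()
	b, _ := Read()
	fmt.Println("same instance:", a == b, "loads:", loads)
}
```

As the description notes, this guards only against repeated reads within one run; it does nothing about two processes writing the file at the same time.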

Migrations

A migration is run the first time the config is read. The V1 config is turned into a context.

Deployments

When a (non-dry-run) deployment succeeds, a new pach context is created, and the user is automatically switched to the new context.

New commands

pachctl config get metrics - gets whether metrics are enabled

pachctl config set metrics (true|false) - sets whether metrics are enabled

pachctl config get active-context - gets the active context

pachctl config set active-context [name] - sets the active context

pachctl config get context [name] - gets a context config JSON by name

pachctl config set context [name] [--overwrite] - sets a context config from JSON stdin

pachctl config update context [name] --pachd-address=[address] - updates the pachd address of an existing context (this is the only field that is updatable without completely overwriting a context, at the moment.)

pachctl config delete context [name] - removes a context

pachctl config list context - lists all contexts

Removals

The global --no-metrics and --no-port-forwarding flags were removed in favor of config values.
opened by ysimonson 19
Better documentation for cluster API ingress
It seems a lot of user questions center around how API ingress works. Didn't find docs about this, so just wrote this up for a customer. Should probably be migrated into our docs if we don't have something like this already

To clarify about ingress/NodePorts: pachyderm doesn't really care how a user gets access to it, so long as their local pachctl client can talk to the pachd pod. pachctl has built-in support for two different methods: setting the PACHD_ADDRESS, and using pachctl port-forward

Setting the PACHD_ADDRESS env var to point at a host:port that directs traffic to the pachd pod tells pachctl to just talk to that endpoint directly. This is the flow NodePort supports -- it makes the internal pachd pod's API port accessible on the cluster's external address, so that users can set PACHD_ADDRESS=cluster-address:30650 (30650 is the default), and the k8s/OC cluster will send that traffic to the pachd pod. Because this port is a global resource at the k8s/OC cluster level, it needs to be unique per pachyderm cluster. But it can be changed to whatever you want for a given pachyderm deployment, and shouldn't affect pachyderm's internal operation (so long as the user of pachctl has the right value in their PACHD_ADDRESS variable).

~~pachctl port-forward piggy-backs on kubectl to fetch the name of the pachd pod within the k8s cluster/namespace the user is currently connected to, and then runs kubectl port-forward to direct traffic from the user's local machine to pachd via the k8s API. In this case, setting the PACHD_ADDRESS variable isn't needed, but the user needs to have k8s access set up, pointing at the namespace for their pachyderm cluster~~

Edited, based on @ysimonson 's suggestion:

pachctl port-forward piggy-backs off kubectl's config file and client API -- it reads kubectl's config file to fetch the name of the pachd pod within the k8s cluster/namespace the user is currently connected to, then uses the kubernetes API to effectively run kubectl port-forward. This directs traffic from the user's local machine to pachd via the k8s API. If using pachctl port-forward, then setting the PACHD_ADDRESS variable isn't needed. Instead, the user needs to have k8s access set up, pointing at the namespace for their pachyderm cluster

As of 1.8.3 pachctl port-forward happens automatically when running any pachctl command that tries to access pachd, but in order to open a persistent tunnel to a number of other ports pachyderm uses (the dashboard, git and auth hooks, the built-in HTTP file API, etc) users will still need to run pachctl port-forward explicitly

also, it seems pachctl port-forward isn't working with openshift, but the following oc port-forward command does effectively the same thing:

PACHD_POD_NAME=`oc get pod --output=json | jq -r '.items[] | select(.metadata.name|startswith("pachd")).metadata.name'` # -r flag is needed to not get quotes in the output
oc port-forward pod/$PACHD_POD_NAME 30650:1650
docs openshift size: L priority: high solutions-architecture
opened by gabrielgrant 19
Can't run pachctl on WSL2
What happened?:

Following the local pachyderm instructions (running on WSL2 / 20.04):

Install homebrew and run the Next steps

tested everything works via brew install hello

install pachctl via brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.4

Trying to run pachctl, gives the following message:

pachctl
zsh: permission denied: pachctl
# same when running it via bash:
/bin/bash pachctl
/home/linuxbrew/.linuxbrew/bin/pachctl: /home/linuxbrew/.linuxbrew/bin/pachctl: cannot execute binary file

What you expected to happen?:

Not get the permission denied: pachctl message.

How to reproduce it (as minimally and precisely as possible)?:

# run the next steps as recommended from the following command too
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.4
pachctl

Anything else we need to know?: Installing other packages via brew seems to work, so I don't think its a homebrew issue. (E.g. I can finish the local deploy guide, including the helm install via brew)

Environment?:

Kubernetes version (use kubectl version):

Pachyderm CLI and pachd server version (use pachctl version):

Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s):

If you deployed with helm, the values you used (helm get values pachyderm):

OS (e.g. from /etc/os-release):

Others:

This is on WSL2 .

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

The permissions for linuxbrew look all right:

ls -lhA /home/linuxbrew/.linuxbrew/bin/pachctl
lrwxrwxrwx 1 mheiser mheiser 40 Dec 27 09:14 /home/linuxbrew/.linuxbrew/bin/pachctl -> ../Cellar/pachctl@2.4/v2.4.2/bin/pachctl
bug
opened by Persedes 6
[2.4.x backport][Jupyter] Fix datums-related error message when notebooks starts up

Make datums request at notebooks startup only when connected to a cluster and logged in if auth enabled (when mount server response is 200).

Ticket: https://linear.app/pachyderm/issue/INT-760/error-message-when-starting-up-notebooks

JIRA: INT-782

opened by smalyala 1
Warn about outdated pachctl
This implements some compatibility checking between pachctl (actually any Go client) and pachd.

Compatibility is defined as:

Either the client or the server is 0.0.0, as developer builds are.

If either the client or the server are a pre-release (nightly/alpha/beta/rc), then the server and client versions have to be an exact match.

Otherwise, the major and minor number have to be the same.

See the test cases in version_test.go.
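The three rules above can be sketched in Go as follows. This is an illustration only: the `Version` struct and the `Compatible` function are hypothetical names, not the PR's real types, and version_test.go remains the authoritative set of cases.

```go
package main

import "fmt"

// Version is a hypothetical, simplified version value. PreRelease is
// non-empty for nightly/alpha/beta/rc builds.
type Version struct {
	Major, Minor, Patch int
	PreRelease          string
}

// isDev reports whether v is 0.0.0, as developer builds are.
func isDev(v Version) bool {
	return v.Major == 0 && v.Minor == 0 && v.Patch == 0
}

// Compatible applies the three rules in order:
//  1. a developer build on either side is always accepted;
//  2. if either side is a pre-release, both versions must match exactly;
//  3. otherwise the major and minor numbers must match.
func Compatible(client, server Version) bool {
	if isDev(client) || isDev(server) {
		return true
	}
	if client.PreRelease != "" || server.PreRelease != "" {
		return client == server
	}
	return client.Major == server.Major && client.Minor == server.Minor
}

func main() {
	released := Version{Major: 2, Minor: 5, Patch: 0}
	nightly := Version{Major: 2, Minor: 5, Patch: 0, PreRelease: "nightly.1"}
	fmt.Println("released client vs nightly server:", Compatible(released, nightly))
	fmt.Println("2.5.0 client vs 2.5.3 server:", Compatible(released, Version{Major: 2, Minor: 5, Patch: 3}))
}
```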

The client always calls InspectCluster before connecting, so this PR modifies InspectCluster to take the client's version as a parameter. The server then checks that version against its own version, and returns warnings in the response. There is also a version_warnings_ok flag. Old servers won't set that (since it's not in the message version they have), so the client can detect a way-too-old server. If there are any warnings set, the client will log them at level error. Technically this would be intrusive to users of the go client, but since the pctx.TODO() logger points to no-op logger until someone calls InitPachctlLogger() and that's a symbol they can't import, only pachctl users will ever see this.

It is a little weird to change InspectCluster from taking Empty to taking a message type, but it seems perfectly safe to me. The Go client API doesn't change; only people that directly generate stubs and call methods on it (like some of our tests) are affected. Users of the Go client that want to send a version have the option to do so with a new function in the client, InspectClusterWithVersion.

The server logs at INFO level whenever an incompatible client is detected, so even if users miss the warnings, administrators can know.

Here's an example of what it looks like in pachctl (runs for every command, can't be turned off). In this particular case, a "released" client is talking to a nightly build, which requires an exact version match between client and server:

And the server logs:

The server log is the same for every case (modulo the error field), but the client message varies based on the constants in admin/api_server.go. Feel free to wordsmith them.

Annoyingly, we don't seem to send an auth token with InspectCluster, so the server can't report the user name that is using an out of date client. We should probably do something about that.
opened by jrockway 1
Increase reliability of debug dumps
Fixes CORE-1193 and CORE-1294.

This PR does a bunch of stuff to make debug dumps more reliable, at least without burning the whole thing down and starting over.

pachctl debug dump can now specify a timeout; it defaults to 30m.

The timeout is adjusted down on the server side to about 90% of the client timeout. That means the debug dumper has some time to handle context deadline exceeded and start producing output before the RPC is totally aborted. I've had good results with timeouts as low as 100ms; you don't get everything, but you get some files. At 30m it should be Really Good (tm).

Every multi-step operation that the dumper does now continues in the face of errors, if the error doesn't affect the next thing. Every for loop or function that does two+ things now uses multierr.Append to collect all the errors. That means if we hit an issue where we try to do something silly like InspectPipeline an input repo, we just continue doing the whole debug dump anyway. At the very end, an error will be returned, but we can still write all the other files.

I fixed the thing where we did InspectPipeline on an input repo; there was a missing continue statement. I also tried to fix PPS's error message for a pipeline not being found, but it's actually not relevant to this PR. (I don't think the code can ever hit the case I "fixed", but in case it does, hey now the error type is correct. We still don't return grpc.status = NotFound from PPS under any of these circumstances though.)

I added some arbitrary timeouts around things I don't think will be too slow, like we did for Loki.

I noticed that the Pod Describer from the k8s library can't take a context. That means it could run forever, so I put it in a background goroutine; the foreground goroutine tries to get its output until the context expires, and then it just abandons it and moves on. This will leak memory if it runs forever, but hey, after we review the debug dump we'll probably tell you to restart pachd anyway. In the future we'll have to just collect pod YAMLs instead of "describe" output. Or fork k8s.io/client-go to make the silly thing take a context.

As an example, here's what a run with an aggressive timeout looks like now:

$ rm dump.tgz; /usr/bin/time pachctl debug dump dump.tgz --timeout=1s ; tar tzvf dump.tgz; du -h dump.tgz
rpc error: code = Unknown desc = listPipelines: context deadline exceeded; appLogs: context deadline exceeded; collectDatabaseDump: collectDatabaseTables: list tables: context deadline exceeded
Command exited with non-zero status 1
0.09user 0.02system 0:01.04elapsed 11%CPU (0avgtext+0avgdata 66060maxresident)k
0inputs+2176outputs (0major+2005minor)pagefaults 0swaps
-rwxrwxrwx 0/0            6214 1969-12-31 19:00 source-repos/default/benchmark-upload/commits.json
-rwxrwxrwx 0/0           52020 1969-12-31 19:00 source-repos/default/benchmark-upload/commits-chart.png
-rwxrwxrwx 0/0            8083 1969-12-31 19:00 source-repos/default/images/commits.json
-rwxrwxrwx 0/0           45961 1969-12-31 19:00 source-repos/default/images/commits-chart.png
-rwxrwxrwx 0/0              17 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/version.txt
-rwxrwxrwx 0/0            7612 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/describe.txt
-rwxrwxrwx 0/0         8690350 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs.txt
-rwxrwxrwx 0/0              80 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-previous/error.txt
-rwxrwxrwx 0/0         8640042 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/logs-loki.txt
-rwxrwxrwx 0/0           22422 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/go_info.txt
-rwxrwxrwx 0/0           11559 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/goroutine
-rwxrwxrwx 0/0           84444 1969-12-31 19:00 pachd/pachd-84f6794987-74hf2/pachd/heap
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/activities/error.txt
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/row-counts/error.txt
-rwxrwxrwx 0/0              26 1969-12-31 19:00 database/table-sizes/error.txt
1.1M    dump.tgz

We end up with data (and a long chain of error messages) even if we hit timeouts.
opened by jrockway 1
dlock: add logging around lock acquisition and release

It's often interesting to have information about when locks are acquired or lost, so this adds it around all uses of DLock. The actual calls to Lock/TryLock/Unlock are wrapped in a span, reporting how long it took to acquire or release the lock, and any errors that might have occurred. The time spent waiting for the lock is reported as the spanDuration on the DLock.Lock (etc.) span, and all messages that are logged using the returned context have a withLock and locked field, to make it clear where the context came from. (The lock timing spans also have a withLock field, but locked isn't set until the lock is actually acquired.)

Here's what the chunk GC looks like starting up:

From this, we can see that we waited 21.86 seconds to take the lock, and that several GC runs have occurred while holding that lock. (If there was an error, that would also be logged.)

The span only tracks time spent actually interacting with the locking machinery; the total time the lock was held is reported at the end though.

When unlocking, we identify the lock by the prefix field instead of withLock. That's so that you can compare the two and see which context is being used to gate the unlocking operation vs. which lock is being unlocked.

opened by jrockway 1