An example of repository data as bundles

Bundles

This repository is just an example of how we can host Git bundles in a way that supports fetching data from precomputed bundles without the origin server needing to manage those bundles.

This repository is mirrored as an Azure Static Web App at https://nice-ocean-0f3ec7d10.azurestaticapps.net.

This repository contains a set of bundles that capture the master branch of the git/git repository at different points in time throughout October 2021.

Proposal for fetching bundles

Git clients can fetch a "table of contents" from some predetermined URL, such as https://nice-ocean-0f3ec7d10.azurestaticapps.net/bundles.json hosted by this repository.

This URL serves a JSON list of objects, each with a few known members:

  • uri (required): the URI of the bundle being referenced.
  • timestamp: the time associated with the bundle at this URI. Clients compare it against their stored bundle.latestTimestamp value to skip bundles they already have.
  • requires: if this bundle is not closed under reachability (and might contain thin packs), the uri of the "previous" bundle that contains the earlier set of objects. (This assumes that the bundles can be ordered linearly.)
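
As an illustration of the format described above, here is a minimal sketch of a client parsing such a table of contents. It is not the code in fetch.py. The sample URIs and numeric timestamps are made up, and treating a missing timestamp as 0 is an assumption.

```python
import json
from typing import Any, Dict, List


def parse_table_of_contents(text: str) -> List[Dict[str, Any]]:
    """Parse a bundle table of contents and return its entries, oldest first."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("table of contents must be a JSON list")
    for entry in entries:
        # "uri" is the only member the format marks as required.
        if not isinstance(entry, dict) or "uri" not in entry:
            raise ValueError("every entry needs a 'uri' member")
    # Assumption: entries without a timestamp are treated as the oldest.
    return sorted(entries, key=lambda e: e.get("timestamp", 0))


# Hypothetical sample data, not the contents of the real bundles.json.
SAMPLE = """[
  {"uri": "https://example.com/b.bundle", "timestamp": 200,
   "requires": "https://example.com/a.bundle"},
  {"uri": "https://example.com/a.bundle", "timestamp": 100}
]"""

toc = parse_table_of_contents(SAMPLE)
for entry in toc:
    print(entry["uri"], entry.get("timestamp"), entry.get("requires"))
```

Sorting oldest first means a bundle is downloaded after the one it requires, under the linear-ordering assumption stated above.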

Cloning

The clone.sh script shows how we can create a new repository using these bundles. After initializing a new repository, we can use fetch.py to download all of the bundles in the JSON list. We then add the origin remote and fetch the remaining data from that remote.

stolee@stolee-linux-metal:/_git$ GIT_TRACE2_PERF=/_git/trace2.txt /_git/bundles/clone.sh https://github.com/git/git git-bundle-test
Initialized empty Git repository in /_git/git-bundle-test/.git/
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-01.bundle to .git/bundles/0.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-4.bundle to .git/bundles/1.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-7.bundle to .git/bundles/2.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-12.bundle to .git/bundles/3.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-13.bundle to .git/bundles/4.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-14.bundle to .git/bundles/5.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-15.bundle to .git/bundles/6.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-19.bundle to .git/bundles/7.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-26.bundle to .git/bundles/8.bundle
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at af6d1d602a Git 2.33.1

The trace2 logs for this run are available as trace2.txt, so you can see how small the git fetch origin portion of clone.sh is.

stolee@stolee-linux-metal:/_git$ cd git-bundle-test/
stolee@stolee-linux-metal:/_git/git-bundle-test$ git branch -v
* (HEAD detached at FETCH_HEAD) af6d1d602a Git 2.33.1
  refs/bundles/2021-10-01       cefe983a32 The ninth batch
  refs/bundles/2021-10-12       2a97289ad8 Twelfth batch
  refs/bundles/2021-10-13       2bd2f258f4 Sync with Git 2.33.1
  refs/bundles/2021-10-14       9875c51553 Merge branch 'ja/doc-status-types-and-copies'
  refs/bundles/2021-10-15       f443b226ca Thirteenth batch
  refs/bundles/2021-10-19       9d530dc002 The fourteenth batch
  refs/bundles/2021-10-26       e9e5ba39a7 The fifteenth batch
  refs/bundles/2021-10-4        0785eb7698 The tenth batch
  refs/bundles/2021-10-7        106298f7f9 The eleventh batch

stolee@stolee-linux-metal:/_git/git-bundle-test$ ls .git/objects/pack/
stolee@stolee-linux-metal:/_git/git-bundle-test$ ls -al .git/objects/pack/
total 241064
drwxrwxr-x 2 stolee stolee      4096 Oct 28 11:52 .
drwxrwxr-x 4 stolee stolee      4096 Oct 28 11:52 ..
-rw-rw-r-- 1 stolee stolee   8877836 Oct 28 11:52 multi-pack-index
-r--r--r-- 1 stolee stolee     18152 Oct 28 11:52 pack-0de3636531b9ce15eae60de09224e8a62d9d0a4c.idx
-r--r--r-- 1 stolee stolee   1515581 Oct 28 11:52 pack-0de3636531b9ce15eae60de09224e8a62d9d0a4c.pack
-r--r--r-- 1 stolee stolee      9612 Oct 28 11:52 pack-1938b2e1527f7167687ee27e18951aac9a0baed1.idx
-r--r--r-- 1 stolee stolee    849728 Oct 28 11:52 pack-1938b2e1527f7167687ee27e18951aac9a0baed1.pack
-r--r--r-- 1 stolee stolee   8514836 Oct 28 11:52 pack-3174045eb5b62a6749b1daf60c0acfe8fda0facc.idx
-r--r--r-- 1 stolee stolee 100176426 Oct 28 11:52 pack-3174045eb5b62a6749b1daf60c0acfe8fda0facc.pack
-r--r--r-- 1 stolee stolee    298880 Oct 28 11:52 pack-43362f7e98023f4698ac7c3ace1f739616212d34.idx
-r--r--r-- 1 stolee stolee  11376553 Oct 28 11:52 pack-43362f7e98023f4698ac7c3ace1f739616212d34.pack
-r--r--r-- 1 stolee stolee     10928 Oct 28 11:52 pack-67d22f7b765041b551444e1c21c5950b3e9392d8.idx
-r--r--r-- 1 stolee stolee   1231140 Oct 28 11:52 pack-67d22f7b765041b551444e1c21c5950b3e9392d8.pack
-r--r--r-- 1 stolee stolee     27756 Oct 28 11:52 pack-6ab2c38b678cf338a9fa0cf2faf65653ef00f1cb.idx
-r--r--r-- 1 stolee stolee   1942093 Oct 28 11:52 pack-6ab2c38b678cf338a9fa0cf2faf65653ef00f1cb.pack
-r--r--r-- 1 stolee stolee      9780 Oct 28 11:52 pack-8271f33d606a5ab8804c97a1135f441a1c2ca361.idx
-r--r--r-- 1 stolee stolee    517529 Oct 28 11:52 pack-8271f33d606a5ab8804c97a1135f441a1c2ca361.pack
-r--r--r-- 1 stolee stolee     15324 Oct 28 11:52 pack-937b1699b65fd2cacbd9bc119b09fb05fd1a685c.idx
-r--r--r-- 1 stolee stolee   1166484 Oct 28 11:52 pack-937b1699b65fd2cacbd9bc119b09fb05fd1a685c.pack
-r--r--r-- 1 stolee stolee     14428 Oct 28 11:52 pack-98e8a35d1a2ad91a56b29b5b3e60182ca7dcbdaa.idx
-r--r--r-- 1 stolee stolee   1082390 Oct 28 11:52 pack-98e8a35d1a2ad91a56b29b5b3e60182ca7dcbdaa.pack
-r--r--r-- 1 stolee stolee   8499240 Oct 28 11:52 pack-b805e409cb3ed85b98e4c58697e33e1027f367a7.idx
-r--r--r-- 1 stolee stolee 100595382 Oct 28 11:52 pack-b805e409cb3ed85b98e4c58697e33e1027f367a7.pack
-r--r--r-- 1 stolee stolee      1324 Oct 28 11:52 pack-f58f8c9ebfd3fdfa41a79f6558bc5122019778d7.idx
-r--r--r-- 1 stolee stolee     37462 Oct 28 11:52 pack-f58f8c9ebfd3fdfa41a79f6558bc5122019778d7.pack
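
The cloning flow described in this section can be sketched as a sequence of Git commands. This is a hedged outline, not the contents of clone.sh: the function name is hypothetical, the refspec is only one assumed way to map bundle refs into refs/bundles/, and the final checkout is inferred from the "switching to 'FETCH_HEAD'" message in the log above. The sketch only builds and prints the commands; it does not run them.

```python
from typing import List


def clone_commands(origin_url: str, directory: str,
                   bundle_paths: List[str]) -> List[List[str]]:
    """Build the Git commands for a bundle-assisted clone (hypothetical helper)."""
    cmds = [["git", "init", directory]]
    for path in bundle_paths:
        # Fetch from each downloaded bundle file, oldest first.
        # Assumption: this refspec places the bundle refs under refs/bundles/.
        cmds.append(["git", "-C", directory, "fetch", path,
                     "+refs/heads/refs/bundles/*:refs/bundles/*"])
    # Only after the bundles are unpacked do we talk to the origin server,
    # so the origin has little left to send.
    cmds.append(["git", "-C", directory, "remote", "add", "origin", origin_url])
    cmds.append(["git", "-C", directory, "fetch", "origin"])
    cmds.append(["git", "-C", directory, "checkout", "FETCH_HEAD"])
    return cmds


for cmd in clone_commands("https://github.com/git/git", "git-bundle-test",
                          [".git/bundles/0.bundle", ".git/bundles/1.bundle"]):
    print(" ".join(cmd))
```

Because `-C` changes the working directory before the command runs, the relative bundle paths are resolved inside the new repository.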

Fetching

As we download and store the bundles from the list of URIs, we update the bundle.latestTimestamp config value. This allows us to reexamine the table of contents and only download the bundles that are newer than that timestamp.

(If the table of contents has changed so that our previously downloaded bundles are no longer in the list, we could in principle follow the requires members, downloading bundles until no required objects are missing. This is not implemented in fetch.py.)
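
One possible shape for that unimplemented fallback is sketched below. It assumes the linear ordering mentioned earlier, starts from the newest bundle, and follows requires until it reaches a bundle the client already has. The function name and sample data are hypothetical.

```python
from typing import Any, Dict, List, Set


def bundles_to_close(toc: List[Dict[str, Any]], have: Set[str]) -> List[str]:
    """Return the URIs to download, oldest first, to satisfy the requires chain."""
    if not toc:
        return []
    by_uri = {entry["uri"]: entry for entry in toc}
    newest = max(toc, key=lambda e: e.get("timestamp", 0))
    needed: List[str] = []
    seen: Set[str] = set()
    uri = newest["uri"]
    # Walk backwards until we hit a bundle we already have or the chain ends.
    while uri is not None and uri not in have:
        if uri in seen:
            raise ValueError("cycle in 'requires' chain at " + uri)
        if uri not in by_uri:
            raise ValueError("required bundle is not listed: " + uri)
        seen.add(uri)
        needed.append(uri)
        uri = by_uri[uri].get("requires")
    needed.reverse()  # download prerequisites before the bundles that need them
    return needed


# Hypothetical table of contents: a <- b <- c.
TOC = [
    {"uri": "a", "timestamp": 1},
    {"uri": "b", "timestamp": 2, "requires": "a"},
    {"uri": "c", "timestamp": 3, "requires": "b"},
]
print(bundles_to_close(TOC, have={"a"}))
```

A bundle with no requires member ends the walk, which matches a first bundle that is closed under reachability.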

Here is a test of the idea by manually modifying bundle.latestTimestamp:

stolee@stolee-linux-metal:/_git/git-bundle-test$ git config --replace-all bundle.latestTimestamp 1634072372
stolee@stolee-linux-metal:/_git/git-bundle-test$ git config --local --list
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
bundle.latesttimestamp=1634072372
remote.origin.url=https://github.com/git/git
remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
stolee@stolee-linux-metal:/_git/git-bundle-test$ /_git/bundles/fetch.py
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-14.bundle to .git/bundles/0.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-15.bundle to .git/bundles/1.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-19.bundle to .git/bundles/2.bundle
Downloading https://nice-ocean-0f3ec7d10.azurestaticapps.net/2021-10-26.bundle to .git/bundles/3.bundle
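
The selection step behind this incremental fetch can be sketched as follows. This is an assumed reading of "newer than that timestamp" as a strict comparison, not the logic of fetch.py. The function name and sample timestamps are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def select_new_bundles(toc: List[Dict[str, Any]],
                       latest_timestamp: int) -> Tuple[List[str], int]:
    """Return (URIs to download, oldest first) and the new latest timestamp."""
    newer = sorted(
        (e for e in toc if e.get("timestamp", 0) > latest_timestamp),
        key=lambda e: e.get("timestamp", 0),
    )
    # The stored value only moves forward, and only past bundles we selected.
    new_latest = max([latest_timestamp] + [e.get("timestamp", 0) for e in newer])
    return [e["uri"] for e in newer], new_latest


# Hypothetical table of contents with made-up timestamps.
TOC = [
    {"uri": "day-1.bundle", "timestamp": 100},
    {"uri": "day-2.bundle", "timestamp": 200},
    {"uri": "day-3.bundle", "timestamp": 300},
]
uris, new_latest = select_new_bundles(TOC, latest_timestamp=100)
print(uris, new_latest)
```

After the downloads succeed, a client would store the returned value back in the bundle.latestTimestamp config setting, for example with `git config bundle.latestTimestamp <value>`.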

Benefits over server-declared URIs

  1. The organization of the bundles is completely separate from the origin server. The bundle server can reorganize as needed without communicating with the origin server.

  2. The bundle server can be completely independent of the origin. If a company wants to create a local bundle cache, then users can point to it through client-side configuration instead of needing to communicate through the origin server.

  3. We can extend the server capabilities to advertise a number of bundle caches, and let the client pick their favorite one. This can present ways to optimize for network latency before committing to a download.
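
To make the third point concrete, here is a minimal sketch of a client choosing among several advertised bundle caches by connection latency. Nothing like this exists in the repository: the function names are hypothetical, timing a TCP connection is just one assumed latency measure, and the demo uses fake latencies so it does not touch the network.

```python
import socket
import time
from typing import Callable, Iterable, Optional


def measure_connect_time(host: str, port: int = 443,
                         timeout: float = 2.0) -> Optional[float]:
    """Time a TCP connection to host:port; return None if it cannot connect."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def pick_closest(hosts: Iterable[str],
                 measure: Callable[[str], Optional[float]] = measure_connect_time
                 ) -> Optional[str]:
    """Return the reachable host with the lowest measured latency, if any."""
    best_host, best_time = None, None
    for host in hosts:
        elapsed = measure(host)
        if elapsed is None:
            continue  # unreachable caches are skipped
        if best_time is None or elapsed < best_time:
            best_host, best_time = host, elapsed
    return best_host


# Demo with fake latencies (hypothetical hosts) instead of real network calls.
FAKE = {"cache-eu.example.com": 0.120, "cache-us.example.com": 0.035,
        "cache-down.example.com": None}
print(pick_closest(FAKE, measure=FAKE.get))
```

Passing the measurement function as a parameter keeps the selection logic testable and lets a client swap in a different probe.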

Things not covered in this proposal

  • We don't have a way to authenticate to the bundles. The table of contents and the bundles themselves could be under some form of authentication that is not covered here. We would want to extend the standard to handle auth appropriately, probably through a credential helper.

  • We don't consider encrypted bundles. It is likely possible to extend the table of contents with information about each bundle being encrypted with some public key, allowing future clients to understand that option and do the right thing. Extensions like this are obviously possible with the JSON format (as opposed to a custom format that might cause accidental restrictions).

Details specific to this implementation

  • The bundles attempt to store refs as refs/bundles/<X>, but somehow the bundles end up putting the refs as refs/heads/refs/bundles/<X>. To avoid polluting refs/remotes/ or other refspaces, the refs/heads/ is stripped out in these cases. The ref space could be very flexible, depending on how the bundle organizer designs it.

  • The first bundle is big: it includes all data in master from around 30 days ago. The rest pick up daily updates (if master moved in that time). This layout could shift over time, and I would expect the bundle maintenance to merge the oldest two bundles after generating a new, "latest" bundle.

  • These bundles only care about master, but they could be a full snapshot of refs/heads/. They could also contain all of the tags, if we wanted. (Tags would not want to be hidden away in another ref namespace, I think.)

  • Here, I am using a static web page to serve the data, but it could be a fancy web service with a real REST API. Specifically, it might be nice to add a GET parameter to the table of contents that allows us to specify a filter, such as https://{uri}/bundles?filter=blob:none. Alternatively, we could list the filter as part of the JSON objects and let the client decide without special modification to the URL.

  • Note: Bundles require modification to allow object filters, but that would be valuable for allowing these bundles to work at huge scale.

  • The bundle table of contents could be served from a CDN, but it could also be on a GHES replica or some other tiny service. It could even be hosted as a route on github.com and backed by a near-the-edge microservice.

  • Notice that I don't include any details about "how does the client discover the table of contents?" This is currently vague, but we could add things to the Git protocol to advertise the table's location. I think separating the table itself out of the origin Git server is helpful because we might want multiple, geodistributed locations. The GVFS Cache Servers do this: the origin advertises the possible cache server URLs and then the cache servers manage their own lists of precomputed packs. The client can decide which of those locations is best for them. The client could use a ping to test latency and choose the closest one that way. The specific way that Git could advertise this could look a lot like the gvfs/config endpoint, which carries more data than just the cache servers. We could create a "config" endpoint for clones that advertises these tables, but also advertises things like "you should use --filter=blob:none here" or other advanced recommendations.
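
The alternative mentioned above, listing the filter in the JSON objects and letting the client decide, might look like the sketch below. The "filter" member is hypothetical and not part of the current table of contents, and bundles would first need the modification noted above before filtered bundles could exist. The sample data is made up.

```python
from typing import Any, Dict, List, Optional


def entries_for_filter(toc: List[Dict[str, Any]],
                       wanted_filter: Optional[str] = None) -> List[Dict[str, Any]]:
    """Keep entries whose hypothetical 'filter' member matches the client's choice.

    Entries without a 'filter' member are treated as unfiltered (full) bundles.
    """
    return [e for e in toc if e.get("filter") == wanted_filter]


# Hypothetical table of contents mixing full and blobless bundles.
TOC = [
    {"uri": "full-1.bundle", "timestamp": 100},
    {"uri": "blobless-1.bundle", "timestamp": 100, "filter": "blob:none"},
    {"uri": "full-2.bundle", "timestamp": 200, "requires": "full-1.bundle"},
]
print([e["uri"] for e in entries_for_filter(TOC, "blob:none")])
print([e["uri"] for e in entries_for_filter(TOC)])
```

With this layout the URL stays the same for every client, and older clients that ignore the extra member would need some other signal to avoid filtered bundles.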

Owner
Derrick Stolee
I used to be a mathematician in computational graph theory. These days I spend most of my time contributing to Git.