πŸ‚ 🌾 Oxen.ai

Welcome to the Herd! This is a whirlwind tour of the Oxen.ai codebase. This is an evolving artifact meant to document the tool and codebase.

Each section dives into a different part of the code base as well as the file formats on disk. It is a resource for engineers who want to contribute or extend the tooling, or simply want to learn the inner workings.

What is Oxen?

Oxen, at its core, is a blazing-fast data version control tool written in Rust. It is optimized for large machine learning datasets. These datasets could consist of many small files (think an images/ folder for computer vision tasks), a few large files (a collection of time-series datasets as CSVs), or many large files (an LLM pre-training dataset of parquet files).

In the Git Book they define "Version Control" as "a system that records changes to a file or set of files over time so that you can recall specific versions later." As software engineers, we typically use tools such as git to version our source code. This allows us to keep every single version of a file so that we can revert to a previous state and compare changes over time. While git is great for versioning smaller assets such as files in a code base, it struggles to version large datasets.

Why build Oxen?

As machine learning engineers, we were frustrated with the speed of managing and iterating on datasets that traditionally would not fit well into git. There are extensions to git such as git-lfs but they are like fitting a square peg in a round hole and come with their own issues.

Data versions should be easy to interact with locally, fast to sync to a remote, and seamless to contribute to. It should feel like you have terabytes accessible at your fingertips, slicing and downloading subsets locally when you need them.

Why write this book?

"What I cannot create, I do not understand". - Richard Feynman

When it comes to open source contribution and scaling up a software project, this is true as well. This book is for developers to get an understanding of the internals, design decisions, and places for improvement in the Oxen.ai codebase. Open source is meant to be not only open, but understandable.

The concepts listed in this book are not perfect, but are meant to be guideposts for the current implementation. Along the way we will point out areas for improvement. If you get to a section and think "Why do we do this? There HAS to be a better way." you are probably right! Check out improvements for some ideas we already have, and feel free to add your own.

Why is Oxen fast?

This is always one of the first questions we get. The simple answer is that there are many optimizations that make Oxen fast. Many are fundamental computer science concepts, but stacked together they make for a nice developer experience when iterating on datasets.

Why the name Oxen?

"Oxen" πŸ‚ comes from the fact that the tooling will plow, maintain, and version your data like a good farmer tends to their fields 🌾. During the agricultural revolution the Ox allowed humans to automate the process of plowing fields so they could specialize in higher level tasks. Data is the lifeblood of ML/AI. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level problems that matter to your product.

Where to start?

First you will want to install Oxen. Once you have the tool up and running, we can dive into the implementation details. If you already have the tool up and running, feel free to skip directly to learning about domains or how to add a command.

Like any project, let's start by learning how to build and run the codebase.

πŸ› οΈ Development

There are a few ways of getting up and running with Oxen. The most straightforward way is to install the latest pre-built version of Oxen from the open source repository.

If you are actually going to be writing code, it is important to set up your development environment first. This section has resources on how to install Oxen from source, how to build and run Oxen, add your first command and unit test, and how to release a new version of Oxen.

πŸ§‘β€πŸ’» Installation

How to install the Oxen client, server, or python package. If you are a developer, you will want to build from source. If you are just flying by and learning Oxen, you can install the python package or the command line tool from the GitHub releases page.

πŸ’» Command Line Tools

The Oxen client can be installed via homebrew or by downloading the relevant binaries for Linux or Windows.

You can find the source code for the client here and can also build from source for your platform. The continuous integration pipeline will build binaries for each release in this repository.

Mac

brew tap Oxen-AI/oxen
brew install oxen

Ubuntu Latest

Check the GitHub releases page for the latest version of the client and server.

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-ubuntu-latest.deb
sudo dpkg -i oxen-ubuntu-latest.deb

Ubuntu 20.04

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-ubuntu-20.04.deb
sudo dpkg -i oxen-ubuntu-20.04.deb

Windows

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen.exe

Other Linux

Binaries are coming for other Linux distributions in the future. In the meantime, you can build from source.

🌎 Server Install

The Oxen server binary can be deployed wherever you want to store and back up your data. It is an HTTP server that the client communicates with to enable collaboration.

Mac

brew tap Oxen-AI/oxen-server
brew install oxen-server

Docker

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-docker.tar
docker load < oxen-server-docker.tar
docker run -d -v /var/oxen/data:/var/oxen/data -p 80:3001 oxen/oxen-server:latest

Ubuntu Latest

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-ubuntu-latest.deb
sudo dpkg -i oxen-server-ubuntu-latest.deb

Ubuntu 20.04

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-ubuntu-20.04.deb
sudo dpkg -i oxen-server-ubuntu-20.04.deb

Windows

wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server.exe

To get up and running using the client and server, you can follow the getting started docs.

🐍 Python Package

$ pip install oxenai

Note that this will only install the Python library and not the command line tool.

Installing Oxen through Jupyter Notebooks or Google Colab

Create and run this cell:

!pip install oxenai

πŸ”¨ Build & Run

Install Dependencies

Oxen is written purely in Rust πŸ¦€. You should install the Rust toolchain with rustup: https://www.rust-lang.org/tools/install.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

If you are a developer and want to learn more about adding code or the overall architecture start here. Otherwise a quick start to make sure everything is working follows.

Building from Source

To build the command line tool from source, you can follow these steps.

  1. Install rustup via the instructions at https://rustup.rs/

  2. Clone the repository https://github.com/Oxen-AI/Oxen

    git clone git@github.com:Oxen-AI/Oxen.git
    
  3. cd into the cloned repository

    cd Oxen
    
  4. Run this command (the release flag is recommended but not necessary):

    cargo build --release
    
  5. After the build has finished, the oxen binary will be in Oxen/target/release (or, if you did not use the --release flag, Oxen/target/debug).

    Now, to make it usable from a terminal window, you can either create a symlink or add it to your PATH.

  6. To add oxen to your PATH:

    Add this line to your .bashrc (or equivalent, e.g. .zshrc)

    export PATH="$PATH:/path/to/Oxen/target/release"
    
  7. Alternatively, to create a symlink, run the following command:

    sudo ln -s /path/to/Oxen/target/release/oxen /usr/local/bin/oxen
    

    Note that if you did not use the --release flag when building Oxen, you will have to change the path.

Library, CLI, Server

There are three components that are built during cargo build and they are separated into three directories within the src folder.

ls src
cli/
lib/
server/

The library is all the shared code between the CLI and Server. This contains the majority of the data structures and business logic. The CLI and Server are meant to be thin wrappers over the core oxen library functionality.

The library is also used for the Python Client which should also remain a thin wrapper.

Speed up the build process

You can use the mold linker to speed up builds (the commercial macOS version is called sold).

Assuming you have purchased a license, you can use the following instructions to install sold and configure cargo to use it for building Oxen:

git clone https://github.com/bluewhalesystems/sold.git

mkdir sold/build
cd sold/build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=c++ ..
cmake --build . -j $(nproc)
sudo cmake --install .

Then create .cargo/config.toml in your Oxen repo root with the following content:

[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=/usr/local/bin/ld64.mold"]

[target.x86_64-apple-darwin]
rustflags = ["-C", "link-arg=-fuse-ld=/usr/local/bin/ld64.mold"]

For macOS with Apple Silicon, you can use the lld linker.

brew install llvm

Then create .cargo/config.toml in your Oxen repo root with the following:

[target.aarch64-apple-darwin]
rustflags = [ "-C", "link-arg=-fuse-ld=/opt/homebrew/opt/llvm/bin/ld64.lld", ]

Run Oxen-Server

Generate a config file and token to give a user access to the server.

./target/debug/oxen-server add-user --email ox@oxen.ai --name Ox --output user_config.toml

Copy the config to the default locations

mkdir ~/.oxen
mv user_config.toml ~/.oxen/user_config.toml
cp ~/.oxen/user_config.toml data/test/config/user_config.toml

Set where you want the data to be synced. The default sync directory is ./data/. To change it, set the SYNC_DIR environment variable to a path.

export SYNC_DIR=/path/to/sync/dir

Run the server

./target/debug/oxen-server start

To run the server with live reload, first install cargo-watch

cargo install cargo-watch

Then run the server like this

cargo watch -- cargo run --bin oxen-server start

CLI Commands

Now feel free to try out some CLI commands and see the tool in action!

oxen init .
oxen status
oxen add images/
oxen status
oxen commit -m "added images"
oxen create-remote --name ox/wikipedia --host 0.0.0.0:3001 --scheme http
oxen config --set-remote origin http://localhost:3001/ox/wikipedia
oxen push origin main

Adding a Command

The main entry point to the Command Line Interface (CLI) is through the main.rs file. This file is located in the Oxen/src/cli/src directory.

Each command is defined in its own submodule and implements the RunCmd trait.

#[async_trait]
pub trait RunCmd {
    fn name(&self) -> &str;
    fn args(&self) -> clap::Command;
    async fn run(&self, args: &clap::ArgMatches) -> Result<(), OxenError>;
}

These submodules can be found in the cmd subdirectory. They are named after the command they implement. For example, if you are curious how oxen add is implemented, you would look at add.rs.

Moo' World

To show this pattern in action, let's add a new command to Oxen. This new command will be a simple "Hello, World!" command named "moo", implemented in the moo.rs file.

The command simply prints "moo" when you run oxen moo. It also takes a --loud flag, which makes it print "MOO!" instead, as well as a -n flag, which adds extra o's to the end of the string.

$ oxen moo
moo
$ oxen moo --loud
MOO!
$ oxen moo -n 10
moooooooooo

Name The Command

The first method to implement in the trait is simply the name of the command. This is used to identify the command in the CLI and in the help menu.

impl RunCmd for MooCmd {
    fn name(&self) -> &str {
        "moo"
    }
}

Setup Args

The next step is setting up the command line arguments. We use the clap crate to handle argument parsing. The arguments are defined in the args method.

impl RunCmd for MooCmd {
    fn args(&self) -> Command {
        // Sets up the CLI args for the command
        Command::new(NAME)
            .about("Hello, world! πŸ‚")
            .arg(
                Arg::new("number")
                    .long("number")
                    .short('n')
                    .help("How long is the moo?")
                    .default_value("2")
                    .action(clap::ArgAction::Set),
            )
            .arg(
                Arg::new("loud")
                    .long("loud")
                    .short('l')
                    .help("Make the MOO louder.")
                    .action(clap::ArgAction::SetTrue),
            )
    }
}

Parse Args and Run Command

Finally, we need to implement the run method, which is called with the parsed command line arguments when the command is executed.

impl RunCmd for MooCmd {
    async fn run(&self, args: &clap::ArgMatches) -> Result<(), OxenError> {
        // Parse Args
        let n = args
            .get_one::<String>("number")
            .expect("Must supply number")
            .parse::<usize>()
            .expect("number must be a valid integer.");

        let loud = args.get_flag("loud");
        if loud {
            // Print the moo loudly with -n number of o's
            println!("M{}!", "O".repeat(n));
        } else {
            // Print the moo with -n number of o's
            println!("m{}", "o".repeat(n));
        }

        Ok(())
    }
}

If a command returns an OxenError, it will be handled and printed in the main.rs file, and the process will exit with a non-zero exit code.

Add to CLI

Now that our command is implemented, we need to add it to the CLI. This is done in the main.rs file. All you need to do is add a new instance of your command to the cmds vector. The rest of the file is just adding the arguments, parsing them, then calling your run method.

let cmds: Vec<Box<dyn cmd::RunCmd>> = vec![
    Box::new(cmd::AddCmd),
    Box::new(cmd::MooCmd), // Your new command
];

// ... run commands

This should be all you need to get Oxen to go "MOO!". Let's build and run.

cargo build
./target/debug/oxen moo --help

You will see the help menu for your new command.

Hello, world! πŸ‚

Usage: oxen moo [OPTIONS]

Options:
  -n, --number <number>  How long is the moo? [default: 2]
  -l, --loud             Make the MOO louder.
  -h, --help             Print help

Then you can simply run your command.

./target/debug/oxen moo

You should see the output "moo"

moo

You can also make the moo louder with the --loud flag and add more o's with the -n flag.

$ ./target/debug/oxen moo --loud
MOO!
$ ./target/debug/oxen moo -n 10
moooooooooo

πŸŽ‰ And there you have it!

Congrats on adding your first command to Oxen! The moo command is already implemented in the main Oxen codebase as an easter egg and an example you can follow along with.

Coding Guidelines

TODO: Add some basic rust ones

https://doc.rust-lang.org/nightly/style-guide/

fmt & clippy

Before checking in a PR, please make sure to run cargo fmt and cargo clippy. This will format your code and check for errors.

cargo fmt
cargo clippy --fix --allow-dirty

Try to avoid .clone() if possible

Pass values by reference (or move ownership) instead of cloning them unless absolutely necessary. Cloning can be expensive, especially for large structs or strings.
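
As a minimal illustrative sketch (hypothetical functions, not from the codebase):

use std::path::PathBuf;

// Avoid: clones every PathBuf just to read the list
fn count_images_cloned(paths: &Vec<PathBuf>) -> usize {
    let copied = paths.clone(); // full deep copy of every path
    copied.len()
}

// Prefer: borrow a slice, no allocation at all
fn count_images(paths: &[PathBuf]) -> usize {
    paths.len()
}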

Use PathBuf and Path over String and str

When referencing file system paths, use Path and PathBuf over String and &str. This is because PathBuf is a struct that represents a path and is more powerful than a raw string. For example, it makes sure the paths are cross-platform (Windows and Unix) and allows you to check if a path is a file or directory. PathBuf also has other useful methods to get the file name, directory name, components, etc.
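
As a small sketch of what the standard Path API gives you over a raw string:

use std::path::Path;

let path = Path::new("images").join("image0.jpg"); // correct separator on every platform
assert_eq!(path.file_name().unwrap(), "image0.jpg");
assert_eq!(path.extension().unwrap(), "jpg");
assert_eq!(path.parent().unwrap(), Path::new("images"));
println!("exists as a file: {}", path.is_file()); // checks the actual filesystem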

Use impl AsRef where possible

As function parameters, instead of taking in a &Path, &str, PathBuf, or String, take in an impl AsRef<Path> or impl AsRef<str>. This way the consumer can pass in whatever type is most convenient, borrowed or owned, and does not have to convert the value to a reference first.

This makes it much easier and more flexible for external consumers.

TODO: Examples of signatures and external consumers

pub fn load_path(repo: &LocalRepository, path: impl AsRef<Path>) -> Result<MerkleTreeNode>

vs

pub fn load_path(repo: &LocalRepository, path: PathBuf) -> Result<MerkleTreeNode>

vs

pub fn load_path(repo: &LocalRepository, path: &Path) -> Result<MerkleTreeNode>
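
With the impl AsRef<Path> version, all of these hypothetical call sites compile without any conversion on the caller's side:

use std::path::{Path, PathBuf};

let node = load_path(&repo, "images/image0.jpg")?;                    // &str
let node = load_path(&repo, String::from("images/image0.jpg"))?;      // String
let node = load_path(&repo, PathBuf::from("images/image0.jpg"))?;     // PathBuf
let node = load_path(&repo, Path::new("images").join("image0.jpg"))?; // joined PathBuf

With the PathBuf version, the &str callers would each need an explicit PathBuf::from, and the &Path version forces a borrow at every call site.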

Use util::fs functions over std::fs

The util::fs functions handle errors a little more gracefully and have additional functionality for reading and writing to the file system cross-platform. For example, std::fs::remove_file does not tell you which file could not be removed and will give you an error like this:

Os { code: 2, kind: NotFound, message: "No such file or directory" }

util::fs::remove_file will add the file name to the error message so you can see which file could not be removed.
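
A minimal sketch of that pattern (not the exact liboxen implementation, and assuming an error constructor along the lines of OxenError::basic_str):

use std::path::Path;

pub fn remove_file(path: impl AsRef<Path>) -> Result<(), OxenError> {
    let path = path.as_ref();
    std::fs::remove_file(path)
        // attach the offending path to the error message
        .map_err(|e| OxenError::basic_str(format!("Could not remove file {path:?}: {e}")))
}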

Be cognizant of loading too much of the Merkle Tree

The CommitMerkleTree struct lets you load subsets of the merkle tree and skip directly to dir nodes. It also has functionality to only load 1 or 2 levels deep (avoiding deep recursion).

Make sure you are only loading what you need to get the information you need to return.

For example, if you just need the size of a directory, you don't need to load its children.

let load_recursive = false;
let node = CommitMerkleTree::from_path(repo, commit, path, load_recursive)?;

If you need all the files in a directory, and you don't want all the levels below it, you can specify the depth to load.

// This will load the VNodes and the FileNode/DirNode children of the VNodes
let node = CommitMerkleTree::read_depth(repo, hash, 2)?;

Testing

We are all smart software engineers, but when it comes to entering a new codebase we all want confidence that making a change doesn't have a cascading effect. It is important to make sure that turning off the (proverbial) lights in the kitchen πŸ’‘ doesn't make the roof collapse 🏠.

Luckily each command within Oxen has a well defined interface, and each command can be tested independently.

For example:

// Initialize Repo
let repo = repositories::init("./test_repo")?;
// Add File
repositories::add(&repo, "hello.txt")?;
// Commit File
repositories::commit(&repo, "add hello.txt")?;

We chain these commands together into a sequence of integration and unit tests to make sure the end-to-end system works as expected.

Writing Tests

The best place to reference when looking at tests within Oxen is the lib/src/repositories module itself. You'll find some familiar names within the repositories:: namespace, such as repositories::init, repositories::add, and repositories::commit.

We follow a Domain Driven Design approach to development. The tests are located within the same module as the code they are testing. Check out all the domain objects here.

All tests for these commands are found below their respective module. Let's look at an example command and break down the different parts of the test.

#[cfg(test)]
mod tests {
    // ... include necessary modules

    #[test]
    fn test_repositories_init() -> Result<(), OxenError> {
        test::run_empty_dir_test(|repo_dir| {
            // Init repo
            let repo = repositories::init(repo_dir)?;

            // Init should create the .oxen directory
            let hidden_dir = util::fs::oxen_hidden_dir(repo_dir);
            let config_file = util::fs::config_filepath(repo_dir);
            assert!(hidden_dir.exists());
            assert!(config_file.exists());

            Ok(())
        })
    }
}

First you will notice that the tests are within a mod tests block. This is a Rust feature that allows you to group tests together within a particular module.

In order to run all the tests within a particular command module you can run:

cargo test --lib repositories::init

This will run all the tests within the repositories::init module.

Returning Errors

You will notice that all the tests return Result<(), OxenError>. This means they will catch any errors that might occur when running different commands.

The OxenError is a custom error type that is defined in the lib/src/error.rs file. It is a simple enum that represents an error that can occur in Oxen. When you use the ? operator on a function that returns a Result<(), OxenError>, the error will propagate up and the test will fail.
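
The ? operator is shorthand for matching on the Result and returning early on the error arm. Roughly:

// With the ? operator:
let repo = repositories::init(repo_dir)?;

// Is roughly equivalent to:
let repo = match repositories::init(repo_dir) {
    Ok(repo) => repo,
    Err(err) => return Err(err), // the test returns early and fails
};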

Setup & Teardown

Next you will see that most tests are wrapped in a closure defined in our test.rs file.

test::run_empty_dir_test(|repo_dir| {
    // ... your test code here

    Ok(())
})

These closures take care of a lot of the boilerplate around setting up a test directory, and deleting it after the test is run.

For example run_empty_dir_test will pass a unique directory to the closure, and delete it when finished. This way we can run all the isolated tests in parallel and not worry about leaking files from one test impacting another.

There are many other helper functions you can use to setup and teardown your tests, including populating repositories with sample data, and setting up remote repositories. See the full list in the test.rs file.

Running All Tests

Make sure your server is running on the default host and port, then run the test suite.

Note: tests open up a lot of file handles, so limit the number of test threads if running everything.

You can also increase the number of open files your system allows with ulimit before running the tests:

ulimit -n 10240
cargo test -- --test-threads=$(nproc)

It can be faster (in terms of compilation and runtime) to run a specific test. To run a specific library test:

cargo test --lib test_get_metadata_text_readme

To run with all debug output and run a specific test

env RUST_LOG=debug,liboxen=debug,integration_test=debug cargo test -- --nocapture test_command_push_clone_pull_push

To set a different test host you can set the OXEN_TEST_HOST environment variable

env OXEN_TEST_HOST=0.0.0.0:4000 cargo test

πŸ‚ Releasing Oxen Into The Wild 🌾

Right now this is mainly for me to document how I release new versions of the open source Oxen.ai binaries.

If anyone wants to help with the release process, please let me know!

Bump CLI/Server Versions

For the CLI and Oxen-Server binaries, make sure to update the version in all Cargo.toml files in our Oxen-AI/Oxen repo.

Create Tag

We use git tags to kick off CI within GitHub actions.

git tag -a v$VERSION -m "version $VERSION"

Push Tag

Builds will show up in this repository's releases with the tag you just specified.

git push origin v$VERSION

Update Homebrew Install

There are separate homebrew repositories for the oxen CLI and the oxen-server binary.

Oxen-AI/homebrew-oxen

Oxen-AI/homebrew-oxen-server

You will need to compute shasum(s) of each release and update the Formula/*.rb in both repos above.

Use the compute_hashes.sh script in homebrew-oxen repo to compute the shasum(s) of each release.

To verify the formula(s) locally:

cd /path/to/homebrew-oxen
brew install Formula/oxen.rb
oxen --version
cd /path/to/homebrew-oxen-server
brew install Formula/oxen-server.rb
oxen-server --version

Update Release Notes

TODO: We need to get better at this.

Suggestions welcome πŸ™.

πŸ‚ Domain Objects

Now for the fun part! Hopefully you have already built Oxen and learned how to add your first command.

In order to fully grok the Oxen codebase, it's important to define a few terms and understand the different domain objects. This way you'll have the right terminology to build upon and know where to look when adding or debugging features.

These domains are defined so we are all speaking the same language while diving into the code base. We will start with what the objects are, why they exist, and how objects are stored on disk, then we will build up intuition of how the system works as a whole.

πŸ‘€ Peeking Under the Hood

Similar to git, we store all the metadata for a repository in a hidden local .oxen directory. To start the learning journey, let's initialize an empty Oxen repository locally using oxen init.

mkdir my-repo
cd my-repo
oxen init
echo "# New Oxen Repo" > README.md
oxen add README.md
oxen commit -m "Initial Commit"

The best way to start learning the architecture and different domain objects is by poking around in this directory.

ls .oxen

You will see a variety of files and folders, including:

HEAD
config.toml
history/
refs/
tree/
versions/

Let's use these files and folders as a jumping off point to learn about the different domain objects.

First Up: Repositories

All of the domain objects exist within the context of a "Repository", so let's start there. All of the files and folders within the .oxen directory represent different sub-components of a Repository, but we need some overarching objects to kick the whole process off. These are what we call the LocalRepository and RemoteRepository.

Repositories

When we talk about data in Oxen, we usually talk about "Repositories". A Repository lives within your working directory of data in a hidden .oxen directory. You can think of a Repository as a series of snapshots of your data at any given point in time.

File Versions

Each snapshot contains a "mini filesystem" representing all the files and folders at that point in time. Each mini filesystem is represented by a commit, and is stored in the .oxen directory so that we can return to it at any point in time.

To see this in action let's instantiate a local oxen repository and see what it looks like.

$ oxen init
$ ls -trla
total 0
drwxr-xr-x  23 bessie  staff  736 May 22 16:41 ../
drwxr-xr-x   3 bessie  staff   96 May 22 16:41 ./
drwxr-xr-x  10 bessie  staff  320 May 22 16:41 .oxen/

This magic .oxen directory is what will hold all the snapshots of your data. Think of it as a local database that lets you roll back your data to any point in time.

Content Addressable File System

How are the different versions stored on disk? Let's add and commit some files to the repository and see what happens.

$ echo "Hello" > hello.txt
$ echo "World" > world.txt
$ oxen add hello.txt world.txt
$ oxen commit -m "Add hello.txt and world.txt"

Each file that gets added and committed to oxen gets stored in a Content Addressable File System in the .oxen/versions directory. Oxen first computes a hash of the file, then stores the file in a sub directory that mirrors the hash. This means that the file can be retrieved by its hash at any time.

$ tree .oxen/versions

.oxen/versions
└── files
    β”œβ”€β”€ 18
    β”‚Β Β  └── 066113d946cfa640ffc8773c83f61b
    β”‚Β Β      └── data
    └── a7
        └── 666c8f5aaf946ca629d9d20c29aa6a
            └── data

6 directories, 2 files

What's up with these funky hexadecimal directory names? Well, each directory is named after the hash of the file it contains. To see this in action, Oxen has a handy command to inspect information about an individual file.

oxen info -v world.txt

hash	size	data_type	mime_type	extension	last_updated_commit_id
18066113d946cfa640ffc8773c83f61b	6	text	text/plain	txt	2c610ae8e424a4c8

oxen info prints out a tab separated list of the hash, size, data type, mime type, extension, and the last updated commit id of the file.

In this case, the hash for the world.txt file is 18066113d946cfa640ffc8773c83f61b. As for the directory structure above, you can see we split the hash and use the first two characters (18) of the hash as a prefix to the directory name. This is a common pattern in content addressable file systems to make sure you do not have too many sub-directories in a single directory.
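
As an illustrative sketch (not the exact liboxen helper), deriving the version path from a hash looks like this:

use std::path::PathBuf;

// e.g. 18066113d946cfa640ffc8773c83f61b
//   -> .oxen/versions/files/18/066113d946cfa640ffc8773c83f61b/data
fn version_path(hash: &str) -> PathBuf {
    let (prefix, rest) = hash.split_at(2);
    PathBuf::from(".oxen")
        .join("versions")
        .join("files")
        .join(prefix)
        .join(rest)
        .join("data")
}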

Manually Inspect Older Versions

Currently the files in Oxen are uncompressed in the versions directory, so you can simply cat the file to see the contents.

$ cat .oxen/versions/files/a7/666c8f5aaf946ca629d9d20c29aa6a/data

Hello

Note: We have compression in our list of future improvements that could be made to the system, but the fact that we keep the files uncompressed is a nice property. It allows us to take advantage of the native file format of the files on disk without additional compression / decompression steps.

Storing New Versions

Let's change the hello.txt file and commit it again.

$ echo "Hello, World!" > hello.txt
$ oxen add hello.txt
$ oxen commit -m "Update hello.txt"

Now look at the .oxen/versions directory. You will see that we have a new hashed directory for the file. This means that the file has been updated and a new snapshot has been created.

$ tree .oxen/versions

.oxen/versions
└── files
    β”œβ”€β”€ 18
    β”‚Β Β  └── 066113d946cfa640ffc8773c83f61b
    β”‚Β Β      └── data
    β”œβ”€β”€ a7
    β”‚Β Β  └── 666c8f5aaf946ca629d9d20c29aa6a
    β”‚Β Β      └── data
    └── ce
        └── 1931b6136c7ad3e2a42fb0521986ba
            └── data

8 directories, 3 files

Let's look at each individual file in the versions dir.

$ cat .oxen/versions/files/a7/666c8f5aaf946ca629d9d20c29aa6a/data
Hello

$ cat .oxen/versions/files/18/066113d946cfa640ffc8773c83f61b/data
World

$ cat .oxen/versions/files/ce/1931b6136c7ad3e2a42fb0521986ba/data
Hello, World!

While this doesn't give you the full picture of how Oxen works, hopefully it gives you a starting point into the Content Addressable File System that Oxen uses to store all versions of the files. We will get into the details of the commit databases and other data structures as we dive into more domains.

LocalRepository

Since all of the data for all of the versions is simply stored in a hidden subdirectory, the first object we introduce is the LocalRepository. This object simply represents the path to the repository so that we know where to look for subsequent objects.

src/lib/src/model/repository/local_repository.rs

pub struct LocalRepository {
    pub path: PathBuf,
    // Optional remotes to sync the data to
    remote_name: Option<String>,
    pub remotes: Vec<Remote>,
}

Whenever starting down a code path within the CLI, the first thing we do is find where the .oxen directory is and instantiate our LocalRepository object.

There is a handy helper method to get a repo from the current directory. This recursively traverses up the directory structure to find a .oxen directory and instantiates the LocalRepository object.

let repository = LocalRepository::from_current_dir()?;

You may want to reference the code for the add command to see how instantiating a LocalRepository works in practice.

You will notice that not only does a LocalRepository have a path, but it also has a remote_name and remotes. These are read from .oxen/config.toml and inform Oxen where to sync the data.

Remotes

A remote in the context of Oxen is simply a name and a url. The name is a human readable representation and the url is the actual location of the remote repository.

pub struct Remote {
    pub name: String,
    pub url: String,
}

The remotes can be set through the oxen config command.

oxen config --set-remote origin http://localhost:3001/my-namespace/my-repo

If you look in the .oxen/config.toml file you will see the remotes listed there.

remote_name = "origin"

[[remotes]]
name = "origin"
url = "http://localhost:3001/my-namespace/my-repo"

You can have multiple remotes as well as a default remote specified by remote_name. The default remote is the remote that will be used when you run oxen push or oxen pull without specifying a remote.
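
For example, you could add a second remote (here a hypothetical backup server) alongside origin and push to it explicitly:

oxen config --set-remote backup http://backup-server:3001/my-namespace/my-repo
oxen push backup main
oxen push

The first push targets the named remote, while the bare oxen push uses the default remote, origin.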

RemoteRepository

On the other end of the LocalRepository is the RemoteRepository. This object represents the remote repository that the LocalRepository is connected to. It has the same url as the Remote object.

pub struct RemoteRepository {
    pub namespace: String,
    pub name: String,
    pub remote: Remote,
}

All repositories that are stored on the oxen-server have a namespace and name. This helps us organize the repositories on disk in a way that is meaningful to the user.

In order to create a RemoteRepository we will first need to spin up an oxen-server instance. From your debug build you can do something like the following.

export SYNC_DIR=/path/to/sync/dir
./target/debug/oxen-server start

This will start a server on the default host 0.0.0.0 and port 3000. The environment variable SYNC_DIR tells the server where to write the data to on disk.

Then we can use the oxen create-remote command from the CLI.

oxen create-remote --name my-namespace/my-repo --host 0.0.0.0:3000 --scheme http

If you look in the SYNC_DIR you will see a directory structure that mirrors the namespace/repo-name of the repository you just created. There will be a .oxen directory with the remote repository created for you as well.

ls -trla /path/to/sync/dir/my-namespace/my-repo/.oxen

What's cool is that on disk the RemoteRepository is the same structure as the LocalRepository. This means that we can use the same code to manipulate the RemoteRepository on the server as we can the LocalRepository on the client.

If you didn't configure the remote earlier, you can do so now.

oxen config --set-remote origin http://0.0.0.0:3000/my-namespace/my-repo

Then simply push the data to the remote.

oxen push

This copies all the data from the local .oxen directory to the remote repository. Remember the versions directory from before? Let's see what it looks like on the remote.

$ cat /path/to/sync/dir/my-namespace/my-repo/.oxen/versions/files/ce/1931b6136c7ad3e2a42fb0521986ba/data
Hello, World!

There we go! Data is intact on the remote server. This is the beauty of Oxen. There are not too many fancy bells and whistles when you look under the hood. Just a content addressable file system with a library that is shared between the client and server.

Next up we will look at Commits. These objects represent the group of files that are in a single snapshot, and we will learn how Oxen knows which versions were added, removed, or changed in the repository and when.

Commits

If you are familiar with git, the concepts of a commit and branch should be very familiar. What you may not have done is look under the hood at how they are stored. In Oxen, many of the concepts are similar.

A commit is a checksum or hash value representing all the files within a specific version. You may recognize them as a string of hexadecimal characters (0–9 and a–f) looking something like a72b68036af144bfe2dff0fb08a746c4.

Run oxen log within your Oxen repository and you will see the initial commit.

commit a72b68036af144bfe2dff0fb08a746c4

Author: ox
Date:   Thursday, 09 May 2024 22:29:00 +00

    Initialized Repo πŸ‚

You will see these hashes all over the place in Oxen and can use them as pointers to get to specific versions.

Commits as Merkle Tree Nodes

Under the hood most objects in Oxen are stored in a Merkle Tree data structure. At the root of each merkle tree is a commit object.

All the nodes in the tree are stored in the .oxen/tree/nodes directory.

$ tree .oxen/tree/nodes/

.oxen/tree/nodes/
β”œβ”€β”€ 589
β”‚Β Β  └── 8d0aa535709791ea84a341307fc3
β”‚Β Β      β”œβ”€β”€ children
β”‚Β Β      └── node
β”œβ”€β”€ 88b
β”‚Β Β  └── e33604f2ae2153443bff158c31495
β”‚Β Β      β”œβ”€β”€ children
β”‚Β Β      └── node
└── a72
    └── b68036af144bfe2dff0fb08a746c4
        β”œβ”€β”€ children
        └── node

You'll see that nodes are content addressable by their hash, and each subdirectory is a level in the merkle tree. These files are hard to inspect on their own, so we can use the oxen node command to inspect the individual node databases.

$ oxen node a72b68036af144bfe2dff0fb08a746c4

CommitNode
	hash: a72b68036af144bfe2dff0fb08a746c4
	message: adding README
	parent_ids: []
	author: oxbot
	email: oxbot@oxen.ai
	timestamp: 2024-08-19 23:06:41.525894 +00:00:00

Here we have a nice beautiful commit object. There is a list of parent commit ids, a message, the author, the email, and the timestamp.

To see the full tree that lies below this commit, you can use the oxen tree command.

$ oxen tree -n a72b68036af144bfe2dff0fb08a746c4

You'll see that the tree is printed out in a human readable format. This tree only has a single README.md file in the root directory. Trees can get much more complex, and we will dive into this more in the Merkle Trees section.

[Commit] a72b68036af144bfe2dff0fb08a746c4 -> adding README parent_ids ""
  [Dir]  -> 5898d0aa535709791ea84a341307fc3 11 B (1 nodes) (1 files) [latest commit a72b68036af144bfe2dff0fb08a746c4]
    [VNode] 88be33604f2ae2153443bff158c31495 (1 children)
      [File] README.md -> 43744a971e29c0f56c293f855f11814 11 B [latest commit a72b68036af144bfe2dff0fb08a746c4]

Commit Metadata

All of the metadata within a commit object is important for computing its id. The id can be used to verify the integrity of the data within a commit. More on this later.

The first piece of metadata is the user that made the commit. The user data is read from the global ~/.config/oxen/user_config.toml file. You can set your user info with the oxen config command.

$ oxen config --name 'Bessie' --email 'bessie@your_email.com'
$ cat ~/.config/oxen/user_config.toml
name = "Bessie"
email = "bessie@your_email.com"

It also contains the timestamp of the commit, and a user provided message. All of these pieces of data are used in computing the commit id, which is a unique representation of the data in this commit.

Commit Id (Hash)

Each commit has a unique id (hash) that can be verified to ensure the integrity of the data in this commit. It is a combination of the data within all the files of the commit, the user data, timestamp, and the message.

What's nice about this is that once the data has been synced to the remote server, we can verify that the data is valid by computing the hashes of the files and the commit data and comparing this to the id of the commit in the database.

Commit History

Every commit (except the first) has a list of parent commit ids. Most commits have a single parent, but in the case of a merge commit, there can be multiple parent commit ids. You can traverse the commit history by following the parent commit ids until you hit the first commit.
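
A sketch of that traversal (illustrative; load_commit is a hypothetical helper, and the fields mirror the CommitNode shown earlier):

// Walk from a commit back to the root by following parent ids
fn print_history(mut commit: Commit) -> Result<(), OxenError> {
    loop {
        println!("{} {}", commit.id, commit.message);
        match commit.parent_ids.first() {
            Some(parent_id) => commit = load_commit(parent_id)?,
            None => break, // the initial commit has no parents
        }
    }
    Ok(())
}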

You can use the oxen log command to print out the commit history starting with the most recent commit on the current branch.

Next Up: Branches

Learn how commits relate to Branches in the next section.

Branches

Branches are a key feature of many VCS systems. They allow users to work in parallel without making changes that step on each other's toes.

The branching model in Oxen is inspired by git meaning branches are lightweight and quick to create. When creating a branch, we are never copying any of the raw datasets in the repository. Under the hood, a branch is really just a named reference to a commit. Creating a new branch simply creates a new named reference.

pub struct Branch {
    pub name: String,
    pub commit_id: String,
}

On the first commit of a repository, a default branch called main is created and points to the initial commit.

Refs

To see how this works in practice, let's look at how branches are stored on disk. All of the branches within a repository are stored in a key-value rocksdb database. This database can be found in the .oxen/refs directory.

Let's inspect this database with our oxen db list command.

$ oxen db list .oxen/refs

main	c719c887cc250784

This shows us that there is a single branch, main, that points to the commit id c719c887cc250784.

If we create a new branch, say foo, it will also be stored in the database with the same commit id as the current branch you are on.

$ oxen checkout -b foo
$ oxen db list .oxen/refs

main	c719c887cc250784
foo	c719c887cc250784

To see the list of current branches as well as which one you currently have checked out, you can use the oxen branch command.

$ oxen branch

* foo
  main

The * indicates the foo branch is currently checked out. The way we store the current branch is by creating a HEAD file in the .oxen directory.

This file contains the name of the branch or commit id that is currently checked out.

$ cat .oxen/HEAD

foo

Let's make a commit and see how the branches stored on disk change.

$ echo "foo" > foo.txt
$ oxen add foo.txt
$ oxen commit -m "foo commit"

Committing with message: foo commit
Commit 9ef4176b1b4422a7 done.

We now have a new commit id 9ef4176b1b4422a7. If we look at the refs database, we can see that the foo branch has been updated to point to the new commit id.

$ oxen db list .oxen/refs

foo	9ef4176b1b4422a7
main	c719c887cc250784

If we look at oxen log we will see that the foo commit is now the most recent commit.

commit 9ef4176b1b4422a7

Author: Ox Bot
Date:   Thursday, 30 May 2024 04:04:53 +00

    foo commit

commit c719c887cc250784

Author: Ox Bot
Date:   Tuesday, 28 May 2024 03:03:49 +00

    adding questions.jsonl

You can checkout a specific commit by using the oxen checkout command with the commit id.

$ oxen checkout c719c887cc250784

This will update the HEAD file to point to the commit id instead of the branch name.

$ cat .oxen/HEAD

c719c887cc250784

You will notice that our foo.txt file is no longer present in the working directory. If you run oxen status you will see that we are now in a "detached HEAD" state. This means that we are no longer on a branch and are instead on an individual commit.

Don't worry, the file foo.txt is still alive and well in the .oxen/versions directory, and can be restored by checking out the foo branch again.

$ oxen checkout foo

That's it! The relationship between branches, commits, and the HEAD commit is really that simple. Branches are just a named reference to a commit id that make it easier to find a particular chain of commits.

You can progress a branch as many commits as you want without affecting the main branch. When you are ready to merge your branch into the main branch, you can use the oxen merge command which will be covered later.

Next Up: Files & Directories

Now that you know the basic data structures for branches and commits, let's dive into how branches and commits are tied to a set of files and directories with the Merkle Tree data structure.

Next Up: Merkle Trees

Files, Directories and Merkle Trees 🌲

When you create a commit within Oxen, you can think of it as a snapshot of the state of the files and directories in the repository at a particular point in time. This means each commit will need a reference to all the files and directories that are present in the repository at that point in time.

Let's use a small dataset as an example.

README.md
LICENSE
images/
  image0.jpg
  image1.jpg
  image2.jpg
  image3.jpg
  image4.jpg

This is a simple file structure with a README.md at the top level and a sub-directory of images. Start by initializing a new repository, then adding and committing the files.

oxen init
oxen add README.md
oxen add images/
oxen commit -m "adding data"

On commit we save off all the hashes of the file contents and save the data into a Content Addressable File System (CAFS) within the .oxen/versions directory. This makes it so we don't duplicate the same file data across commits.

$ tree .oxen/versions/

.oxen/versions/
β”œβ”€β”€ 43
β”‚Β Β  └── 94f02b679bcf0114b1fb631c250d0a
β”‚Β Β      └── data
β”œβ”€β”€ 58
β”‚Β Β  └── 8b7f5296c1a6041d350d1f6be41b3
β”‚Β Β      └── data
β”œβ”€β”€ 64
β”‚Β Β  └── e1a1512c6d5b1b6dcf2122326370f1
β”‚Β Β      └── data
β”œβ”€β”€ 74
β”‚Β Β  └── bfd17b6b7c9b183878a26e1e62a30e
β”‚Β Β      └── data
β”œβ”€β”€ 7c
β”‚Β Β  └── 42afd26e73b8bfbc798288f1def1ed
β”‚Β Β      └── data
β”œβ”€β”€ c8
β”‚Β Β  └── 2d11a1e1223598d930454eecfab6ea
β”‚Β Β      └── data
└── dc
    └── 92962a4b05f5453718783fe3fc4b10
        └── data

15 directories, 7 files

Each file is accessible by its hash and the original extension the file was stored with. For example, the hash of images/image0.jpg is 74bfd17b6b7c9b183878a26e1e62a30e and its extension is jpg, so the original contents can be found at .oxen/versions/74/bfd17b6b7c9b183878a26e1e62a30e/data.

To find the hash and extension of any file in a commit, you can use the oxen info command.

oxen info images/image0.jpg
74bfd17b6b7c9b183878a26e1e62a30e	13030	image	image/jpeg	jpg	12099a4ca3b15c36

The CAFS makes it easy to fetch the file data for a given commit, but we need some sort of database that lists the original file names and paths. This way when switching between commits we can efficiently restore the files that have been added/changed/removed.

Switching Between Versions

The simplest solution would be to have a key-value database for every commit that listed the file paths and pointed to their hashes and extensions.

Commit A

README.md -> {"hash": "64e1a1512c6d5b1b6dcf2122326370f1", "extension": ".md"}
LICENSE -> {"hash": "7c42afd26e73b8bfbc798288f1def1ed", "extension": ""}
images/image0.jpg -> {"hash": "74bfd17b6b7c9b183878a26e1e62a30e", "extension": ".jpg"}
images/image1.jpg -> {"hash": "dc92962a4b05f5453718783fe3fc4b10", "extension": ".jpg"}
images/image2.jpg -> {"hash": "588b7f5296c1a6041d350d1f6be41b3", "extension": ".jpg"}
images/image3.jpg -> {"hash": "c82d11a1e1223598d930454eecfab6ea", "extension": ".jpg"}
images/image4.jpg -> {"hash": "4394f02b679bcf0114b1fb631c250d0a", "extension": ".jpg"}

We could store this in a rocksdb database in .oxen/history/{commit_hash}/files. The keys would be the file paths and the values would be the hashes and extensions. Then when swapping between commits all we would have to do is clear the current working directory and re-construct all the files from the respective commit database!

Pseudo code:

set commit_hash 1d278f841510b8e7
rm -rf working_dir
for dir, hash, ext in (oxen db list .oxen/history/$commit_hash/files) ;
  mkdir -p working_dir/$dir ;
  cp .oxen/versions/files/$hash/data$ext working_dir/$dir/ ;
end

Version control complete. Let's call it a day and go relax on the beach 😎 🏝️.

Of course, we are not here to build a naive, inefficient version control tool. Oxen is a blazing fast version control system that is designed to handle large amounts of data efficiently. Even if clearing and restoring the working directory is simple, there are many reasons it is not optimal (including wiping out untracked files).

Data Duplication πŸ˜₯

To see why this naive approach is sub-optimal, imagine we are collecting image training data for a computer vision system. We put Oxen in a loop adding one new image at a time to the images/ directory. Each time we add an image we commit the changes.

for i in (seq 100) ;
  # imaginary data collection pipeline
  cp /path/to/images/image$i.jpg images/image$i.jpg ;

  # oxen add and commit
  oxen add images/image$i.jpg ;
  oxen commit -m "adding image$i" ;
end

If we had gone the naive route, this would balloon in redundancy even with just our list of pointers to hashes. Each database list repeats the same file paths and the hashes over and over again.

Commit A

README.md         -> hash1
LICENSE           -> hash2
images/image0.jpg -> hash3
images/image1.jpg -> hash4
images/image2.jpg -> hash5
images/image3.jpg -> hash6

Commit B

README.md         -> hash1 # repeated 1 time
LICENSE           -> hash2 # repeated 1 time
images/image0.jpg -> hash3 # repeated 1 time
images/image1.jpg -> hash4 # repeated 1 time
images/image2.jpg -> hash5 # repeated 1 time
images/image3.jpg -> hash6 # repeated 1 time
images/image4.jpg -> hash7 # NEW

Commit C

README.md         -> hash1 # repeated 2 times
LICENSE           -> hash2 # repeated 2 times
images/image0.jpg -> hash3 # repeated 2 times
images/image1.jpg -> hash4 # repeated 2 times
images/image2.jpg -> hash5 # repeated 2 times
images/image3.jpg -> hash6 # repeated 2 times
images/image4.jpg -> hash7 # repeated 1 time
images/image5.jpg -> hash8 # NEW

...

Commit 10_000

README.md              -> hash1 # repeated N times
LICENSE                -> hash2 # repeated N times
images/image0.jpg      -> hash3 # repeated N times
images/image1.jpg      -> hash4 # repeated N times
images/image2.jpg      -> hash5 # repeated N times
images/image3.jpg      -> hash6 # repeated N times
images/image4.jpg      -> hash7 # repeated N times
images/image5.jpg      -> hash8 # repeated N times
...
images/image10_000.jpg -> hash10_000

Do the math once we get to a dataset of 10,000 images. Each commit duplicates 10,000+1 values. 10,000 + 10,001 + 10,002 + 10,003 = 40,006 values in our collective databases.

.oxen/history/COMMIT_A/files -> 10,000 values
.oxen/history/COMMIT_B/files -> 10,001 values
.oxen/history/COMMIT_C/files -> 10,002 values
.oxen/history/COMMIT_D/files -> 10,003 values

Total Values: 40,006

A key observation is that we are duplicating a lot of data across commits. This will be a common pattern to look for when optimizing the storage within Oxen.

Optimizations w/ Merkle Trees

Adding one file should not require you to copy the entire key-value database. We need some sort of data structure that can efficiently store the file paths and hashes without duplicating too much data across commits.

Enter Merkle Trees 🌲.

Files and directories are already organized in a tree-like fashion, so a Merkle Tree is a natural fit for storing and traversing the file structure to begin with. The Oxen Merkle Tree implementation also makes it so that when we add additional data, we only need to copy subtrees instead of copying the entire database for each commit.

What does a Merkle Tree within Oxen look like?

Commit A

At the root node of the Merkle tree is a commit hash. This is the identifier you know and love which you can reference any commit in the system by.

The root commit hash represents the content of all the data below it, including the files contained in the images/ directory as well as the files directly in the root directory (README.md, LICENSE, etc.). Additionally, all the files within a directory get sectioned off into VNodes. We will return to the importance of VNodes in a bit.

At each level of the tree we see the contents of all the files hashed within that directory, and bucketed into VNodes.

Adding a File

To see what happens when we add a new file to our repository, let's revisit our previous example of adding images to the images/ directory. Say we have 8 images in our images/ directory and we want to add a new image (9.jpg).

The first thing we have to do is find which VNode bucket it falls into (more on this later). Then we can recompute the hash of this subtree, and recursively update the hashes above it until we get to the root node.

In this case we make four total updates to the tree, highlighted in green.

  1. Add the contents of the new image to our .oxen/versions/ directory
  2. Find the VNode it belongs to, and deep copy it to a new VNode with a new hash
  3. Update the VNode hash of the images/ parent directory
  4. Update the root node hash

Commit B

The Merkle Tree nodes are all global to the repository, and can get re-used and shared between commits. Instead of copying the entire database to our new commit, we only copy the subtrees that changed. On adding a file, we only need to update a single VNode and copy its contents. This is a much faster operation than copying every file within our databases.

For another example, let's see what happens when we update the README.md file.

Commit C

This time, we only need to update the VNode that contains the README.md file and its parent in the root node.

Why use VNodes?

One of the goals of Oxen is to be able to scale to directories with an arbitrary number of files. Imagine for a second that you have a directory of 100k or 1 million images. Storing all of these values directly at the directory level node would be inefficient. Every time you commit a single image to the directory, you would need to copy all the pointers and recompute the hash for the entire directory.

For example imagine we had no VNodes at the directory level.

No VNode

If we want to add a single file, we would have to copy all the pointers and recompute the hash for the entire directory.

Add File

VNodes add an intermediate bucket we can add files to so that we only have to copy a subset of pointers. Which VNode a file belongs to is computed from the hash of the file path itself. This way files get evenly distributed into buckets within the tree.

With VNode

You'll notice two parts to the VNode. The first is the first two letters (AB) of the hash of the file path, and the second is the hash of the VNode contents (#DFEGA72). To add an image, we now only need to find the bucket (based on the file's path), compute its new hash, and make a copy of the items in the VNode database under its new hash.

With VNode Add File

To drive this home, let's go back to our example directory with 10,000 images and the naive implementation from before. Remember, 4 additions to the images directory after it contained 10,000 files resulted in 40,006 values in our database. Say our bucket size for VNodes is 10,000/256 ~= 40. This means on average we are copying 40 values with each commit. This will result in 10,160 total values in our DB instead of 40,006.
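
A sketch of the bucketing math (illustrative only; Oxen uses its own hashing, but the idea is the same):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

// e.g. 10,000 files split across 256 VNodes puts ~40 files in each bucket
fn vnode_bucket(path: &Path, num_vnodes: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish() % num_vnodes
}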

Printing the Merkle Tree

To bring these concepts to life, let's create a repo of many images and use the oxen tree command to print the Merkle Tree. In the Oxen-AI/Oxen repo we have a script that will create a directory with an arbitrary number of images and add them to the images/ directory.

When developing and testing Oxen, this script is handy to generate synthetic datasets to push the performance of the system. For example you could create a dataset of 1,000,000 images and see how long it takes to add and commit the changes.

# WARNING: This will create a directory with 1,000,000 images and take a while to run
$ python benchmark/generate_image_repo_parallel.py --output_dir ~/Data/1m_images --num_images 1000000 --num_dirs 2 --image_size 64 64

For this example we will stick to a smaller dataset of 20 images. It will be easier to visualize the Merkle Tree.

# 😌 This will create a much smaller dataset of 20 images
$ python benchmark/generate_image_repo_parallel.py --output_dir ~/Data/20_images --num_images 20 --num_dirs 1 --image_size 64 64

After the dataset is created, go ahead and initialize an oxen repository.

$ cd ~/Data/20_images
$ oxen init

Before we add and commit the files, we are going to make a quick tweak to the configuration to use a smaller VNode bucket size. The default size is 10,000, but we are going to set it to 6 to make it easier to see the tree updates in this toy example.

Edit the .oxen/config.toml file to set the vnode_size to 6.

$ cat .oxen/config.toml
remotes = []
min_version = "0.19.0"
vnode_size = 6

Now add and commit the files.

$ oxen add .
$ oxen commit -m "adding all data"

Then we can use the oxen tree command to print the entire Merkle Tree.

$ oxen tree

[Commit] bb2e7778ddc8f40788d4d34993955bfd "adding data" -> Bessie ox@oxen.ai parent_ids ""
  [Dir] 7a892f11ae586978f3b170182599cc5e "/" (15.8 MB) (22 files) (commit bb2e7778ddc8f40788d4d34993955bfd)  (1 children)
    [VNode] 801de12b06a74b5a2d0b978af067e32b  (3 children)
      [File] dcd78180c335f3afed68656b6b12c248 "README.md" (98 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
      [Dir] 73ae655940b4873cf6b1557c3806d65c "images/" (15.8 MB) (20 files) (commit bb2e7778ddc8f40788d4d34993955bfd)  (1 children)
        [VNode] 51d73f367fcbc4f11228ff2e56fba5d3  (1 children)
          [Dir] 79c6625dcad70d16be56aca9426442ee "split_0/" (15.8 MB) (20 files) (commit bb2e7778ddc8f40788d4d34993955bfd)  (4 children)
            [VNode] 3543cada59e52d3a391603661b6f9721  (6 children)
              [File] 6d11185298ec825208a1f3fce23b9d6c "noise_image_14.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 56e2d45af9958af0680fceb3ab00d18c "noise_image_17.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 82c2ec408abf2defc2dd5289b29a1e80 "noise_image_19.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1d952254a3be50f3d2d70a1398aee524 "noise_image_4.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] cb118df353e0814d42472818405b9384 "noise_image_7.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 2ccc49262256aae9802c581d90736b34 "noise_image_9.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] e5529567f3246881d58335ee5102281c  (4 children)
              [File] 38593fea717a1d4c2e771674ebc9ca81 "noise_image_0.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] d0871bf54a0c99f2336da66ab20b6785 "noise_image_1.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] e406e9a852da65dad3f012dd86a98919 "noise_image_10.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 48dca231fd96fd847935e7e6623b32d9 "noise_image_11.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] c9e4baf0332cf23a803a8870e295e0b5  (7 children)
              [File] abf629ca6ed9414e8a4f884d2f98dd2a "noise_image_13.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4598fdac03e5aaa52aa5cd1c51231a2 "noise_image_15.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 3e0989396c1d6a7f38164faf96c4662e "noise_image_18.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1506f02a6a10123af68084200b67583a "noise_image_2.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1d72a37baeef7b0c02b4ce7482e4592 "noise_image_5.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 26e656316e9e292f8b27fcb49654237a "noise_image_6.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4f3611ebc27072c6359ac05f4e6c98e "noise_image_8.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] 38c8edc066ae7d1ac36a92223a9c39ee  (3 children)
              [File] 6dc552a73f1beb27356903f84c4a7b33 "noise_image_12.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 9335d768de6058fbe3da1a7857c400e7 "noise_image_16.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4855043d603912f6141f5f148851146f "noise_image_3.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
      [File] fe0b2b028ff23fd726b14f6c7694ffd8 "images.csv" (764 B) (commit bb2e7778ddc8f40788d4d34993955bfd)

The first thing you'll notice is that each VNode is not guaranteed to have exactly 6 children. This is because we first compute the number of VNodes from the directory's child count and the VNode size (for split_0/ that is ceil(20 / 6) = 4 VNodes), and then fill them with the files. The hashing algorithm then distributes the files across the VNodes via file hash % number of VNodes, so some VNodes end up with more children than others. On average, a good hashing algorithm will distribute the files evenly across the VNodes.
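
Here is a minimal sketch of that bucketing logic in Rust. The function names are illustrative, and we assume the hash is computed over the file path; the actual oxen internals may differ.

// Minimal sketch of VNode bucketing; names are illustrative, not the
// actual oxen internals. Assumes the file's path hash picks the bucket.
use xxhash_rust::xxh3::xxh3_64;

/// Which VNode bucket a file lands in, given the directory's child count
/// and the configured vnode_size.
fn vnode_bucket(file_path: &str, num_children: u64, vnode_size: u64) -> u64 {
    // Number of VNodes = ceil(num_children / vnode_size)
    let num_vnodes = (num_children + vnode_size - 1) / vnode_size;
    // Distribute by hash modulo the number of VNodes
    xxh3_64(file_path.as_bytes()) % num_vnodes
}

fn main() {
    // Our toy example: 20 files in split_0/ with vnode_size = 6 -> 4 VNodes
    for i in 0..20 {
        let path = format!("images/split_0/noise_image_{}.png", i);
        println!("{} -> vnode {}", path, vnode_bucket(&path, 20, 6));
    }
}

With 20 children and a bucket size of 6 you get 4 VNodes, which matches the 4 children of split_0/ in the tree above.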

Remember - VNodes exist in order to help us update smaller subtrees when we add, remove or change a file. To see this in action, let's add a new image to the images/split_0 directory by converting a png to jpeg.

$ ffmpeg -i images/split_0/noise_image_11.png images/split_0/noise_image_11.jpg
$ oxen add images/split_0/noise_image_11.jpg
$ oxen commit -m "adding images/split_0/noise_image_11.jpg"

Then print the tree again, and try to find where the new image was added.

$ oxen tree

[Commit] 1382a90346dce99134fbb8a7359d81df "adding images/split_0/noise_image_11.jpg" -> Bessie ox@oxen.ai parent_ids "bb2e7778ddc8f40788d4d34993955bfd"
  [Dir] 73aa45daae11e300cb452b072f22bc3a "/" (16.0 MB) (23 files) (commit 1382a90346dce99134fbb8a7359d81df)  (1 children)
    [VNode] 57e2282b08cfdb57c66d5c2c0341fc2d  (3 children)
      [File] dcd78180c335f3afed68656b6b12c248 "README.md" (98 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
      [Dir] 7363444d86e0cb47c4247bd7f05c13f3 "images/" (16.0 MB) (21 files) (commit 1382a90346dce99134fbb8a7359d81df)  (1 children)
        [VNode] 8344a2ef7c01c891fb33a357efabc2b6  (1 children)
          [Dir] b5d819fd018381b3026848bed854830b "split_0/" (16.0 MB) (21 files) (commit 1382a90346dce99134fbb8a7359d81df)  (4 children)
            [VNode] 3543cada59e52d3a391603661b6f9721  (6 children)
              [File] 6d11185298ec825208a1f3fce23b9d6c "noise_image_14.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 56e2d45af9958af0680fceb3ab00d18c "noise_image_17.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 82c2ec408abf2defc2dd5289b29a1e80 "noise_image_19.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1d952254a3be50f3d2d70a1398aee524 "noise_image_4.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] cb118df353e0814d42472818405b9384 "noise_image_7.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 2ccc49262256aae9802c581d90736b34 "noise_image_9.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] e5529567f3246881d58335ee5102281c  (4 children)
              [File] 38593fea717a1d4c2e771674ebc9ca81 "noise_image_0.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] d0871bf54a0c99f2336da66ab20b6785 "noise_image_1.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] e406e9a852da65dad3f012dd86a98919 "noise_image_10.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 48dca231fd96fd847935e7e6623b32d9 "noise_image_11.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] c9e4baf0332cf23a803a8870e295e0b5  (7 children)
              [File] abf629ca6ed9414e8a4f884d2f98dd2a "noise_image_13.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4598fdac03e5aaa52aa5cd1c51231a2 "noise_image_15.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 3e0989396c1d6a7f38164faf96c4662e "noise_image_18.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1506f02a6a10123af68084200b67583a "noise_image_2.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 1d72a37baeef7b0c02b4ce7482e4592 "noise_image_5.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 26e656316e9e292f8b27fcb49654237a "noise_image_6.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4f3611ebc27072c6359ac05f4e6c98e "noise_image_8.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
            [VNode] 88e4003b65f2f3f16c16a91230593746  (4 children)
              [File] 6550c5d7deaa62dbb78d8effbbda375f "noise_image_11.jpg" (285.4 KB) (commit 1382a90346dce99134fbb8a7359d81df)
              [File] 6dc552a73f1beb27356903f84c4a7b33 "noise_image_12.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 9335d768de6058fbe3da1a7857c400e7 "noise_image_16.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
              [File] 4855043d603912f6141f5f148851146f "noise_image_3.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
      [File] fe0b2b028ff23fd726b14f6c7694ffd8 "images.csv" (764 B) (commit bb2e7778ddc8f40788d4d34993955bfd)

You will see that VNode 88e4003b65f2f3f16c16a91230593746 is the one that contains the new image. The other VNodes have not changed, though the hashes of the parent directories were recomputed as the change propagated up the tree. As a result, we only had to copy a single VNode with 3 children (now 4) instead of rewriting all 20 entries.

When you get into larger directories, there is a trade off between the number of VNodes and the size of each VNode. The fewer the VNodes, the faster it is to read the full directory listing. The smaller the VNode, the faster it is to write, and the less data we copy when we add, remove or change a file. In practice we find 10k entries per VNode is a good compromise in terms of storage and performance.
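
To get a feel for the numbers, here is a tiny illustrative cost model (not the actual oxen code): reading a full directory listing touches every VNode, while changing one file copies roughly one VNode's worth of entries.

// Illustrative cost model for the VNode trade-off; not the actual oxen code.
fn main() {
    let n: u64 = 10_000; // entries in the directory
    for vnode_size in [5u64, 100, 10_000] {
        // Reading the directory touches every VNode
        let num_vnodes = (n + vnode_size - 1) / vnode_size;
        // Changing one file copies roughly one VNode's worth of entries
        println!(
            "vnode_size {:>6}: {:>5} vnodes to read, ~{:>6} entries copied per change",
            vnode_size, num_vnodes, vnode_size
        );
    }
}

At vnode_size 5 you pay with 2,000 VNode reads per directory scan; at 10,000 you pay by copying the entire listing on every change.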

File Chunk Deduplication

Not yet implemented, but scoped out in the File Chunk Deduplication section below.

Benefits of the Merkle Tree

If you are new to Merkle Trees, hopefully this gives you a good intuition for how they work in practice. There are a few nice properties of a Merkle Tree as our core data structure.

  1. When we add, remove or change a file, we only need to update the subtree that contains that file. This means the storage for each new commit grows logarithmically with the number of files in the repository instead of linearly.

  2. To recompute the root hash of a commit, we only need to hash the file paths and the hashes of the files that have changed. This means we can efficiently verify the integrity of the data by recomputing subtrees instead of the whole tree.

  3. We can use it to identify the small chunks of data that changed, so when syncing repositories we only transfer those chunks over the network.

  4. Since each subtree is also a merkle tree, this allows us to clone small subtrees, make changes, and push them back up to the parent. This can be powerful when, for example, you only want to update the README but have a directory of images you are not planning on changing.

Commands

The core commands in Oxen map to git so that there is an easy learning curve when getting started.

oxen init
oxen add images/
oxen status
oxen commit -m "adding images"
oxen push origin main

Join us as we break down each command step by step, as if you were building it from scratch.

init

oxen init

TODO: add details on how the command works

oxen add

After initializing a repository, you can add files to it using the oxen add command.

oxen add <file>

This is the workhorse of oxen that does most of the compute. Under the hood, oxen add does a few operations (see the sketch after this list):

  • Hashes the file(s) and directory structure
  • Computes any additional metadata about the file
    • Sizes
    • File Types
    • Schemas (for JSON, CSV, etc.)
  • Copies a version of the file into the content addressable versions store
  • Adds a record to the staged index
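
To make those steps concrete, here is a heavily simplified sketch of the pipeline. The storage layout, staged index format, and helper names are assumptions for illustration, not the real oxen internals.

// Heavily simplified sketch of the `oxen add` pipeline. Paths, layout,
// and the staged index format are illustrative assumptions.
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;
use xxhash_rust::xxh3::xxh3_128;

fn add_file(repo_root: &Path, path: &Path) -> std::io::Result<()> {
    // 1. Hash the file contents
    let bytes = fs::read(path)?;
    let hash = format!("{:x}", xxh3_128(&bytes));

    // 2. Compute additional metadata (size here; file type and schema
    //    detection for tabular files are omitted)
    let num_bytes = bytes.len() as u64;

    // 3. Copy a version of the file into the content addressable store
    let version_dir = repo_root.join(".oxen").join("versions").join(&hash);
    fs::create_dir_all(&version_dir)?;
    fs::write(version_dir.join("data"), &bytes)?;

    // 4. Add a record to the staged index (the real implementation uses
    //    an on-disk db, not a flat log file)
    let mut staged = OpenOptions::new()
        .create(true)
        .append(true)
        .open(repo_root.join(".oxen").join("staged"))?;
    writeln!(staged, "{}\t{}\t{}", path.display(), hash, num_bytes)?;
    Ok(())
}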

Under the hood

In order to see what this looks like on disk, let's create a few files and directories, then add them to a fresh repository.

mkdir my-project
cd my-project
oxen init

TODO:

clone

Data Types

There are core data types that Oxen detects and can do special processing with. This helps increase visibility into your data, and makes it extensible to your use case.

  • Text
  • Image
  • Audio
  • Video
  • Tabular
  • Blob

Optimizations

There are many optimizations that make Oxen go brrr. From the core merkle tree structure, to the hashing protocol, to networking, to interacting with remote datasets. Oxen is meant to make it feel like you have terabytes of data at your fingertips, no matter what machine you are on.

TLDRs

We often get asked: "What makes Oxen different from other VCS?"

Without diving into the nitty-gritty details, here are some highlights. If you want to go deeper, don't worry, we also dive deep into the implementation details of each throughout the book.

Merkle Tree

  • Downloading Sub Trees
  • Per Folder Sub Trees
  • Block Level Dedup
  • Only download latest
    • When you get to TB scale data, you do not want to have to pull down data from previous commits to compute the current tree.
  • Push / Pull Bottle Neck
  • Objects
    • Trees
    • VNodes
    • Blobs
    • Schemas

Hashing

  • xxHash
  • pure hashing throughput
  • non-cryptographic hashing fn

Data Frames and Schemas are First Class Citizens

Other VCS systems are optimized for text files and code. In the case of datasets, we often deal with data frames, which have additional properties such as a schema that we want to track.

Native File Formats

Take advantage of existing file formats such as arrow, parquet, duckdb, etc. Unlike git or other VCS that try to be smart with compression, we can leverage the existing file formats that are already highly optimized for the specific use case.

For example, Apache Arrow is a memory mapped format that makes random access to rows very fast. If we were to compress this data and have to reconstruct it, we would lose the benefits of the memory mapped file.

This is a design tradeoff that is made throughout oxen which makes it less efficient in terms of storage on disk, but easier to integrate with.

Visibility into data is a key design goal of Oxen. Visibility also means speed: the fewer assumptions we make about how data is stored, the more we can leverage and extend existing file formats.
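
To illustrate the point, here is a small sketch of random access into a memory mapped file using the memmap2 crate (the crate choice and file name are assumptions, and the Arrow-specific layout is omitted):

// Sketch: random access into a large file via memory mapping, without
// reading or decompressing the whole thing. Uses the memmap2 crate.
use std::fs::File;
use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("data.arrow")?;
    // Safety: the file must not be truncated while it is mapped
    let mmap = unsafe { Mmap::map(&file)? };
    // Jump straight to an arbitrary offset; the OS pages in only the
    // bytes we touch. A compressed file would force us to decompress
    // everything up front before we could do this.
    let offset = mmap.len() / 2;
    println!("byte at midpoint: {}", mmap[offset]);
    Ok(())
}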

Concurrency

  • Fearless concurrency
  • Hashing data
  • Moooooving data over the network
  • Moooooving data on disk

Networking

  • Smart Chunking

Remote Workspaces

Don't download the entire dataset just to contribute.

  • oxen workspace add
  • oxen workspace commit
  • oxen workspace df
  • oxen workspace ls

Compression (Coming Soon)

Merkle Tree

Hashing

One of the optimizations within Oxen is to use the xxHash algorithm to hash file contents. xxHash is a non-cryptographic hash designed for raw throughput and memory efficiency, which makes it very fast to compute.

Compared to SHA or MD5 hashes, which hash data at under 1 GB/s, xxHash can hash data at around 30 GB/s. This is a significant improvement for large files, and speeds up the process of adding and committing files to Oxen.

Inspect File Hashes

You can quickly inspect the xxHash of any file using the oxen info command.

oxen info -v file.txt
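
Under the hood this boils down to hashing the file bytes. Here is a minimal sketch, assuming the xxhash-rust crate (with its xxh3 feature) stands in for whatever oxen uses internally:

// Minimal sketch: hash a file's contents with xxHash (xxh3, 128-bit).
// Assumes the xxhash-rust crate; the real oxen code may differ.
use std::fs;
use xxhash_rust::xxh3::xxh3_128;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("file.txt")?;
    let hash = xxh3_128(&bytes);
    // Print as hex, similar to what `oxen info -v` displays
    println!("{:x}", hash);
    Ok(())
}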

Compression

TODO: Block level deduplication can be turned on in order to shrink older datasets. It comes at a reconstruction cost of the data, so can be turned on and off depending on your use case.

File Chunk Deduplication

The Merkle Tree optimizations we've talked about so far make adding and committing snapshots of the directory structure to Oxen snappy. The remainder of the storage cost is at the individual leaf nodes of the tree. We want Oxen to be efficient at storing many files as well as efficient at storing large files. In our context these large files may be data frames of parquet, arrow, jsonl, csv, etc.

To visualize how much storage space individual nodes take, let's look at a large CSV in the .oxen/versions directory. The pointers and hashes within the tree itself are relatively small, but the file contents themselves are large.

Large CSV

Remember, so far each time you make a commit, we make an entire copy of the file contents and put it into the .oxen/versions directory under its hash.

Large CSV

This can be a problem if we keep updating the same file over and over again. Five small changes to a 1GB file will result in 5GB of storage.

Think back to the key insight from earlier about duplicated data between versions of our tree. The same thing applies to large files. For example, what if you are editing and committing a CSV file one row at a time?

Prompts CSV

This results in a lot of duplicated data. In fact rows 0-1,000,000 are all the same between Version A and Version B.

To combat this, Oxen uses a technique called file chunk deduplication. Instead of duplicating the entire raw file in the .oxen/versions directory, we can break the file into "chunks" then store these chunks in a content addressable manner.

Chunks

Then if you update any rows in the file, we create a new chunk for that section and update the leaf nodes of the tree.

Chunks

So the original tree looks like this.

Chunks

Then when the row is added, all we have to do is update the final chunk and propagate the changes up the tree.

Update Chunk

What's great about this is that we now also only need to sync the chunks that have changed over the network, not the entire file. Chunks themselves are ~16kb in size, so updating a single value in a 1GB file will only sync this small portion of the file.

When swapping between versions we simply reconstruct the chunks associated with each file and write them to disk. This means we need a small db that stores the mapping from chunk_idx -> chunk_hash. We can now store this in the .oxen/versions directory instead of the file contents.

Chunk Mapping
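
Since this feature is still only scoped out, here is a rough sketch of what fixed-size chunking could look like. The chunk size, storage layout, and function names are all assumptions.

// Rough sketch of fixed-size file chunk deduplication. The chunk size,
// storage layout, and names are assumptions; this is not implemented
// in oxen yet.
use std::fs;
use std::path::Path;
use xxhash_rust::xxh3::xxh3_128;

const CHUNK_SIZE: usize = 16 * 1024; // ~16kb chunks, as described above

// Split a file into chunks, store each chunk content-addressably, and
// return the chunk hashes in order (the index in the Vec is chunk_idx).
fn chunk_file(path: &Path, store_dir: &Path) -> std::io::Result<Vec<u128>> {
    let bytes = fs::read(path)?;
    let mut hashes = Vec::new();
    for chunk in bytes.chunks(CHUNK_SIZE) {
        let hash = xxh3_128(chunk);
        let chunk_path = store_dir.join(format!("{:x}", hash));
        // Only write the chunk if we have not stored this content before
        if !chunk_path.exists() {
            fs::write(&chunk_path, chunk)?;
        }
        hashes.push(hash);
    }
    Ok(hashes)
}

Reconstructing a version is then just looking up each chunk hash in index order and concatenating the chunk contents back to disk.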

Storage

TODO: Support different storage backends

  • Local SSDs
  • NFS
  • S3
  • fsspec

Why not write an extension for Git?

We often get asked "Why not use Git-LFS?" and why we wrote an entire Version Control System (VCS) from scratch.

The short answer is developer experience. Owning the tool means owning the end to end developer experience. The plugins and extensions to git that we have tried in the past feel like they were shoehorned in and clunky.

There are certain properties of git that make it great for versioning code, but fall short when it comes to data. Git is really good at huge histories of small text files, like code bases. Where git does not shine is binary files. Many datasets consist of binary files such as images, video, audio, parquet files, etc., which are not efficiently stored in git.

There are extensions to git such as git-LFS that work okay when you have one or two large assets that are complementary to your code, but they are still extremely slow when it comes to versioning large sets of large binary files.

This section dives into some of the reasons why we did not want to use Git-LFS and wrote an entire VCS from scratch.

What git is missing (could be a good blog post title)

lorem ipsum...

LFS is More Complexity

The first reason is purely the mental model users have to keep in their head.

To quote Gregory Szorc:

"LFS is more complex for Git end users.

Git users have to install, configure, and sometimes know about the existence of Git LFS. Version control should just work. Large file handling should just work. End-users shouldn't have to care that large files are handled slightly differently from small files."

Push / Pull Bottleneck

How do I atomically add one file without pulling everything? Even further...how do I modify one row or column?

Ex) you update test.csv, I update README.md, we can both push without worrying about pulling first.

Ex) Updating data frame directly with duckdb

Do we call this a VCS-DB or some new term? Allows for higher throughput writes while allowing you to snapshot and diff and query the history.

Network Protocols

Git sequentially reads and writes objects via packfiles over the network to create your local and remote copies of the repository. This is inefficient for large files, resulting in slow uploads / downloads of data that is not text.

SHA-256 is Slow

In order to tell if a file has changed, version control systems use a hashing function over the data. By default git uses SHA-1 (SHA-256 is available as an opt-in object format), both relatively slow cryptographic hashing algorithms. By contrast Oxen uses a faster hashing algorithm called xxHash. This results in hashing speeds of 31 GB/s for xxHash vs 0.8 GB/s for SHA-256.

This difference in hashing speed results in a real improvement in developer experience on larger datasets. At 0.8 GB/s, hashing a 100GB dataset takes over two minutes; at 31 GB/s it takes a few seconds.

If you want to see this in action, simply run git add on a large directory of data vs oxen add and see the difference in time.

Git Status Spews All Files

Purely from a developer experience standpoint, this is not great. What if you add 100k images in a single commit? It's not practical to have git status show you all 100k files that were added.

With datasets we are dealing more with distributions of data, not individual data points.

Downloading Full History

Git by default downloads the entire history of the repository on clone. When it comes to datasets, I may only want to download the latest version of a file, not the entire history.

Oxen gives you the flexibility to download just what you need at the time of training, inference, or testing.

Removing Objects

Removing objects from git is a bit complex because of the references that are made throughout the packfiles. Indices have to be recomputed and the history of the repository has to be rewritten if you want to remove a single file.

As the Git SCM book puts it (https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery):

There are a lot of great things about Git, but one feature that can cause issues is the fact that a git clone downloads the entire history of the project, including every version of every file. This is fine if the whole thing is source code, because Git is highly optimized to compress that data efficiently. However, if someone at any point in the history of your project added a single huge file, every clone for all time will be forced to download that large file, even if it was removed from the project in the very next commit. Because it’s reachable from the history, it will always be there.

Packfiles

One of the ways that git saves space is by using delta encoding to store only the differences between files. It does this through the use of packfiles. Objects are not stored directly in the objects directory; rather, they are packed together in a packfile to make data transfer and compression easier.

Within a packfile an object can be stored in multiple ways: either as a full zlib-compressed object, or as a delta against another object.

This is great for codebases, but not optimal for binary files.

Sources

Git SCM book

  • https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery

Avoid Git-LFS if Possible

  • https://news.ycombinator.com/item?id=27134972
  • https://gregoryszorc.com/blog/2021/05/12/why-you-shouldn%27t-use-git-lfs/

Dev.to git-internals

  • https://dev.to/calebsander/git-internals-part-2-packfiles-1jg8

Pack files

Git’s database internals I: packed object store

  • https://github.blog/2022-08-29-gits-database-internals-i-packed-object-store/#delta-compression