🐂 🌾 Oxen.ai
Welcome to the Herd! This is a whirlwind tour of the Oxen.ai codebase. This is an evolving artifact meant to document the tool and codebase.
Each section dives into a different part of the code base as well as the file formats on disk. It is a resource for engineers who want to contribute or extend the tooling, or simply want to learn the inner workings.
What is Oxen?
Oxen at its core is a blazing fast data version control tool written in Rust, optimized for large machine learning datasets. These datasets could consist of many small files (think an images/ folder for computer vision tasks), a few large files (a collection of time series datasets as CSVs), or many large files (an LLM pre-training dataset of parquet files).
In the Git Book, "version control" is defined as "a system that records changes to a file or set of files over time so that you can recall specific versions later." As software engineers we typically use tools such as git to version our source code. This allows us to keep every version of a file so that we can revert back to a previous state and compare changes over time. While git is great for versioning smaller assets such as the files in a code base, it struggles to version large datasets.
Why build Oxen?
As machine learning engineers, we were frustrated with the speed of managing and iterating on datasets that traditionally would not fit well into git. There are extensions to git such as git-lfs, but they are like fitting a square peg into a round hole and come with their own issues.
Data versions should be easy to interact with locally, fast to sync to a remote, and seamless to contribute to. It should feel like you have terabytes accessible at your fingertips, slicing and downloading subsets locally when you need them.
Why write this book?
"What I cannot create, I do not understand". - Richard Feynman
When it comes to open source contribution and scaling up a software project, the same is true. This book is for developers to get an understanding of the internals, design decisions, and places for improvement in the Oxen.ai code base. Open source is meant to not only be open, but understandable.
The concepts listed in this book are not perfect, but are meant to be guide posts for the current implementation. Along the way we will point out areas for improvement. If you get to a section and think "Why do we do this? There HAS to be a better way." you are probably right! Check out improvements for some ideas we already have, and feel free to add your own.
Why is Oxen fast?
This is always one of the first questions we get. The simple answer is that there are many optimizations that make Oxen fast. Many are just fundamental computer science concepts, but stacked together they make for a nice developer experience when iterating on datasets.
Why the name Oxen?
"Oxen" π comes from the fact that the tooling will plow, maintain, and version your data like a good farmer tends to their fields πΎ. During the agricultural revolution the Ox allowed humans to automate the process of plowing fields so they could specialize in higher level tasks. Data is the lifeblood of ML/AI. Let Oxen take care of the grunt work of your infrastructure so you can focus on the higher-level problems that matter to your product.
Where to start?
First you will want to install Oxen. Once you have the tool up and running, we can dive into the implementation details. If you already have the tool up and running, feel free to skip directly to learning about domains or how to add a command.
Like any project, let's start by learning how to build and run the codebase.
🛠️ Development
There are a few ways of getting up and running with Oxen. The most straightforward way is to install the latest pre-built version of Oxen from the open source repository.
If you are actually going to be writing code, it is important to set up your development environment. This section has resources on how to install Oxen from source, how to build and run Oxen, add your first command, write a unit test, and release a new version of Oxen.
🧑‍💻 Installation
How to install the Oxen client, server, or python package. If you are a developer, you will want to build from source. If you are flying by and learning Oxen you can install the python package or the command line tool from the GitHub releases page.
💻 Command Line Tools
The Oxen client can be installed via homebrew or by downloading the relevant binaries for Linux or Windows.
You can find the source code for the client here and can also build from source for your platform. The continuous integration pipeline builds binaries for each release in this repository.
Mac
brew tap Oxen-AI/oxen
brew install oxen
Ubuntu Latest
Check the GitHub releases page for the latest version of the client and server.
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-ubuntu-latest.deb
sudo dpkg -i oxen-ubuntu-latest.deb
Ubuntu 20.04
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-ubuntu-20.04.deb
sudo dpkg -i oxen-ubuntu-20.04.deb
Windows
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen.exe
Other Linux
Binaries are coming for other Linux distributions in the future. In the meantime, you can build from source.
🌎 Server Install
The Oxen server binary can be deployed wherever you want to store and back up your data. It is an HTTP server that the client communicates with to enable collaboration.
Mac
brew tap Oxen-AI/oxen-server
brew install oxen-server
Docker
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-docker.tar
docker load < oxen-server-docker.tar
docker run -d -v /var/oxen/data:/var/oxen/data -p 80:3001 oxen/oxen-server:latest
Ubuntu Latest
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-ubuntu-latest.deb
sudo dpkg -i oxen-server-ubuntu-latest.deb
Ubuntu 20.04
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server-ubuntu-20.04.deb
sudo dpkg -i oxen-server-ubuntu-20.04.deb
Windows
wget https://github.com/Oxen-AI/Oxen/releases/latest/download/oxen-server.exe
To get up and running using the client and server, you can follow the getting started docs.
🐍 Python Package
$ pip install oxenai
Note that this will only install the Python library and not the command line tool.
Installing Oxen through Jupyter Notebooks or Google Colab
Create and run this cell:
!pip install oxenai
🔨 Build & Run
Install Dependencies
Oxen is written purely in Rust 🦀. You should install the Rust toolchain with rustup: https://www.rust-lang.org/tools/install.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
If you are a developer and want to learn more about adding code or the overall architecture, start here. Otherwise, here is a quick start to make sure everything is working.
Building from Source
To build the command line tool from source, you can follow these steps.
- Install rustup via the instructions at https://rustup.rs/
- Clone the repository https://github.com/Oxen-AI/Oxen

git clone git@github.com:Oxen-AI/Oxen.git

- cd into the cloned repository:

cd Oxen

- Run this command (the release flag is recommended but not necessary):

cargo build --release

- After the build has finished, the oxen binary will be in Oxen/target/release (or, if you did not use the --release flag, Oxen/target/debug). Now, to make it usable from a terminal window, you can either create a symlink or add it to your PATH.
- To add oxen to your PATH, add this line to your .bashrc (or equivalent, e.g. .zshrc):

export PATH="$PATH:/path/to/Oxen/target/release"

- Alternatively, to create a symlink, run the following command:

sudo ln -s /path/to/Oxen/target/release/oxen /usr/local/bin/oxen

Note that if you did not use the --release flag when building Oxen, you will have to change the path.
Library, CLI, Server
There are three components that are built during cargo build, and they are separated into three directories within the src folder.
ls src
cli/
lib/
server/
The library is all the shared code between the CLI and Server. This contains the majority of classes and business logic. The CLI and Server are meant to be thin wrappers over the core oxen library functionality.
The library is also used for the Python Client which should also remain a thin wrapper.
Speed up the build process
You can use the mold linker to speed up builds (the commercial macOS version is called sold). Assuming you have purchased a license, you can use the following instructions to install sold and configure cargo to use it for building Oxen:
git clone https://github.com/bluewhalesystems/sold.git
mkdir sold/build
cd sold/build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=c++ ..
cmake --build . -j $(nproc)
sudo cmake --install .
Then create .cargo/config.toml in your Oxen repo root with the following content:
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=/usr/local/bin/ld64.mold"]
[target.x86_64-apple-darwin]
rustflags = ["-C", "link-arg=-fuse-ld=/usr/local/bin/ld64.mold"]
For macOS with Apple Silicon, you can use the lld linker.
brew install llvm
Then create .cargo/config.toml in your Oxen repo root with the following:
[target.aarch64-apple-darwin]
rustflags = [ "-C", "link-arg=-fuse-ld=/opt/homebrew/opt/llvm/bin/ld64.lld", ]
Run Oxen-Server
Generate a config file and token to give a user access to the server:
./target/debug/oxen-server add-user --email ox@oxen.ai --name Ox --output user_config.toml
Copy the config to the default locations
mkdir ~/.oxen
mv user_config.toml ~/.oxen/user_config.toml
cp ~/.oxen/user_config.toml data/test/config/user_config.toml
Set where you want the data to be synced. The default sync directory is ./data/. To change it, set the SYNC_DIR environment variable to a path:
export SYNC_DIR=/path/to/sync/dir
Run the server
./target/debug/oxen-server start
To run the server with live reload, first install cargo-watch
cargo install cargo-watch
Then run the server like this
cargo watch -- cargo run --bin oxen-server start
CLI Commands
Now feel free to try out some CLI commands and see the tool in action!
oxen init .
oxen status
oxen add images/
oxen status
oxen commit -m "added images"
oxen create-remote --name ox/wikipedia --host 0.0.0.0:3001 --scheme http
oxen config --set-remote origin http://localhost:3001/ox/wikipedia
oxen push origin main
Adding a Command
The main entry point to the Command Line Interface (CLI) is through the main.rs file. This file is located in the Oxen/src/cli/src directory.
Each command is defined in its own submodule and implements the RunCmd trait.
#[async_trait]
pub trait RunCmd {
    fn name(&self) -> &str;
    fn args(&self) -> clap::Command;
    async fn run(&self, args: &clap::ArgMatches) -> Result<(), OxenError>;
}
These submodules can be found in the cmd subdirectory and are named after the command they implement. For example, if you are curious how oxen add is implemented, you would look at add.rs.
Moo' World
To show this pattern in action, let's add a new command to Oxen: a simple "Hello, World!" command named "moo", implemented in the moo.rs file.
The command prints "moo" when you run oxen moo. It also takes a --loud flag, which makes it print "MOO!" instead, as well as a -n flag, which adds extra o's to the end of the string.
$ oxen moo
moo
$ oxen moo --loud
MOO!
$ oxen moo -n 10
moooooooooo
Name The Command
The first method to implement in the trait is simply the name of the command. This is used to identify the command in the CLI and in the help menu.
impl RunCmd for MooCmd {
    fn name(&self) -> &str {
        "moo"
    }
}
Setup Args
The next step is setting up the command line arguments. We use the clap crate to handle the command line arguments, which are defined in the args method.
impl RunCmd for MooCmd {
    fn args(&self) -> Command {
        // Sets up the CLI args for the command
        Command::new(NAME)
            .about("Hello, world! 🐂")
            .arg(
                Arg::new("number")
                    .long("number")
                    .short('n')
                    .help("How long is the moo?")
                    .default_value("2")
                    .action(clap::ArgAction::Set),
            )
            .arg(
                Arg::new("loud")
                    .long("loud")
                    .short('l')
                    .help("Make the MOO louder.")
                    .action(clap::ArgAction::SetTrue),
            )
    }
}
Parse Args and Run Command
Finally we need to implement the run method, which is called with the parsed command line arguments when the command is run.
impl RunCmd for MooCmd {
    async fn run(&self, args: &clap::ArgMatches) -> Result<(), OxenError> {
        // Parse args
        let n = args
            .get_one::<String>("number")
            .expect("Must supply number")
            .parse::<usize>()
            .expect("number must be a valid integer.");
        let loud = args.get_flag("loud");
        if loud {
            // Print the moo loudly with -n number of o's
            println!("M{}!", "O".repeat(n));
        } else {
            // Print the moo with -n number of o's
            println!("m{}", "o".repeat(n));
        }
        Ok(())
    }
}
If a command returns an OxenError, it will be handled and printed in the main.rs file, and the CLI will return a non-zero exit code.
Add to CLI
Now that our command is implemented, we need to add it to the CLI. This is done in the main.rs file. All you need to do is add a new instance of your command to the cmds vector. The rest of the file just adds the arguments, parses them, then calls your run method.
let cmds: Vec<Box<dyn cmd::RunCmd>> = vec![
    Box::new(cmd::AddCmd),
    Box::new(cmd::MooCmd), // Your new command
];
// ... run commands
This should be all you need to get Oxen to go "MOO!". Let's build and run.
cargo build
./target/debug/oxen moo --help
You will see the help menu for your new command.
Hello, world! 🐂
Usage: oxen moo [OPTIONS]
Options:
-n, --number <number> How long is the moo? [default: 2]
-l, --loud Make the MOO louder.
-h, --help Print help
Then you can simply run your command.
./target/debug/oxen moo
You should see the output "moo"
moo
You can also make the moo louder with the --loud flag and add more o's with the -n flag.
$ ./target/debug/oxen moo --loud
MOO!
$ ./target/debug/oxen moo -n 10
moooooooooo
🎉 And there you have it!
Congrats on adding your first command to Oxen! The moo command is already implemented in the main Oxen codebase as an easter egg and an example you can follow along with.
Coding Guidelines
TODO: Add some basic rust ones
https://doc.rust-lang.org/nightly/style-guide/
fmt & clippy
Before checking in a PR, please make sure to run cargo fmt and cargo clippy. This will format your code and check for errors.
cargo fmt
cargo clippy --fix --allow-dirty
Try to avoid .clone() if possible
Pass values by reference (or move them by value) instead of cloning them unless absolutely necessary. Cloning can be expensive, especially for large structs or strings.
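To illustrate (a minimal sketch, not code from the Oxen tree), a function that only needs to read a value should borrow it rather than force the caller to clone:

// Takes ownership and forces the caller to clone if they still need the value
fn print_name_owned(name: String) {
    println!("{}", name);
}

// Borrows instead; the caller keeps ownership and no copy is made
fn print_name(name: &str) {
    println!("{}", name);
}

fn main() {
    let name = String::from("bessie");
    print_name(&name);              // no clone needed
    print_name_owned(name.clone()); // allocates a copy just to read it
}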
Use PathBuf and Path over String and str
When referencing file system paths, use Path and PathBuf over String and &str. PathBuf is a struct that represents a path and is more powerful than a raw string: it makes sure paths are cross-platform (Windows and Unix), and lets you check whether a path is a file or a directory. PathBuf also has other useful methods to get the file name, directory name, components, etc.
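For example, here is a small illustrative sketch of the accessors you get for free with Path and PathBuf:

use std::path::{Path, PathBuf};

fn main() {
    // `join` uses the correct separator on both Windows and Unix
    let path: PathBuf = Path::new("images").join("image0.jpg");

    // Structured accessors instead of manual string slicing
    println!("{:?}", path.file_name()); // Some("image0.jpg")
    println!("{:?}", path.extension()); // Some("jpg")
    println!("{:?}", path.parent());    // Some("images")
}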
Use impl AsRef where possible
As function parameters, instead of taking in a &Path, &str, PathBuf, or String, take in an impl AsRef<Path> or impl AsRef<str>. This way the consumer can decide whether to pass a borrowed or an owned value, and does not have to make sure the value is a reference.
This makes things much easier and more flexible for external consumers.
TODO: Examples of signatures and external consumers
pub fn load_path(repo: &LocalRepository, path: impl AsRef<Path>) -> Result<MerkleTreeNode>
vs
pub fn load_path(repo: &LocalRepository, path: PathBuf) -> Result<MerkleTreeNode>
vs
pub fn load_path(repo: &LocalRepository, path: &Path) -> Result<MerkleTreeNode>
Use util::fs functions over std::fs
The util::fs functions handle errors a little more gracefully and have additional functionality for reading and writing to the file system cross platform. For example std::fs::remove_file
does not tell you which file could not be removed and will give you an error like this:
Os { code: 2, kind: NotFound, message: "No such file or directory" }
util::fs::remove_file will add the file name to the error message so you can see which file could not be removed.
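A minimal sketch of what such a wrapper might look like (illustrative only, assuming an OxenError::basic_str constructor; see lib/src/util/fs.rs for the real implementation):

use std::path::Path;

// Hypothetical wrapper: attach the offending path to the error message
pub fn remove_file(path: impl AsRef<Path>) -> Result<(), OxenError> {
    let path = path.as_ref();
    std::fs::remove_file(path)
        .map_err(|err| OxenError::basic_str(format!("Could not remove file {path:?}: {err}")))
}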
Be cognizant of loading too much of the Merkle Tree
The CommitMerkleTree struct lets you load subsets of the merkle tree and skip directly to dir nodes. It also has functionality to only load one or two levels deep (avoiding deep recursion). Make sure you are only loading what you need to get the information you need to return. For example, if you just need the size of a directory, you don't need to load its children.
let load_recursive = false;
let node = CommitMerkleTree::from_path(repo, commit, path, load_recursive)?;
If you need all the files in a directory, and you don't want all the levels below it, you can specify the depth to load.
// This will load the VNodes and the FileNode/DirNode children of the VNodes
let node = CommitMerkleTree::read_depth(repo, hash, 2)?;
Testing
We are all smart software engineers, but when it comes to entering a new codebase we all want confidence that making a change doesn't have a cascading effect. It is important to make sure that turning off the (proverbial) lights in the kitchen 💡 doesn't make the roof collapse 🏠.
Luckily each command within Oxen has a well defined interface, and each command can be tested independently.
For example:
// Initialize repo
let repo = repositories::init("./test_repo")?;
// Add file
repositories::add(&repo, &"hello.txt")?;
// Commit file
repositories::commit(&repo, &"add hello.txt")?;
We chain these commands together into a sequence of integration and unit tests to make sure the end to end system works as expected.
Writing Tests
The best place to reference when looking at tests within Oxen are the lib/src/repositories modules themselves. You'll find some familiar names within the repositories:: namespace.
We follow a Domain Driven Design approach to development. The tests are located within the same module as the code they are testing. Check out all the domain objects here.
All tests for these commands are found below their respective module. Let's look at an example command and break down the different parts of the test.
#[cfg(test)]
mod tests {
    // ... include necessary modules

    #[test]
    fn test_repositories_init() -> Result<(), OxenError> {
        test::run_empty_dir_test(|repo_dir| {
            // Init repo
            let repo = repositories::init(repo_dir)?;

            // Init should create the .oxen directory
            let hidden_dir = util::fs::oxen_hidden_dir(repo_dir);
            let config_file = util::fs::config_filepath(repo_dir);
            assert!(hidden_dir.exists());
            assert!(config_file.exists());
            Ok(())
        })
    }
}
First you will notice that the tests are within a mod tests block. This is a Rust feature that allows you to group tests together within a particular module.
In order to run all the tests within a particular command module you can run:
cargo test --lib repositories::init
This will run all the tests within the repositories::init module.
Returning Errors
You will notice that all the tests return Result<(), OxenError>. This means they will catch any errors that might occur when running different commands.
The OxenError is a custom error type defined in the lib/src/error.rs file. It is a simple enum that represents an error that can occur in Oxen. When you use the ? operator on a function that returns a Result<(), OxenError>, the error propagates up and the test will fail.
Setup & Teardown
Next you will see that most tests are wrapped in a closure defined in our test.rs file.
test::run_empty_dir_test(|repo_dir| {
    // ... your test code here
    Ok(())
})
These closures take care of a lot of the boilerplate around setting up a test directory and deleting it after the test is run.
For example, run_empty_dir_test will pass a unique directory to the closure and delete it when finished. This way we can run all the isolated tests in parallel and not worry about files leaking from one test and impacting another.
There are many other helper functions you can use to setup and teardown your tests, including populating repositories with sample data, and setting up remote repositories. See the full list in the test.rs file.
Running All Tests
Make sure your server is running on the default port and host, then run:
Note: tests open up a lot of file handles, so limit the number of test threads when running everything.
You can also increase the number of open files your system allows with ulimit before running the tests:
ulimit -n 10240
cargo test -- --test-threads=$(nproc)
It can be faster (in terms of compilation and runtime) to run a specific test. To run a specific library test:
cargo test --lib test_get_metadata_text_readme
To run a specific test with all debug output:
env RUST_LOG=debug,liboxen=debug,integration_test=debug cargo test -- --nocapture test_command_push_clone_pull_push
To set a different test host, you can set the OXEN_TEST_HOST environment variable:
env OXEN_TEST_HOST=0.0.0.0:4000 cargo test
🚀 Releasing Oxen Into The Wild 🌾
Right now this is mainly for me to document how I release new versions of the open source Oxen.ai binaries.
If anyone wants to help with the release process, please let me know!
Bump CLI/Server Versions
For the CLI and Oxen-Server binaries, make sure to update the version in all Cargo.toml files in our Oxen-AI/Oxen repo.
Create Tag
We use git tags to kick off CI within GitHub actions.
git tag -a v$VERSION -m "version $VERSION"
Push Tag
Builds will show up in this repository's releases with the tag you just specified.
git push origin v$VERSION
Update Homebrew Install
There are separate homebrew repositories for the oxen CLI and the oxen-server binary.
You will need to compute the shasum(s) of each release and update the Formula/*.rb files in both repos above.
Use the compute_hashes.sh script in homebrew-oxen repo to compute the shasum(s) of each release.
To verify the formula(s) locally:
cd /path/to/homebrew-oxen
brew install Formula/oxen.rb
oxen --version
cd /path/to/homebrew-oxen-server
brew install Formula/oxen-server.rb
oxen-server --version
Update Release Notes
TODO: We need to get better at this.
Suggestions welcome 🙏.
🐂 Domain Objects
Now for the fun part! Hopefully you have already built Oxen and learned how to add your first command.
In order to fully grok the Oxen codebase, it's important to define a few terms and understand the different domain objects. This way you'll have the right terminology to build upon and know where to look when adding or debugging features.
These domains are defined so we are all speaking the same language while diving into the code base. We will start with what the objects are, why they exist, and how objects are stored on disk, then we will build up intuition of how the system works as a whole.
🔍 Peeking Under the Hood
Similar to git, we store all the metadata for a repository in a hidden local .oxen directory. To start the learning journey, let's initialize an empty Oxen repository locally using oxen init.
mkdir my-repo
cd my-repo
oxen init
echo "# New Oxen Repo" > README.md
oxen add README.md
oxen commit -m "Initial Commit"
The best way to start learning the architecture and different domain objects is by poking around in this directory.
ls .oxen
You will see a variety of files and folders, including:
HEAD
config.toml
history/
refs/
tree/
versions/
Let's use these files and folders as a jumping off point to learn about the different domain objects.
First Up: Repositories
All of the domain objects exist within the context of a "Repository", so let's start there. All of the files and folders within the .oxen directory represent different subcomponents of a Repository, but we need some overarching objects to kick the process off. These are what we call the LocalRepository and RemoteRepository.
Repositories
When we talk about data in Oxen, we usually talk about "Repositories". A Repository lives within your working directory of data in a hidden .oxen directory. You can think of a Repository as a series of snapshots of your data at any given point in time.
Each snapshot contains a "mini filesystem" representing all the files and folders in that snapshot. Each mini filesystem is represented by a commit, and is stored in the .oxen directory so that we can return to it at any point in time.
To see this in action let's instantiate a local oxen repository and see what it looks like.
$ oxen init
$ ls -trla
total 0
drwxr-xr-x 23 bessie staff 736 May 22 16:41 ../
drwxr-xr-x 3 bessie staff 96 May 22 16:41 ./
drwxr-xr-x 10 bessie staff 320 May 22 16:41 .oxen/
This magic .oxen directory is what will hold all the snapshots of your data. Think of it as a local database that lets you roll back your data to any point in time.
Content Addressable File System
How are the different versions stored on disk? Let's add and commit some files to the repository and see what happens.
$ echo "Hello" > hello.txt
$ echo "World" > world.txt
$ oxen add hello.txt world.txt
$ oxen commit -m "Add hello.txt and world.txt"
Each file that gets added and committed to oxen gets stored in a Content Addressable File System in the .oxen/versions directory. Oxen first computes a hash of the file, then stores the file in a subdirectory that mirrors the hash. This means the file can be retrieved by its hash at any time.
$ tree .oxen/versions
.oxen/versions
└── files
    ├── 18
    │   └── 066113d946cfa640ffc8773c83f61b
    │       └── data
    └── a7
        └── 666c8f5aaf946ca629d9d20c29aa6a
            └── data
6 directories, 2 files
What's up with these funky hexadecimal directory names? Well, each directory corresponds to the hash of a file. To see this in action, Oxen has a handy command to inspect information about an individual file.
oxen info -v world.txt
hash size data_type mime_type extension last_updated_commit_id
18066113d946cfa640ffc8773c83f61b 6 text text/plain txt 2c610ae8e424a4c8
oxen info prints out a tab-separated list of the hash, size, data type, mime type, extension, and last updated commit id of the file.
In this case, the hash for the world.txt file is 18066113d946cfa640ffc8773c83f61b. As for the directory structure above, you can see we split the hash and use the first two characters (18) as a prefix for the directory name. This is a common pattern in content addressable file systems to make sure you do not have too many subdirectories in a single directory.
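To make the layout concrete, here is a small sketch (a hypothetical helper, not the actual Oxen function) of how a version path can be derived from a file hash:

use std::path::{Path, PathBuf};

// Hypothetical helper: map a content hash to its path in .oxen/versions
fn version_path(repo_root: &Path, hash: &str) -> PathBuf {
    let (prefix, rest) = hash.split_at(2); // "18" + "066113..."
    repo_root
        .join(".oxen")
        .join("versions")
        .join("files")
        .join(prefix)
        .join(rest)
        .join("data")
}

fn main() {
    let path = version_path(Path::new("."), "18066113d946cfa640ffc8773c83f61b");
    println!("{}", path.display());
    // ./.oxen/versions/files/18/066113d946cfa640ffc8773c83f61b/data
}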
Manually Inspect Older Versions
Currently the files in Oxen are stored uncompressed in the versions directory, so you can simply cat the file to see the contents.
$ cat .oxen/versions/files/a7/666c8f5aaf946ca629d9d20c29aa6a/data
Hello
Note: We have compression in our list of future improvements that could be made to the system, but keeping files uncompressed is also a nice property. It allows us to take advantage of the native file format of the files on disk without additional compression / decompression steps.
Storing New Versions
Let's change the hello.txt file and commit it again.
$ echo "Hello, World!" > hello.txt
$ oxen add hello.txt
$ oxen commit -m "Update hello.txt"
Now look at the .oxen/versions directory. You will see a new hashed directory for the file. This means the file has been updated and a new snapshot has been created.
$ tree .oxen/versions
.oxen/versions
└── files
    ├── 18
    │   └── 066113d946cfa640ffc8773c83f61b
    │       └── data
    ├── a7
    │   └── 666c8f5aaf946ca629d9d20c29aa6a
    │       └── data
    └── ce
        └── 1931b6136c7ad3e2a42fb0521986ba
            └── data
8 directories, 3 files
Let's look at each individual file in the versions dir.
$ cat .oxen/versions/files/a7/666c8f5aaf946ca629d9d20c29aa6a/data
Hello
$ cat .oxen/versions/files/18/066113d946cfa640ffc8773c83f61b/data
World
$ cat .oxen/versions/files/ce/1931b6136c7ad3e2a42fb0521986ba/data
Hello, World!
While this doesn't give you the full picture of how Oxen works, hopefully it gives you a starting point into the Content Addressable File System that Oxen uses to store all versions of the files. We will get into the details of the commit databases and other data structures as we dive into more domains.
LocalRepository
Since all of the data for all of the versions is simply stored in a hidden subdirectory, the first object we introduce is the LocalRepository. This object simply represents the path to the repository so that we know where to look for subsequent objects.
src/lib/src/model/repository/local_repository.rs
pub struct LocalRepository {
    pub path: PathBuf,
    // Optional remotes to sync the data to
    remote_name: Option<String>,
    pub remotes: Vec<Remote>,
}
Whenever starting down a code path within the CLI, the first thing we do is find where the .oxen directory is and instantiate our LocalRepository object.
There is a handy helper method to get a repo from the current dir. This recursively traverses up the directory structure to find a .oxen directory and instantiates the LocalRepository object.
let repository = LocalRepository::from_current_dir()?;
You may want to reference the code for the add command to see how instantiating a LocalRepository works in practice.
You will notice that not only does a LocalRepository have a path, it also has a remote_name and remotes. These are read from .oxen/config.toml and inform Oxen where to sync the data to.
Remotes
A remote in the context of Oxen is simply a name and a url. The name is a human readable representation and the url is the actual location of the remote repository.
pub struct Remote {
    pub name: String,
    pub url: String,
}
The remotes can be set through the oxen config command.
oxen config --set-remote origin http://localhost:3001/my-namespace/my-repo
If you look in the .oxen/config.toml file, you will see the remotes listed there.
remote_name = "origin"
[[remotes]]
name = "origin"
url = "http://localhost:3001/my-namespace/my-repo"
You can have multiple remotes, as well as a default remote specified by remote_name. The default remote is the one that will be used when you run oxen push or oxen pull without specifying a remote.
RemoteRepository
On the other end of the LocalRepository is the RemoteRepository. This object represents the remote repository that the LocalRepository is connected to. It has the same url as the Remote object.
pub struct RemoteRepository {
    pub namespace: String,
    pub name: String,
    pub remote: Remote,
}
All repositories that are stored on the oxen-server have a namespace and name. This helps us organize the repositories on disk in a way that is meaningful to the user.
In order to create a RemoteRepository, we first need to spin up an oxen-server instance. From your debug build you can do something like the following.
export SYNC_DIR=/path/to/sync/dir
./target/debug/oxen-server start
This will start a server on the default host 0.0.0.0 and port 3000. The environment variable SYNC_DIR tells the server where to write the data on disk.
Then we can use the oxen create-remote command from the CLI.
oxen create-remote --name my-namespace/my-repo --host 0.0.0.0:3000 --scheme http
If you look in the SYNC_DIR, you will see a directory structure that mirrors the namespace/repo-name of the repository you just created. There will be a .oxen directory with the remote repository created for you as well.
ls -trla /path/to/sync/dir/my-namespace/my-repo/.oxen
What's cool is that on disk the RemoteRepository has the same structure as the LocalRepository. This means we can use the same code to manipulate the RemoteRepository on the server as we do the LocalRepository on the client.
If you didn't configure the remote earlier, you can do so now.
oxen config --set-remote origin http://0.0.0.0:3000/my-namespace/my-repo
Then simply push the data to the remote.
oxen push
This copies all the data from the local .oxen directory to the remote repository. Remember the versions directory from before? Let's see what it looks like on the remote.
$ cat /path/to/sync/dir/my-namespace/my-repo/.oxen/versions/files/ce/1931b6136c7ad3e2a42fb0521986ba/data
Hello, World!
There we go! The data is intact on the remote server. This is the beauty of Oxen. There are not too many fancy bells and whistles when you look under the hood, just a content addressable file system with a library that is shared between the client and server.
Next up we will look at Commits. These objects represent the group of files that are in a single snapshot, and we will learn how Oxen knows which versions were added, removed, or changed in the repository and when.
Commits
If you are familiar with git, the concepts of a commit and a branch should be very familiar. What you may not have done is look under the hood to see how they are stored. In Oxen, many of the concepts are similar.
A commit is a checksum or hash value representing all the files within a specific version. You may recognize them as strings of hexadecimal characters (0-9 and a-f) looking something like a72b68036af144bfe2dff0fb08a746c4.
Run oxen log within your Oxen repository and you will see the initial commit.
commit a72b68036af144bfe2dff0fb08a746c4
Author: ox
Date: Thursday, 09 May 2024 22:29:00 +00
Initialized Repo 🎉
You will see these hashes all over the place in Oxen and can use them as pointers to get to specific versions.
Commits as Merkle Tree Nodes
Under the hood most objects in Oxen are stored in a Merkle Tree data structure. At the root of each merkle tree is a commit object.
All the nodes in the tree are stored in the .oxen/tree/nodes directory.
$ tree .oxen/tree/nodes/
.oxen/tree/nodes/
├── 589
│   └── 8d0aa535709791ea84a341307fc3
│       ├── children
│       └── node
├── 88b
│   └── e33604f2ae2153443bff158c31495
│       ├── children
│       └── node
└── a72
    └── b68036af144bfe2dff0fb08a746c4
        ├── children
        └── node
You'll see that nodes are content addressable by their hash, and each subdirectory is a level in the merkle tree. These files are hard to inspect on their own, so we can use the oxen node command to inspect the individual node databases.
$ oxen node a72b68036af144bfe2dff0fb08a746c4
CommitNode
hash: a72b68036af144bfe2dff0fb08a746c4
message: adding README
parent_ids: []
author: oxbot
email: oxbot@oxen.ai
timestamp: 2024-08-19 23:06:41.525894 +00:00:00
Here we have a nice beautiful commit object. There is a list of parent commit ids, a message, the author, the email, and the timestamp.
To see the full tree that lies below this commit, you can use the oxen tree command.
$ oxen tree -n a72b68036af144bfe2dff0fb08a746c4
You'll see that the tree is printed out in a human readable format. This tree only has a single README.md file in the root directory. Trees can get much more complex, and we will dive into this more in the Merkle Trees section.
[Commit] a72b68036af144bfe2dff0fb08a746c4 -> adding README parent_ids ""
[Dir] -> 5898d0aa535709791ea84a341307fc3 11 B (1 nodes) (1 files) [latest commit a72b68036af144bfe2dff0fb08a746c4]
[VNode] 88be33604f2ae2153443bff158c31495 (1 children)
[File] README.md -> 43744a971e29c0f56c293f855f11814 11 B [latest commit a72b68036af144bfe2dff0fb08a746c4]
Commit Metadata
All of the metadata within a commit object is important for computing its id. The id can be used to verify the integrity of the data within a commit. More on this later.
The first piece of metadata is the user that made the commit. The user data is read from the global ~/.config/oxen/user_config.toml file. You can set your user info with the oxen config command.
$ oxen config --name 'Bessie' --email 'bessie@your_email.com'
$ cat ~/.config/oxen/user_config.toml
name = "Bessie"
email = "bessie@your_email.com"
It also contains the timestamp of the commit, and a user provided message. All of these pieces of data are used in computing the commit id, which is a unique representation of the data in this commit.
Commit Id (Hash)
Each commit has a unique id (hash) that can be verified to ensure the integrity of the data in this commit. It is a combination of the data within all the files of the commit, the user data, timestamp, and the message.
What's nice about this is that once the data has been synced to the remote server, we can verify that the data is valid by computing the hashes of the files and the commit data and comparing this to the id of the commit in the database.
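As a rough illustration of the idea (the exact fields, ordering, and hash function live in the Oxen source; this sketch just hashes the metadata together):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: the real Oxen scheme may differ in fields, ordering,
// and hash algorithm. The point is that every field contributes to the id.
fn compute_commit_id(
    message: &str,
    author: &str,
    email: &str,
    timestamp: &str,
    parent_ids: &[String],
    root_tree_hash: &str,
) -> String {
    let mut hasher = DefaultHasher::new();
    (message, author, email, timestamp, parent_ids, root_tree_hash).hash(&mut hasher);
    format!("{:x}", hasher.finish())
}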
Commit History
Every commit (except the first) has a list of parent commit ids. Most commits have a single parent, but in the case of a merge commit, there can be multiple parent commit ids. You can traverse the commit history by following the parent commit ids until you hit the first commit.
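As an illustration, here is a sketch of the traversal with stand-in types (the Commit struct and the lookup function are assumptions for the example, not the real Oxen APIs):

struct Commit {
    id: String,
    message: String,
    parent_ids: Vec<String>,
}

// Walk from a starting commit back to the root, following parent ids.
// Merge commits push multiple parents onto the stack.
fn walk_history(head: Commit, get_commit: impl Fn(&str) -> Option<Commit>) {
    let mut stack = vec![head];
    while let Some(commit) = stack.pop() {
        println!("{} {}", commit.id, commit.message);
        for parent_id in &commit.parent_ids {
            if let Some(parent) = get_commit(parent_id) {
                stack.push(parent);
            }
        }
    }
}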
You can use the oxen log command to print out the commit history, starting with the most recent commit on the current branch.
Next Up: Branches
Learn how commits relate to Branches in the next section.
Branches
Branches are a key feature of many VCS systems. They allow users to work in parallel without making changes that step on each other's toes.
The branching model in Oxen is inspired by git, meaning branches are lightweight and quick to create. When creating a branch, we never copy any of the raw datasets in the repository. Under the hood, a branch is really just a named reference to a commit, and creating a new branch simply creates a new named reference.
pub struct Branch {
    pub name: String,
    pub commit_id: String,
}
On the first commit of a repository, a default branch called main is created and points to the initial commit.
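Conceptually, creating a branch is a single key-value write into the refs database. A sketch using the rocksdb crate directly (the real code goes through Oxen's refs module):

use rocksdb::DB;

// Hypothetical sketch: the branch name is the key, the commit id is the value
fn create_branch(refs_db_path: &str, name: &str, commit_id: &str) -> Result<(), rocksdb::Error> {
    let db = DB::open_default(refs_db_path)?;
    db.put(name.as_bytes(), commit_id.as_bytes())
}

// create_branch(".oxen/refs", "foo", "c719c887cc250784")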
Refs
To see how this works in practice, let's look at how branches are stored on disk. All of the branches within a repository are stored in a key-value rocksdb database, which can be found in the .oxen/refs directory.
Let's inspect this database with our oxen db list command.
$ oxen db list .oxen/refs
main c719c887cc250784
This shows us that there is a single branch, main, that points to the commit id c719c887cc250784.
If we create a new branch, say foo, it will also be stored in the database with the same commit id as the branch you are currently on.
$ oxen checkout -b foo
$ oxen db list .oxen/refs
main c719c887cc250784
foo c719c887cc250784
To see the list of current branches, as well as which one you currently have checked out, you can use the oxen branch command.
$ oxen branch
* foo
main
The * indicates that the foo branch is currently checked out. The way we store the current branch is by creating a HEAD file in the .oxen directory.
This file contains the name of the branch or commit id that is currently checked out.
$ cat .oxen/HEAD
foo
Let's make a commit and see how the branches stored on disk change.
$ echo "foo" > foo.txt
$ oxen add foo.txt
$ oxen commit -m "foo commit"
Committing with message: foo commit
Commit 9ef4176b1b4422a7 done.
We now have a new commit id 9ef4176b1b4422a7. If we look at the refs database, we can see that the foo branch has been updated to point to the new commit id.
$ oxen db list .oxen/refs
foo 9ef4176b1b4422a7
main c719c887cc250784
If we look at oxen log, we will see that the foo branch now points to the most recent commit.
commit 9ef4176b1b4422a7
Author: Ox Bot
Date: Thursday, 30 May 2024 04:04:53 +00
foo commit
commit c719c887cc250784
Author: Ox Bot
Date: Tuesday, 28 May 2024 03:03:49 +00
adding questions.jsonl
You can check out a specific commit by using the oxen checkout command with the commit id.
$ oxen checkout c719c887cc250784
This will update the HEAD file to point to the commit id instead of the branch name.
$ cat .oxen/HEAD
c719c887cc250784
You will notice that our foo.txt file is no longer present in the working directory. If you run oxen status you will see that we are now in a "detached HEAD" state. This means that we are no longer on a branch and are instead on an individual commit.
Don't worry, the file foo.txt is still alive and well in the .oxen/versions directory, and can be restored by checking out the foo branch again.
$ oxen checkout foo
That's it! The relationship between branches, commits, and the HEAD commit is really that simple. Branches are just a named reference to a commit id that make it easier to find a particular chain of commits.
You can progress a branch as many commits as you want without affecting the main branch. When you are ready to merge your branch into the main branch, you can use the oxen merge command, which will be covered later.
Next Up: Files & Directories
Now that you know the basic data structures for branches and commits, let's dive into how branches and commits are tied to a set of files and directories with the Merkle Tree data structure.
Next Up: Merkle Trees
Files, Directories and Merkle Trees 🌲
When you create a commit within Oxen, you can think of it as a snapshot of the state of the files and directories in the repository at a particular point in time. This means each commit will need a reference to all the files and directories that are present in the repository at that point in time.
Let's use a small dataset as an example.
README.md
LICENSE
images/
image0.jpg
image1.jpg
image2.jpg
image3.jpg
image4.jpg
This is a simple file structure with a README.md at the top level and a sub-directory of images. Start by initializing a new repository, then adding and committing the files.
oxen init
oxen add README.md
oxen add images/
oxen commit -m "adding data"
On commit we save off all the hashes of the file contents and store the data in a Content Addressable File System (CAFS) within the .oxen/versions directory. This makes it so we don't duplicate the same file data across commits.
$ tree .oxen/versions/
.oxen/versions/
├── 43
│   └── 94f02b679bcf0114b1fb631c250d0a
│       └── data
├── 58
│   └── 8b7f5296c1a6041d350d1f6be41b3
│       └── data
├── 64
│   └── e1a1512c6d5b1b6dcf2122326370f1
│       └── data
├── 74
│   └── bfd17b6b7c9b183878a26e1e62a30e
│       └── data
├── 7c
│   └── 42afd26e73b8bfbc798288f1def1ed
│       └── data
├── c8
│   └── 2d11a1e1223598d930454eecfab6ea
│       └── data
└── dc
    └── 92962a4b05f5453718783fe3fc4b10
        └── data
15 directories, 7 files
Each file is accessible by its hash and the original extension the file was stored with. For example, the hash of images/image0.jpg is 74bfd17b6b7c9b183878a26e1e62a30e and its extension is jpg, so the original contents can be found at .oxen/versions/74/bfd17b6b7c9b183878a26e1e62a30e/data.
To find the hash and extension of any file in a commit, you can use the oxen info command.
oxen info images/image0.jpg
74bfd17b6b7c9b183878a26e1e62a30e 13030 image image/jpeg jpg 12099a4ca3b15c36
The CAFS makes it easy to fetch the file data for a given commit, but we need some sort of database that lists the original file names and paths. This way when switching between commits we can efficiently restore the files that have been added/changed/removed.
Switching Between Versions
The simplest solution would be to have a key-value database for every commit that listed the file paths and pointed to their hashes and extensions.
Commit A
README.md -> {"hash": "64e1a1512c6d5b1b6dcf2122326370f1", "extension": ".md"}
LICENSE -> {"hash": "7c42afd26e73b8bfbc798288f1def1ed", "extension": ""}
images/image1.jpg -> {"hash": "74bfd17b6b7c9b183878a26e1e62a30e", "extension": ".jpg"}
images/image2.jpg -> {"hash": "dc92962a4b05f5453718783fe3fc4b10", "extension": ".jpg"}
images/image3.jpg -> {"hash": "588b7f5296c1a6041d350d1f6be41b3", "extension": ".jpg"}
images/image4.jpg -> {"hash": "c82d11a1e1223598d930454eecfab6ea", "extension": ".jpg"}
images/image5.jpg -> {"hash": "4394f02b679bcf0114b1fb631c250d0a", "extension": ".jpg"}
We could store this in a rocksdb database in .oxen/history/{commit_hash}/files. The keys would be the file paths and the values would be the hashes and extensions. Then when swapping between commits, all we would have to do is clear the current working directory and re-construct all the files from the respective commit database!
Pseudo code:
set commit_hash 1d278f841510b8e7
rm -rf working_dir
for dir, hash, ext in (oxen db list .oxen/history/$commit_hash/files) ;
mkdir -p working_dir/$dir ;
cp .oxen/versions/files/$hash/data$ext working_dir/$dir/ ;
end
Version control complete. Let's call it a day and go relax on the beach 🏖️.
Of course, we are not here to build a naive, inefficient version control tool. Oxen is a blazing fast version control system designed to handle large amounts of data efficiently. Even if clearing and restoring the working directory is simple, there are many reasons it is not optimal (including wiping out untracked files).
Data Duplication 🔥
To see why this naive approach is sub-optimal, imagine we are collecting image training data for a computer vision system. We put Oxen in a loop, adding one new image at a time to the images/ directory. Each time we add an image, we commit the changes.
for i in (seq 100) ;
# imaginary data collection pipeline
cp /path/to/images/image$i.jpg images/image$i.jpg ;
# oxen add and commit
oxen add images/image$i.jpg ;
oxen commit -m "adding image$i" ;
end
If we had gone the naive route, this would balloon in redundancy even with just our list of pointers to hashes. Each database list repeats the same file paths and the hashes over and over again.
Commit A
README.md -> hash1
LICENSE -> hash2
images/image0.jpg -> hash3
images/image1.jpg -> hash4
images/image2.jpg -> hash5
images/image3.jpg -> hash6
Commit B
README.md -> hash1 # repeated 1 time
LICENSE -> hash2 # repeated 1 time
images/image0.jpg -> hash3 # repeated 1 time
images/image1.jpg -> hash4 # repeated 1 time
images/image2.jpg -> hash5 # repeated 1 time
images/image3.jpg -> hash6 # repeated 1 time
images/image4.jpg -> hash7 # NEW
Commit C
README.md -> hash1 # repeated 2 times
LICENSE -> hash2 # repeated 2 times
images/image0.jpg -> hash3 # repeated 2 times
images/image1.jpg -> hash4 # repeated 2 times
images/image2.jpg -> hash5 # repeated 2 times
images/image3.jpg -> hash6 # repeated 2 times
images/image4.jpg -> hash7 # repeated 1 time
images/image5.jpg -> hash8 # NEW
...
Commit 10_000
README.md -> hash1 # repeated N times
LICENSE -> hash2 # repeated N times
images/image0.jpg -> hash3 # repeated N times
images/image1.jpg -> hash4 # repeated N times
images/image2.jpg -> hash5 # repeated N times
images/image3.jpg -> hash6 # repeated N times
images/image4.jpg -> hash7 # repeated N times
images/image5.jpg -> hash8 # repeated N times
...
images/image10_000.jpg -> hash10_000
Do the math once we get to a dataset of 10,000 images. Each commit duplicates over 10,000 values: 10,000 + 10,001 + 10,002 + 10,003 = 40,006 values in our collective databases.
.oxen/history/COMMIT_A/files -> 10,000 values
.oxen/history/COMMIT_B/files -> 10,001 values
.oxen/history/COMMIT_C/files -> 10,002 values
.oxen/history/COMMIT_D/files -> 10,003 values
Total Values: 40,006
A key observation is that we are duplicating a lot of data across commits. This will be a common pattern to look for when optimizing the storage within Oxen.
Optimizations w/ Merkle Trees
Adding one file should not require you to copy the entire key-value database. We need a data structure that can efficiently store the file paths and hashes without duplicating too much data across commits.
Enter Merkle Trees 🌲.
Files and directories are already organized in a tree-like fashion, so a Merkle Tree is a natural fit for storing and traversing the file structure. The Oxen Merkle Tree implementation also makes it so that when we add additional data, we only need to copy subtrees instead of copying the entire database for each commit.
What does a Merkle Tree within Oxen look like?
At the root node of the Merkle tree is a commit hash. This is the identifier you know and love, which you can use to reference any commit in the system.
The root commit hash represents the content of all the data below it, including the files contained in the images/ directory as well as the files directly in the . root directory (README.md, LICENSE, etc). Additionally, all the files within a directory get sectioned off into VNodes. We will return to the importance of VNodes in a bit.
At each level of the tree we see the contents of all the files hashed within that directory, and bucketed into VNodes.
Adding a File
To see what happens when we add a new file to our repository, let's revisit our previous example of adding images to the images/ directory. Say we have 8 images in our images/ directory and we want to add a new image (9.jpg).
The first thing we have to do is find which VNode bucket it falls into (more on this later). Then we can recompute the hash of this subtree, and recursively update the hashes above it until we get to the root node.
In this case we make four total updates to the tree, highlighted in green.
- Add the contents of the new image to our .oxen/versions/ directory
- Find the VNode it belongs to, and deep copy it to a new VNode with a new hash
- Update the VNode hash of the images/ parent directory
- Update the root node hash
The Merkle Tree nodes are all global to the repository, and can be re-used and shared between commits. Instead of copying the entire database for each new commit, we only copy the subtrees that changed. On adding a file, we only need to update a single VNode and copy its contents. This is a much faster operation than copying every file within our databases.
For another example, let's see what happens when we update the README.md file.
This time, we only need to update the VNode that contains the README.md file and its parent in the root node.
Why use VNodes?
One of the goals of Oxen is to be able to scale to directories with an arbitrary number of files. Imagine for a second that you have a directory of 100k or 1 million images. Storing all of these values directly at the directory level node would be inefficient. Every time you commit a single image to the directory, you would need to copy all the pointers and recompute the hash for the entire directory.
For example imagine we had no VNodes at the directory level.
If we want to add a single file, we would have to copy all the pointers and recompute the hash for the entire directory.
VNodes add an intermediate bucket that we can add files to, so we only have to copy a subset of pointers. Which VNode a file belongs to is computed from the hash of the file path itself. This way files get evenly distributed into buckets within the tree.
You'll notice two parts to the VNode. The first is the first two letters (AB) of the hash of the file path, and the second is the hash of the VNode contents (#DFEGA72). To add an image, we now only need to find the bucket (based on its file path), compute its new hash, and copy the items of the VNode database over to its new hash.
To drive this home, let's go back to our example directory of 10,000 images with the naive implementation from before. Remember, 4 additions to the images directory after it contained 10,000 files resulted in 40,006 values in our database. Say our bucket size for VNodes is 10,000/256 ~= 40. This means on average we are copying 40 values with each commit, which results in 10,160 total values in our DB instead of 40,006.
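A sketch of the bucketing logic (illustrative; the real hash function and constants live in the Oxen source):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Pick which VNode bucket a file path falls into
fn vnode_bucket(path: &Path, num_vnodes: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish() % num_vnodes
}

fn main() {
    // With 256 buckets, ~10,000 files average out to ~40 entries per VNode,
    // so a commit copies ~40 pointers instead of all 10,000.
    let bucket = vnode_bucket(Path::new("images/image0.jpg"), 256);
    println!("images/image0.jpg falls into VNode bucket {}", bucket);
}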
Printing the Merkle Tree
To bring these concepts to life, let's create a repo of many images and use the oxen tree command to print the Merkle Tree. In the Oxen-AI/Oxen repo we have a script that will create a directory with an arbitrary number of images and add them to the images/ directory.
When developing and testing Oxen, this script is handy to generate synthetic datasets to push the performance of the system. For example you could create a dataset of 1,000,000 images and see how long it takes to add and commit the changes.
# WARNING: This will create a directory with 1,000,000 images and take a while to run
$ python benchmark/generate_image_repo_parallel.py --output_dir ~/Data/1m_images --num_images 1000000 --num_dirs 2 --image_size 64 64
For this example we will stick to a smaller dataset of 20 images. It will be easier to visualize the Merkle Tree.
# This will create a much smaller dataset of 20 images
$ python benchmark/generate_image_repo_parallel.py --output_dir ~/Data/20_images --num_images 20 --num_dirs 1 --image_size 64 64
After the dataset is created, go ahead and create and initialize an oxen repository.
$ cd ~/Data/20_images
$ oxen init
Before we add and commit the files, we are going to make a quick tweak to the configuration to use a smaller VNode bucket size. The default size is 10,000, but we are going to set it to 6 to make it easier to see the tree updates in this toy example.
Edit the .oxen/config.toml file to set the vnode_size to 6.
$ cat .oxen/config.toml
remotes = []
min_version = "0.19.0"
vnode_size = 6
Now add and commit the files.
$ oxen add .
$ oxen commit -m "adding all data"
Then we can use the oxen tree command to print the entire Merkle Tree.
$ oxen tree
[Commit] bb2e7778ddc8f40788d4d34993955bfd "adding data" -> Bessie ox@oxen.ai parent_ids ""
[Dir] 7a892f11ae586978f3b170182599cc5e "/" (15.8 MB) (22 files) (commit bb2e7778ddc8f40788d4d34993955bfd) (1 children)
[VNode] 801de12b06a74b5a2d0b978af067e32b (3 children)
[File] dcd78180c335f3afed68656b6b12c248 "README.md" (98 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
[Dir] 73ae655940b4873cf6b1557c3806d65c "images/" (15.8 MB) (20 files) (commit bb2e7778ddc8f40788d4d34993955bfd) (1 children)
[VNode] 51d73f367fcbc4f11228ff2e56fba5d3 (1 children)
[Dir] 79c6625dcad70d16be56aca9426442ee "split_0/" (15.8 MB) (20 files) (commit bb2e7778ddc8f40788d4d34993955bfd) (4 children)
[VNode] 3543cada59e52d3a391603661b6f9721 (6 children)
[File] 6d11185298ec825208a1f3fce23b9d6c "noise_image_14.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 56e2d45af9958af0680fceb3ab00d18c "noise_image_17.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 82c2ec408abf2defc2dd5289b29a1e80 "noise_image_19.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1d952254a3be50f3d2d70a1398aee524 "noise_image_4.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] cb118df353e0814d42472818405b9384 "noise_image_7.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 2ccc49262256aae9802c581d90736b34 "noise_image_9.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] e5529567f3246881d58335ee5102281c (4 children)
[File] 38593fea717a1d4c2e771674ebc9ca81 "noise_image_0.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] d0871bf54a0c99f2336da66ab20b6785 "noise_image_1.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] e406e9a852da65dad3f012dd86a98919 "noise_image_10.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 48dca231fd96fd847935e7e6623b32d9 "noise_image_11.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] c9e4baf0332cf23a803a8870e295e0b5 (7 children)
[File] abf629ca6ed9414e8a4f884d2f98dd2a "noise_image_13.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4598fdac03e5aaa52aa5cd1c51231a2 "noise_image_15.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 3e0989396c1d6a7f38164faf96c4662e "noise_image_18.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1506f02a6a10123af68084200b67583a "noise_image_2.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1d72a37baeef7b0c02b4ce7482e4592 "noise_image_5.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 26e656316e9e292f8b27fcb49654237a "noise_image_6.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4f3611ebc27072c6359ac05f4e6c98e "noise_image_8.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] 38c8edc066ae7d1ac36a92223a9c39ee (3 children)
[File] 6dc552a73f1beb27356903f84c4a7b33 "noise_image_12.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 9335d768de6058fbe3da1a7857c400e7 "noise_image_16.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4855043d603912f6141f5f148851146f "noise_image_3.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] fe0b2b028ff23fd726b14f6c7694ffd8 "images.csv" (764 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
The first thing you'll notice is that not every VNode has the same number of children. This is because we first decide how many VNodes to instantiate based on the target VNode size, and then fill them with files. Each file is assigned to a bucket via file_hash % num_vnodes, so some VNodes end up with more children than others. On average, a good hashing algorithm will distribute the files evenly across the VNodes.
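To make this concrete, here is a minimal sketch of the bucketing logic (illustrative only; Oxen's actual implementation, and how it derives num_vnodes, may differ):

use std::collections::HashMap;

// Assigns each file hash to a VNode bucket via hash % num_vnodes.
// `num_vnodes` would be derived from the directory size and a target
// number of entries per VNode.
fn bucket_files(file_hashes: &[u128], num_vnodes: u128) -> HashMap<u128, Vec<u128>> {
    let mut buckets: HashMap<u128, Vec<u128>> = HashMap::new();
    for &hash in file_hashes {
        // A good hash spreads files evenly on average, but any single
        // bucket can end up with more or fewer children than others.
        buckets.entry(hash % num_vnodes).or_default().push(hash);
    }
    buckets
}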
Remember - VNodes exist in order to help us update smaller subtrees when we add, remove, or change a file. To see this in action, let's add a new image to the images/split_0 directory by converting a png to a jpeg.
ffmpeg -i images/split_0/noise_image_11.png images/split_0/noise_image_11.jpg
oxen add images/split_0/noise_image_11.jpg
oxen commit -m "adding images/split_0/noise_image_11.jpg"
Then print the tree again, and try to find where the new image was added.
$ oxen tree
[Commit] 1382a90346dce99134fbb8a7359d81df "adding images/split_0/noise_image_11.jpg" -> Bessie ox@oxen.ai parent_ids "bb2e7778ddc8f40788d4d34993955bfd"
[Dir] 73aa45daae11e300cb452b072f22bc3a "/" (16.0 MB) (23 files) (commit 1382a90346dce99134fbb8a7359d81df) (1 children)
[VNode] 57e2282b08cfdb57c66d5c2c0341fc2d (3 children)
[File] dcd78180c335f3afed68656b6b12c248 "README.md" (98 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
[Dir] 7363444d86e0cb47c4247bd7f05c13f3 "images/" (16.0 MB) (21 files) (commit 1382a90346dce99134fbb8a7359d81df) (1 children)
[VNode] 8344a2ef7c01c891fb33a357efabc2b6 (1 children)
[Dir] b5d819fd018381b3026848bed854830b "split_0/" (16.0 MB) (21 files) (commit 1382a90346dce99134fbb8a7359d81df) (4 children)
[VNode] 3543cada59e52d3a391603661b6f9721 (6 children)
[File] 6d11185298ec825208a1f3fce23b9d6c "noise_image_14.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 56e2d45af9958af0680fceb3ab00d18c "noise_image_17.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 82c2ec408abf2defc2dd5289b29a1e80 "noise_image_19.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1d952254a3be50f3d2d70a1398aee524 "noise_image_4.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] cb118df353e0814d42472818405b9384 "noise_image_7.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 2ccc49262256aae9802c581d90736b34 "noise_image_9.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] e5529567f3246881d58335ee5102281c (4 children)
[File] 38593fea717a1d4c2e771674ebc9ca81 "noise_image_0.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] d0871bf54a0c99f2336da66ab20b6785 "noise_image_1.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] e406e9a852da65dad3f012dd86a98919 "noise_image_10.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 48dca231fd96fd847935e7e6623b32d9 "noise_image_11.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] c9e4baf0332cf23a803a8870e295e0b5 (7 children)
[File] abf629ca6ed9414e8a4f884d2f98dd2a "noise_image_13.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4598fdac03e5aaa52aa5cd1c51231a2 "noise_image_15.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 3e0989396c1d6a7f38164faf96c4662e "noise_image_18.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1506f02a6a10123af68084200b67583a "noise_image_2.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 1d72a37baeef7b0c02b4ce7482e4592 "noise_image_5.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 26e656316e9e292f8b27fcb49654237a "noise_image_6.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4f3611ebc27072c6359ac05f4e6c98e "noise_image_8.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[VNode] 88e4003b65f2f3f16c16a91230593746 (4 children)
[File] 6550c5d7deaa62dbb78d8effbbda375f "noise_image_11.jpg" (285.4 KB) (commit 1382a90346dce99134fbb8a7359d81df)
[File] 6dc552a73f1beb27356903f84c4a7b33 "noise_image_12.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 9335d768de6058fbe3da1a7857c400e7 "noise_image_16.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] 4855043d603912f6141f5f148851146f "noise_image_3.png" (788.1 KB) (commit bb2e7778ddc8f40788d4d34993955bfd)
[File] fe0b2b028ff23fd726b14f6c7694ffd8 "images.csv" (764 B) (commit bb2e7778ddc8f40788d4d34993955bfd)
You will see that the VNode 88e4003b65f2f3f16c16a91230593746 is the one that contains the new image. The other VNodes have not changed. As a result, we only had to write a new copy of a single VNode (3 children before, 4 after) instead of rewriting entries for all 20 files.
When you get into larger directories, there is a trade-off between the number of VNodes and the size of each VNode. The fewer VNodes there are, the faster it is to read all the nodes. The smaller each VNode is, the faster it is to write, and the less data we copy when we add, remove, or change a file. In practice we find ~10k entries per VNode is a good compromise in terms of storage and performance.
File Chunk Deduplication
Not yet implemented, but scoped out in the File Chunk Deduplication section under Optimizations below.
Benefits of the Merkle Tree
If you are new to Merkle Trees, hopefully this gives you a good intuition for how they work in practice. A Merkle Tree has a few nice properties as our core data structure.
- When we add, remove, or change a file, we only need to update the subtree that contains that file. This means the storage cost of each new version grows logarithmically with the number of files in the repository instead of linearly.
- To recompute the root hash of a commit, we only need to hash the file paths and the hashes of the files that have changed. This means we can efficiently verify the integrity of the data by recomputing subtrees instead of the whole tree.
- When syncing repositories over the network, it tells us exactly which small chunks of data have changed, so we only have to transfer those.
- Since each subtree is also a Merkle tree, we can clone small subtrees, make changes, and push them back up to the parent. This is powerful when, for example, you only want to update the README but have a directory of images you are not planning on changing.
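To put a rough number on the first benefit (a back-of-the-envelope estimate, not a measurement): with b entries per VNode and N files, an edit rewrites on the order of log_b(N) nodes along the path from the changed file to the root.

nodes rewritten per edit ≈ log_b(N)
log_10000(100,000,000) = 2

So even a 100-million-file repository with ~10k entries per VNode only copies a couple of levels of nodes per change, rather than rewriting the full file listing.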
Commands
The core commands in Oxen map to git, so there is an easy learning curve to get started.
oxen init
oxen add images/
oxen status
oxen commit -m "adding images"
oxen push origin main
Join us as we break down each command step by step, as if you were building it from scratch.
init
oxen init
TODO: add details on how the command works
oxen add
After initializing a repository, you can add files to it using the oxen add command.
oxen add <file>
This is the workhorse of oxen that does most of the compute. Under the hood, oxen add does a few operations (see the sketch after this list):
- Hashes the file(s) and directory structure
- Computes any additional metadata about the file
  - Sizes
  - File Types
  - Schemas (for JSON, CSV, etc.)
- Copies a version of the file into the content addressable versions store
- Adds a record to the staged index
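Here is a rough sketch of that pipeline in Rust. The .oxen/versions layout and the staged index format shown here are illustrative stand-ins, not Oxen's actual on-disk formats, and the real implementation also computes the metadata described above.

use std::collections::hash_map::DefaultHasher;
use std::fs::{self, OpenOptions};
use std::hash::Hasher;
use std::io::Write;
use std::path::Path;

fn add_file(repo_root: &Path, file: &Path) -> std::io::Result<()> {
    let data = fs::read(file)?;

    // 1. Hash the file contents (Oxen uses xxHash; std's default
    //    hasher stands in to keep this sketch dependency-free).
    let mut hasher = DefaultHasher::new();
    hasher.write(&data);
    let hash = format!("{:016x}", hasher.finish());

    // 2. Copy a version of the file into the content addressable store.
    let version_dir = repo_root.join(".oxen").join("versions").join(&hash);
    fs::create_dir_all(&version_dir)?;
    fs::copy(file, version_dir.join("data"))?;

    // 3. Append a record to the staged index (sketched here as a TSV).
    let mut index = OpenOptions::new()
        .create(true)
        .append(true)
        .open(repo_root.join(".oxen").join("staged.tsv"))?;
    writeln!(index, "{}\t{}", file.display(), hash)?;
    Ok(())
}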
Under the hood
In order to see what this looks like on disk, let's create a few files and directories, then add them to a fresh repository.
mkdir my-project
cd my-project
oxen init
TODO:
clone
Data Types
There are core data types that Oxen detects and can do special processing with. This helps increase visibility into your data, and makes it extensible to your use case.
- Text
- Image
- Audio
- Video
- Tabular
- Blob
Optimizations
There are many optimizations that make Oxen go brrr. From the core merkle tree structure, to the hashing protocol, to networking, to interacting with remote datasets, Oxen is meant to make it feel like you have terabytes of data at your fingertips no matter what machine you are on.
TLDRs
We often get asked: "What makes Oxen different from other VCSs?"
Without diving into the nitty-gritty details, here are some highlights. If you want to go deeper, don't worry, we also dive deep into the implementation details of each throughout the book.
Merkle Tree
- Downloading Sub Trees
- Per Folder Sub Trees
- Block Level Dedup
- Only download latest
  - When you get to TB scale data, you do not want to have to pull down data from previous commits to compute the current tree.
- Push / Pull Bottle Neck
- Objects
  - Trees
  - VNodes
  - Blobs
  - Schemas
Hashing
- xxHash
  - pure hashing throughput
  - non-cryptographic hashing fn
Data Frames and Schemas are First Class Citizens
Other VCS systems are optimized for text files and code. In the case of datasets, we often deal with data frames, which have additional properties such as a schema that we want to track.
Native File Formats
Take advantage of existing file formats such as arrow, parquet, duckdb, etc. Unlike git and other VCSs that try to be smart with compression, we can leverage existing file formats that are already highly optimized for their specific use cases.
For example, an Apache Arrow file can be memory mapped, which makes random access to rows very fast. If we compressed this data and had to reconstruct it, we would lose the benefits of the memory mapped file.
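As a quick illustration of why memory mapping matters, here is a sketch using the memmap2 crate (our choice for the example, not necessarily what Oxen uses), with a hypothetical file name:

use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("data.arrow")?; // hypothetical, non-empty file
    // Safety: we assume no other process mutates the file while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // The OS pages in only the bytes we touch, so jumping to an
    // arbitrary offset is cheap even for multi-GB files.
    let offset = mmap.len() / 2;
    println!("byte at midpoint: {}", mmap[offset]);
    Ok(())
}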
This is a design tradeoff made throughout Oxen: it is less efficient in terms of storage on disk, but easier to integrate with.
Visibility into data is a key design goal of Oxen, and visibility also means data is fast to access. The fewer assumptions we make here, the more we can leverage and extend existing file formats.
Concurrency
- Fearless concurrency
- Hashing data
- Moooooving data over the network
- Moooooving data on disk
Networking
- Smart Chunking
Remote Workspaces
Don't download the entire dataset just to contribute.
- oxen workspace add
- oxen workspace commit
- oxen workspace df
- oxen workspace ls
Compression (Coming Soon)
- Block level dedup
- zlib
Merkle Tree
Hashing
One of the optimizations within Oxen is using the xxHash algorithm to hash file contents. xxHash is a non-cryptographic hash designed to be extremely fast while staying memory efficient.
Compared to SHA or MD5 hashes, which hash data at < 1 GB/s, xxHash can hash data at around 30 GB/s. This is a significant improvement for large files, and speeds up the process of adding and committing files to Oxen.
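For example, here is a minimal sketch of hashing a file with xxh3 via the xxhash-rust crate (one of several Rust xxHash bindings; shown for illustration, not necessarily the crate Oxen uses):

use xxhash_rust::xxh3::xxh3_128; // requires the `xxh3` feature of xxhash-rust

fn hash_file(path: &str) -> std::io::Result<String> {
    // Reads the whole file into memory; a streaming hasher would
    // avoid this for very large files.
    let data = std::fs::read(path)?;
    Ok(format!("{:032x}", xxh3_128(&data)))
}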
Inspect File Hashes
You can quickly inspect the xxHash of any file using the oxen info command.
oxen info -v file.txt
Compression
TODO: Block level deduplication can be turned on in order to shrink older datasets. It comes at the cost of reconstructing the data, so it can be turned on and off depending on your use case.
File Chunk Deduplication
The Merkle Tree optimizations we've talked about so far make adding and committing snapshots of the directory structure to Oxen snappy. The remainder of the storage cost is at the individual leaf nodes of the tree. We want Oxen to be efficient at storing many files as well as efficient at storing large files. In our context these large files may be data frames of parquet, arrow, jsonl, csv, etc.
To visualize how much storage space individual nodes take, let's look at a large CSV in the .oxen/versions directory. The pointers and hashes within the tree itself are relatively small, but the file contents themselves are large.
Remember, so far each time you make a commit, we make an entire copy of the file contents and put it into the .oxen/versions directory under its hash.
This can be a problem if we keep updating the same file over and over again. Five small changes to a 1GB file will result in 5GB of storage.
Think back to the key insight from earlier about duplicated data between versions of our tree. The same thing applies to large files. For example, what if you are editing and committing a CSV file one row at a time?
This results in a lot of duplicated data. In fact, rows 0-1,000,000 are identical between Version A and Version B.
To combat this, Oxen uses a technique called file chunk deduplication. Instead of duplicating the entire raw file in the .oxen/versions directory, we break the file into "chunks", then store these chunks in a content addressable manner.
Then if you update any rows in the file, we create a new chunk for that section and update the leaf nodes of the tree. When a row is added at the end, all we have to do is rewrite the final chunk and propagate the changes up the tree.
What's great about this is that we now also only need to sync the chunks that have changed over the network, not the entire file. Chunks are ~16kb in size, so updating a single value in a 1GB file will only sync this small portion of the file.
When swapping between versions, we simply reconstruct the chunks associated with each file and write them to disk. This means we need a small db that stores the mapping from chunk_idx -> chunk_hash, which we can now store in the .oxen/versions directory instead of the file contents.
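Since this feature is still being scoped out, here is a hypothetical sketch of the core idea: split a file into fixed-size chunks, content-address each chunk, and keep the ordered chunk_idx -> chunk_hash list needed to reconstruct the file. The names and hash function are illustrative, not Oxen's.

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

const CHUNK_SIZE: usize = 16 * 1024; // ~16kb chunks, as described above

// Stores each unique chunk once, keyed by its hash, and returns the
// ordered list of chunk hashes (the chunk_idx -> chunk_hash mapping).
fn chunk_and_store(data: &[u8], store: &mut HashMap<u64, Vec<u8>>) -> Vec<u64> {
    let mut chunk_hashes = Vec::with_capacity(data.len() / CHUNK_SIZE + 1);
    for chunk in data.chunks(CHUNK_SIZE) {
        let mut hasher = DefaultHasher::new();
        hasher.write(chunk);
        let hash = hasher.finish();
        // Unchanged chunks hash the same across versions, so they are
        // stored (and synced over the network) only once.
        store.entry(hash).or_insert_with(|| chunk.to_vec());
        chunk_hashes.push(hash);
    }
    chunk_hashes
}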
Storage
TODO: Support different storage backends
- Local SSDs
- NFS
- S3
- FSpec
Why not write an extension for Git?
We often get asked "Why not use Git-LFS?" and why write an entire Version Control System (VCS) from scratch?
The short answer is developer experience. Owning the tool means owning the end to end developer experience. The plugins and extensions to git that we have tried in the past feel shoehorned in and clunky.
There are certain properties of git that make it great for versioning code, but that fall short when it comes to data. Git is really good at huge histories of small text files, like codebases. Where git does not shine is binary files. Many datasets consist of binary files such as images, video, audio, parquet files, etc., which are not efficiently stored in git.
There are extensions to git such as Git-LFS that work okay when you have one or two large assets complementary to your code, but they are still extremely slow when it comes to versioning large sets of large binary files.
This section dives into some of the reasons why we did not want to use Git-LFS and wrote an entire VCS from scratch.
What git is missing (could be a good blog post title)
TODO: fill in this section.
LFS is More Complexity
The first reason is purely the mental model users have to keep in their head.
To quote Gregory Szorc:
"LFS is more complex for Git end users.
Git users have to install, configure, and sometimes know about the existence of Git LFS. Version control should just work. Large file handling should just work. End-users shouldn't have to care that large files are handled slightly differently from small files."
Push / Pull Bottleneck
How do I atomically add one file without pulling everything? Even further, how do I modify one row or column?
Ex) You update test.csv, I update README.md, and we can both push without worrying about pulling first.
Ex) Updating a data frame directly with duckdb.
Do we call this a VCS-DB or some new term? It allows for higher throughput writes while still letting you snapshot, diff, and query the history.
Network Protocols
Git sequentially reads and writes objects via packfiles over the network to create your local and remote copies of the repository. This is inefficient for large files, resulting in slow uploads / downloads of data that is not text.
SHA-256 is Slow
In order to tell if a file has changed, version control systems use a hashing function over the data. By default git uses SHA-1 (with opt-in support for SHA-256), and these cryptographic hashes are relatively slow. By contrast, Oxen uses the much faster xxHash algorithm. This results in hashing speeds of 31 GB/s for xxHash vs 0.8 GB/s for SHA-256.
This large difference in hashing speed translates to a noticeable improvement in developer experience when it comes to larger datasets.
If you want to see this in action, simply run git add on a large directory of data vs oxen add and see the difference in time.
Git Status Spews All Files
Purely from a developer experience standpoint this is not great. What if you add 100k images in a single commit? It's not practical for git status to show you all 100k files that were added.
With datasets we are dealing more with distributions of data, not individual data points.
Downloading Full History
Git by default downloads the entire history of the repository on clone. When it comes to datasets, I may only want to download the latest version of a file, not the entire history.
Oxen gives you the flexibility to download just what you need at the time of training, inference, or testing.
Removing Objects
Removing objects from git is complex because of the references made throughout the packfiles. Indices have to be recomputed and the history of the repository has to be rewritten if you want to remove a single file.
https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery
There are a lot of great things about Git, but one feature that can cause issues is the fact that a git clone downloads the entire history of the project, including every version of every file. This is fine if the whole thing is source code, because Git is highly optimized to compress that data efficiently. However, if someone at any point in the history of your project added a single huge file, every clone for all time will be forced to download that large file, even if it was removed from the project in the very next commit. Because it's reachable from the history, it will always be there.
Packfiles
One of the ways that git saves space is by using delta encoding to store only the differences between files. It does this through the use of packfiles. Objects are not stored directly in the objects directory, but rather packed together in a packfile to make data transfer and compression easier.
Within a pack file there are multiple ways an object can be stored.
This is great for codebases, but not optimal for binary files.
Sources
Git SCM book
- https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery
Avoid Git-LFS if Possible
- https://news.ycombinator.com/item?id=27134972
- https://gregoryszorc.com/blog/2021/05/12/why-you-shouldn%27t-use-git-lfs/
Dev.to git-internals
- https://dev.to/calebsander/git-internals-part-2-packfiles-1jg8
Pack files
Git's database internals I: packed object store
- https://github.blog/2022-08-29-gits-database-internals-i-packed-object-store/#delta-compression