Optimizations

There are many optimizations that make Oxen go brrr. From the core merkle tree structure, to hashing protocol, to networking, to interacting with remote datasets. Oxen is meant to make it feel like you have terabytes of data at your fingertips whatever machine you are on.

TLDRs

We often get asked: "What makes Oxen different from other VCS?"

Without diving into the gnitty gritty details, here are some highlights. If you want to go deeper, don't worry, we also dive deep into the implementation details of each throughout the book.

Merkle Tree

Downloading Sub Trees
Per Folder Sub Trees
Block Level Dedup
Only download latest
- When you get to TB scale data, you do not want to have to pull down data from previous commits to compute the current tree.
Push / Pull Bottle Neck
Objects
- Trees
- VNodes
- Blobs
- Schemas

Hashing

xxHash
pure hashing throughput
non-cryptographic hashing fn

Data Frames and Schemas are First Class Citizens

Other VCS systems are optimized for text files and code. In the case of datasets, we often deal with data frames which have other properties such as schema that we want to track.

Native File Formats

Take advantage of existing file formats such as arrow, parquet, duckdb, etc. Unlike git or other VCS that try to be smart with compression, we can leverage the existing file formats that are already highly optimized for the specific use case.

For example, apache arrow is a memory mapped file that makes random access to rows very fast. If we were to compress this data and reconstruct it we would lose the benefits of the memory mapped file.

This is a design tradeoff that is made throughout oxen which makes it less efficient in terms of storage on disk, but easier to integrate with.

Visibility into data is a key design goal of Oxen. Visibility means speed for data to be visible as well, and the less assumptions we make here, the more we can leverage and extend existing file formats.

Concurrency

Fearless concurrency
Hashing data
Moooooving data over the network
Moooooving data on disk

Networking

Smart Chunking

Remote Workspaces

Don't download the entire dataset just to contribute.

oxen workspace add
oxen workspace commit
oxen workspace df
oxen workspace ls

Compression (Coming Soon)

Block level dedup
zlib