ProllyTree Storage Backends Guide

ProllyTree supports multiple storage backends to meet different performance, persistence, and deployment requirements. This guide describes each available backend: its characteristics, use cases, and configuration options.

Overview

ProllyTree uses a pluggable storage architecture through the NodeStorage trait, allowing you to choose the appropriate backend for your specific needs:

  • InMemoryNodeStorage: Fast, volatile storage for development and testing
  • FileNodeStorage: Simple file-based persistence for local applications
  • RocksDBNodeStorage: High-performance LSM-tree storage for production workloads
  • GitNodeStorage: Git object store integration for development (experimental)
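
The pluggable design can be illustrated with a minimal sketch. The `BlobStore` trait, the `MemStore` type, and the `store_and_reload` function below are hypothetical simplifications written for this guide; they are not the actual `NodeStorage` trait, whose methods and signatures differ. The point is only that code written against a trait works unchanged with any backend that implements it.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a node storage trait.
/// The real `NodeStorage` trait in prollytree has a different signature.
pub trait BlobStore {
    fn put(&mut self, hash: [u8; 32], bytes: Vec<u8>);
    fn get(&self, hash: &[u8; 32]) -> Option<Vec<u8>>;
}

/// Volatile backend: every node lives in a HashMap.
#[derive(Default)]
pub struct MemStore {
    nodes: HashMap<[u8; 32], Vec<u8>>,
}

impl BlobStore for MemStore {
    fn put(&mut self, hash: [u8; 32], bytes: Vec<u8>) {
        self.nodes.insert(hash, bytes);
    }

    fn get(&self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        self.nodes.get(hash).cloned()
    }
}

/// Generic over the backend: swapping storage does not change this code.
pub fn store_and_reload<S: BlobStore>(
    store: &mut S,
    hash: [u8; 32],
    bytes: &[u8],
) -> Option<Vec<u8>> {
    store.put(hash, bytes.to_vec());
    store.get(&hash)
}
```

A file-backed or database-backed type would implement the same trait, and `store_and_reload` would accept it without modification.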

InMemoryNodeStorage

Description

The in-memory storage backend keeps all ProllyTree nodes in a HashMap in RAM. This provides the fastest access times but offers no persistence across application restarts.

Characteristics

  • Performance: Fastest read/write operations
  • Persistence: None - data is lost when the application terminates
  • Memory Usage: Entire tree stored in RAM
  • Concurrency: Thread-safe with internal locking
  • Storage Overhead: Minimal (just HashMap overhead)
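
As a rough illustration of a HashMap guarded by internal locking, the sketch below wraps the map in `Arc<RwLock<...>>` so that clones of the store can be shared across threads. The `SharedMemStore` type and the choice of `RwLock` are assumptions made for this example; this guide does not state which locking primitive `InMemoryNodeStorage` actually uses.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical thread-safe in-memory store (not the prollytree type).
#[derive(Clone, Default)]
pub struct SharedMemStore {
    inner: Arc<RwLock<HashMap<[u8; 32], Vec<u8>>>>,
}

impl SharedMemStore {
    /// Writers take the exclusive lock.
    pub fn put(&self, hash: [u8; 32], bytes: Vec<u8>) {
        self.inner.write().unwrap().insert(hash, bytes);
    }

    /// Readers share the lock, so lookups do not block each other.
    pub fn get(&self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        self.inner.read().unwrap().get(hash).cloned()
    }

    pub fn len(&self) -> usize {
        self.inner.read().unwrap().len()
    }
}

/// Spawn `writers` threads that each insert one distinct node.
pub fn concurrent_inserts(store: &SharedMemStore, writers: u8) {
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let store = store.clone();
            thread::spawn(move || store.put([i; 32], vec![i]))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

Because the lock lives inside the store, callers only need a shared reference or a cheap clone; they never handle the lock directly.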

Use Cases

  • Unit testing: Fast test execution without I/O overhead
  • Development: Quick prototyping and debugging
  • Caching layer: Temporary storage for frequently accessed data
  • Small datasets: When entire dataset fits comfortably in memory

Usage Example

use prollytree::storage::InMemoryNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;

let storage = InMemoryNodeStorage::<32>::new();
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

// Data will be lost when `tree` goes out of scope
tree.insert(b"key".to_vec(), b"value".to_vec());

Configuration

The in-memory storage is self-contained and requires no configuration. It automatically manages memory allocation and cleanup.

FileNodeStorage

Description

The file storage backend persists each ProllyTree node as a separate file on the filesystem using binary serialization. Configuration data is stored in separate files with a config_ prefix.

Characteristics

  • Performance: Moderate - limited by filesystem I/O
  • Persistence: Full persistence across application restarts
  • Storage Format: Binary-serialized nodes (using bincode)
  • File Organization: One file per node, named by hash
  • Platform Support: Works on all platforms with filesystem access
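
The "one file per node, named by hash" layout can be sketched with the standard library alone. The helper names below (`hex_name`, `node_path`, `write_node`, `read_node`) are hypothetical, and the sketch writes raw bytes rather than bincode-serialized nodes, so treat it as an illustration of the layout rather than the backend's actual code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hex-encode a hash so it can be used as a file name.
pub fn hex_name(hash: &[u8]) -> String {
    hash.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Path of the file that holds the node with the given hash.
pub fn node_path(dir: &Path, hash: &[u8; 32]) -> PathBuf {
    dir.join(hex_name(hash))
}

/// Write one node to its own file, creating the directory if needed.
pub fn write_node(dir: &Path, hash: &[u8; 32], bytes: &[u8]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    fs::write(node_path(dir, hash), bytes)
}

/// Read a node back by hash; fails with `NotFound` if it was never written.
pub fn read_node(dir: &Path, hash: &[u8; 32]) -> io::Result<Vec<u8>> {
    fs::read(node_path(dir, hash))
}
```

Since the file name is derived from the content hash, writing the same node twice targets the same file, which is why this layout is easy to inspect by hand.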

Use Cases

  • Local applications: Desktop applications needing persistence
  • Development: When you need persistence but don't want database setup
  • Small to medium datasets: Up to thousands of nodes
  • Debugging: Easy to inspect individual node files

Usage Example

use prollytree::storage::FileNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;
use std::path::PathBuf;

let storage_dir = PathBuf::from("./prolly_data");
let storage = FileNodeStorage::<32>::new(storage_dir);
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

tree.insert(b"key".to_vec(), b"value".to_vec());
// Data persists in ./prolly_data/ directory

File Structure

prolly_data/
├── a1b2c3d4e5f6... (node file - hex hash)
├── f6e5d4c3b2a1... (node file - hex hash)
├── config_tree_config (configuration file)
└── config_custom_key (custom configuration)

Limitations

  • Scalability: Performance degrades with a large number of nodes
  • Atomicity: No atomic updates across multiple nodes
  • Concurrent Access: Not safe for concurrent writers
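
Because the backend is not safe for concurrent writers, applications that share a storage directory need their own coordination. One common approach, sketched below, is an advisory lock file created with `create_new`, which fails if the file already exists. The `DirLock` type and the `.writer.lock` file name are hypothetical and not part of ProllyTree. The lock is advisory: it only helps if every writer uses it, and a crashed process can leave a stale lock file behind.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical advisory writer lock for a storage directory.
pub struct DirLock {
    path: PathBuf,
}

impl DirLock {
    /// Succeeds only if no other writer currently holds the lock file.
    pub fn acquire(dir: &Path) -> io::Result<DirLock> {
        fs::create_dir_all(dir)?;
        let path = dir.join(".writer.lock");
        // `create_new` returns an `AlreadyExists` error if the file is present.
        OpenOptions::new().write(true).create_new(true).open(&path)?;
        Ok(DirLock { path })
    }
}

impl Drop for DirLock {
    /// Release the lock by removing the file when the guard goes out of scope.
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}
```

A writer would hold the `DirLock` guard for the duration of its tree modifications and let it drop afterwards.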

RocksDBNodeStorage

Description

RocksDB storage provides a production-ready, high-performance backend using Facebook's RocksDB LSM-tree implementation. It's optimized for ProllyTree's content-addressed, write-heavy workload patterns.

Characteristics

  • Performance: High throughput for both reads and writes
  • Persistence: Durable storage with WAL (Write-Ahead Log)
  • Scalability: Handles millions of nodes efficiently
  • Compression: LZ4 for hot data, Zstd for cold data
  • Caching: Multi-level caching (LRU cache + RocksDB block cache)
  • Compaction: Background cleanup of obsolete data

Architecture

Application
    ↓
LRU Cache (1000 nodes default)
    ↓
RocksDB
├── Write Buffer (128MB)
├── Block Cache (512MB)
├── Bloom Filters (10 bits/key)
└── SST Files (compressed)
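
The layering above, with an application-level LRU cache in front of the database, follows a read-through pattern. The sketch below is a deliberately simple illustration: `LruCache` and `cached_get` are hypothetical names, a plain `HashMap` stands in for RocksDB, and the O(n) recency update would be too slow for a real cache. It is not the cache implementation used by `RocksDBNodeStorage`.

```rust
use std::collections::{HashMap, VecDeque};

/// Tiny LRU cache keyed by node hash (illustrative only).
pub struct LruCache {
    capacity: usize,
    map: HashMap<[u8; 32], Vec<u8>>,
    order: VecDeque<[u8; 32]>, // front = least recently used
}

impl LruCache {
    pub fn new(capacity: usize) -> Self {
        LruCache { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark `key` as the most recently used entry.
    fn touch(&mut self, key: &[u8; 32]) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(*key);
    }

    pub fn get(&mut self, key: &[u8; 32]) -> Option<&Vec<u8>> {
        if self.map.contains_key(key) {
            self.touch(key);
        }
        self.map.get(key)
    }

    /// Insert an entry, evicting the least recently used ones if over capacity.
    pub fn put(&mut self, key: [u8; 32], value: Vec<u8>) {
        self.map.insert(key, value);
        self.touch(&key);
        while self.map.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => { self.map.remove(&oldest); }
                None => break,
            }
        }
    }
}

/// Read-through lookup: try the cache, then fall back to the backing store
/// and remember the result for next time.
pub fn cached_get(
    cache: &mut LruCache,
    backing: &HashMap<[u8; 32], Vec<u8>>,
    key: &[u8; 32],
) -> Option<Vec<u8>> {
    if let Some(hit) = cache.get(key) {
        return Some(hit.clone());
    }
    let loaded = backing.get(key)?.clone();
    cache.put(*key, loaded.clone());
    Some(loaded)
}
```

Content-addressed nodes suit this pattern well: a node stored under a given hash never changes, so cached entries do not need invalidation on update.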

Use Cases

  • Production applications: High-performance persistent storage
  • Large datasets: Millions of nodes and frequent updates
  • Write-heavy workloads: Frequent tree modifications
  • Distributed systems: Building block for distributed storage

Usage Example

use prollytree::storage::RocksDBNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;
use std::path::PathBuf;

// Basic usage
let db_path = PathBuf::from("./rocksdb_data");
let storage = RocksDBNodeStorage::<32>::new(db_path.clone())?;
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

// The constructors below are alternatives to `new`. Open a given path with
// only one of them at a time, since RocksDB locks the database directory.

// Custom cache size (number of nodes in the LRU cache)
let storage = RocksDBNodeStorage::<32>::with_cache_size(db_path.clone(), 5000)?;

// Custom RocksDB options
let mut opts = RocksDBNodeStorage::<32>::default_options();
opts.set_write_buffer_size(256 * 1024 * 1024); // 256MB
let storage = RocksDBNodeStorage::<32>::with_options(db_path, opts)?;

Configuration Options

Default Optimizations

  • Write Buffer: 128MB for batching writes
  • Memory Tables: Up to 4 concurrent memtables
  • Compression: LZ4 for L0-L2, Zstd for bottom levels
  • Block Cache: 512MB for frequently accessed data
  • Bloom Filters: 10 bits per key for faster lookups
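
Two of these defaults lend themselves to quick back-of-the-envelope arithmetic, sketched below. The helper names are hypothetical. `bloom_fpr` uses the textbook formula for a standard Bloom filter with the optimal number of hash functions, so the false-positive rate RocksDB achieves in practice may differ somewhat. `memory_budget_mib` assumes memtable memory is bounded by the write buffer size times the number of memtables (per column family) and ignores the application-level LRU cache and other overhead.

```rust
/// Theoretical false-positive rate of a standard Bloom filter that uses the
/// optimal number of hash functions for the given bits-per-key budget:
/// p = exp(-(bits per key) * ln(2)^2).
pub fn bloom_fpr(bits_per_key: f64) -> f64 {
    let ln2 = std::f64::consts::LN_2;
    (-bits_per_key * ln2 * ln2).exp()
}

/// Rough upper bound, in MiB, on memory held by memtables plus the block cache.
pub fn memory_budget_mib(
    write_buffer_mib: u64,
    max_memtables: u64,
    block_cache_mib: u64,
) -> u64 {
    write_buffer_mib * max_memtables + block_cache_mib
}
```

With the defaults listed above, `bloom_fpr(10.0)` comes out a little under 1%, and `memory_budget_mib(128, 4, 512)` gives 1024 MiB, which is a useful starting point when sizing a host.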

Performance Tuning

use rocksdb::{Options, DBCompressionType, BlockBasedOptions, Cache};

let mut opts = Options::default();

// Increase write buffer for high write throughput
opts.set_write_buffer_size(256 * 1024 * 1024);

// More aggressive compression for storage efficiency
opts.set_compression_type(DBCompressionType::Zstd);

// Larger block cache for read-heavy workloads
let cache = Cache::new_lru_cache(1024 * 1024 * 1024); // 1GB
let mut block_opts = BlockBasedOptions::default();
block_opts.set_block_cache(&cache);
opts.set_block_based_table_factory(&block_opts);

Batch Operations

RocksDB storage supports efficient batch operations:

let nodes = vec![
    (hash1, node1),
    (hash2, node2),
    (hash3, node3),
];

// Atomic batch insert
storage.batch_insert_nodes(nodes)?;

// Atomic batch delete
storage.batch_delete_nodes(&[hash1, hash2])?;

Monitoring and Maintenance

  • Statistics: RocksDB provides detailed performance metrics
  • Compaction: Automatic background compaction
  • Backup: Use RocksDB backup utilities for data safety
  • Tuning: Monitor write amplification and adjust settings

GitNodeStorage

Description

The Git storage backend stores ProllyTree nodes as Git blob objects in a Git repository. This experimental backend is designed for development workflows where you want to leverage Git's content-addressable storage.

⚠️ Important Limitations

Development Use Only: GitNodeStorage should only be used for local development and experimentation. It is not suitable for production use due to several important limitations:

  1. Dangling Objects: ProllyTree nodes are stored as Git blob objects but are not committed to any branch or tag. These objects exist as "dangling" or "unreachable" objects in Git's object database.

  2. Garbage Collection Risk: Git's garbage collector (git gc) will delete these dangling objects during cleanup operations. This can happen:

     • When running git gc manually
     • Automatically during Git operations (push, pull, repack, etc.)
     • When Git's automatic garbage collection triggers

  3. Data Loss: Since the objects are not referenced by any commit, branch, or tag, they will be permanently lost when garbage collected. There is no recovery mechanism.

Characteristics

  • Storage Format: Git blob objects (binary serialized nodes)
  • Content Addressing: Leverages Git's SHA-1 content addressing
  • Persistence: Temporary - objects can be garbage collected
  • Integration: Works with existing Git repositories
  • Caching: LRU cache for performance

Use Cases (Development Only)

  • Git Integration Experiments: Testing Git-based storage concepts
  • Development Workflows: Temporary storage during development
  • Learning: Understanding content-addressable storage
  • Prototyping: Rapid prototyping with Git infrastructure

Usage Example

// Only available with the "git" feature enabled
#[cfg(feature = "git")]
use prollytree::git::GitNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;

// Open the Git repository in the current directory
let repo = gix::open(".")?;
let dataset_dir = std::path::PathBuf::from("./git_data");
let storage = GitNodeStorage::<32>::new(repo, dataset_dir)?;

// ⚠️ WARNING: Data may be lost during git gc!
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);
tree.insert(b"key".to_vec(), b"value".to_vec());

Data Safety Measures

If you must use GitNodeStorage for development, consider these safety measures:

  1. Disable Automatic GC:

    git config gc.auto 0
    git config gc.autopacklimit 0
    

  2. Create Temporary Commits (advanced):

    # Periodically commit to preserve objects
    git add -A
    git commit -m "temp: preserve prolly objects"
    

  3. Use Separate Repository: Create a dedicated Git repository just for ProllyTree storage to avoid conflicts.

Architecture

ProllyTree Node
    ↓
Bincode Serialization
    ↓
Git Blob Object (dangling)
    ↓
Git Object Database
    ↓
⚠️ git gc → Deletion

Storage Backend Comparison

Feature           | InMemory    | File      | RocksDB       | Git
Persistence       | None        | Full      | Full          | Temporary ⚠️
Performance       | Fastest     | Moderate  | High          | Moderate
Scalability       | RAM-limited | Poor      | Excellent     | Poor
Setup Complexity  | None        | None      | Low           | Medium
Production Ready  | No          | Limited   | Yes           | No ⚠️
Concurrent Access | Limited     | No        | Yes           | Limited
Storage Overhead  | Minimal     | High      | Low           | Medium
Backup/Recovery   | N/A         | File copy | RocksDB tools | Git tools

Choosing the Right Backend

Development & Testing

  • Unit Tests: InMemoryNodeStorage
  • Integration Tests: FileNodeStorage or InMemoryNodeStorage
  • Local Development: FileNodeStorage or RocksDBNodeStorage

Production Deployments

  • Small Applications: FileNodeStorage (with careful consideration)
  • High-Performance Applications: RocksDBNodeStorage
  • Distributed Systems: RocksDBNodeStorage as foundation

Experimental

  • Git Integration Research: GitNodeStorage (development only)

Performance Benchmarks

Run the storage comparison benchmarks to understand performance characteristics:

# Compare all available backends
cargo bench --bench storage_bench --features rocksdb_storage

# Run specific benchmark
cargo bench --bench storage_bench storage_insert

Migration Between Backends

Currently, there's no built-in migration tool between storage backends. To migrate:

  1. Export Data: Iterate through the old storage and collect all key-value pairs
  2. Create New Storage: Initialize the target storage backend
  3. Import Data: Insert all data into the new storage
  4. Validate: Verify data integrity after migration

Example migration pattern:

// Export from old storage
let old_tree = ProllyTree::load_from_storage(old_storage, config.clone())?;
let mut data = Vec::new();
// ... collect all key-value pairs

// Import to new storage
let mut new_tree = ProllyTree::new(new_storage, config);
for (key, value) in data {
    new_tree.insert(key, value);
}
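
The export, import, and validate steps can be sketched end to end with a stand-in data structure. In the example below, `KvTree` is a hypothetical alias for a `BTreeMap`, used here only because it can be iterated and queried like a key-value tree; `migrate` and `validate` are illustrative helpers, not ProllyTree APIs.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a tree: any ordered key-value map.
pub type KvTree = BTreeMap<Vec<u8>, Vec<u8>>;

/// Steps 1-3: read every pair from `source` and insert it into `target`.
/// Returns the number of pairs copied.
pub fn migrate(source: &KvTree, target: &mut KvTree) -> usize {
    let mut copied = 0;
    for (key, value) in source.iter() {
        target.insert(key.clone(), value.clone());
        copied += 1;
    }
    copied
}

/// Step 4: return the keys that are missing from `target` or whose
/// values differ from `source`. An empty result means the copy is intact.
pub fn validate(source: &KvTree, target: &KvTree) -> Vec<Vec<u8>> {
    source
        .iter()
        .filter(|(key, value)| target.get(*key) != Some(*value))
        .map(|(key, _)| key.clone())
        .collect()
}
```

Running `validate` after `migrate` gives a concrete list of keys to re-copy if anything went wrong, rather than a single pass/fail answer.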

Best Practices

General

  • Choose the simplest backend that meets your requirements
  • Always benchmark with your specific data patterns
  • Consider backup and recovery procedures
  • Plan for data growth and scaling needs

InMemoryNodeStorage

  • Monitor memory usage to prevent OOM conditions
  • Use for temporary data only
  • Consider data loss implications

FileNodeStorage

  • Ensure adequate disk space and I/O performance
  • Implement application-level locking for concurrent access
  • Perform regular filesystem maintenance and monitoring

RocksDBNodeStorage

  • Monitor RocksDB metrics for performance tuning
  • Configure appropriate cache sizes for your workload
  • Plan for disk space and compaction overhead
  • Use batch operations for bulk updates

GitNodeStorage

  • Never use in production
  • Disable automatic garbage collection during development
  • Use dedicated Git repositories
  • Regularly back up important data to commits
  • Understand that data can be lost without warning

Troubleshooting

Common Issues

OutOfMemory with InMemoryNodeStorage

  • Reduce dataset size or switch to persistent storage
  • Monitor the process's memory usage (for example, resident set size) to catch growth before it exhausts available RAM

Poor Performance with FileNodeStorage

  • Check filesystem performance and available disk space
  • Consider switching to RocksDBNodeStorage for better performance
  • Reduce concurrent access patterns

RocksDB Compilation Issues

  • Ensure proper build tools (cmake, C++ compiler)
  • Check RocksDB system dependencies
  • Use pre-built binaries if available

Git Storage Data Loss

  • This is expected behavior - objects are not committed
  • Disable garbage collection or switch to persistent storage
  • Create periodic commits to preserve important data

For additional help, consult the project documentation or open an issue on the GitHub repository.