lakefs-spec Integration¶
The lakefs-spec project provides a filesystem-like interface to lakeFS, built on the fsspec ecosystem. This enables seamless integration with data science libraries like pandas, dask, and other fsspec-compatible tools.
Overview¶
lakefs-spec implements the fsspec protocol for lakeFS, allowing you to use lakeFS repositories as if they were regular filesystems. This approach is particularly powerful for data science workflows where you want to leverage existing fsspec-compatible libraries without changing your code structure.
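For example, because lakefs-spec registers the lakefs:// protocol with fsspec, any fsspec-aware entry point can open lakeFS objects directly. A minimal sketch, assuming a repository named my-repo with a main branch and an existing CSV file:
import fsspec
# lakefs-spec registers the "lakefs" protocol with fsspec at install time,
# so generic fsspec entry points resolve lakefs:// URLs without extra setup
with fsspec.open('lakefs://my-repo/main/data/file.csv', 'rt') as f:
    print(f.readline())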
Key Features¶
- Filesystem API - Standard filesystem operations (read, write, list, delete, metadata)
- fsspec Compatibility - Works seamlessly with pandas, dask, and 100+ other fsspec libraries
- URI-based Access - Simple lakefs:// URI scheme for intuitive path handling
- Transaction Support - Atomic operations with automatic rollback capabilities
- Data Science Integration - Native support for popular data science workflows
- Path Resolution - Intelligent handling of lakeFS repository, branch, and object paths
When to Use lakefs-spec¶
Choose lakefs-spec when you need:
- Data Science Workflows - Working with pandas, dask, or other data science tools
- Filesystem Semantics - Standard filesystem operations and path handling
- Existing fsspec Code - Migrating from other fsspec filesystems (S3, GCS, Azure), as sketched after this list
- Simple Integration - Minimal code changes for lakeFS adoption
- Library Compatibility - Access to the broad fsspec ecosystem
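For instance, migrating existing pandas code from S3 often amounts to swapping the URI. A sketch with hypothetical bucket, repository, and branch names:
import pandas as pd
# Before: reading from S3 through s3fs
# df = pd.read_csv('s3://my-bucket/data/file.csv')
# After: the same call against a versioned lakeFS path; only the URI scheme
# and the repository/branch/object layout change
df = pd.read_csv('lakefs://my-repo/main/data/file.csv')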
Installation and Setup¶
Installation¶
Install lakefs-spec using pip:
# Basic installation
pip install lakefs-spec
# With optional dependencies for enhanced functionality
pip install lakefs-spec[pandas,dask]
# Development installation with all dependencies
pip install lakefs-spec[dev]
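To confirm the installation succeeded, you can print the installed package version using only the standard library:
# Verify the installation by querying the installed package version
from importlib.metadata import version
print(version("lakefs-spec"))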
System Requirements¶
- Python 3.8 or higher
- lakeFS server 0.90.0 or higher
- Network access to your lakeFS instance
Configuration and Authentication¶
lakefs-spec supports multiple authentication methods to connect to your lakeFS instance.
Method 1: Environment Variables¶
Set environment variables for automatic authentication:
export LAKEFS_ENDPOINT="https://your-lakefs-instance.com"
export LAKEFS_ACCESS_KEY_ID="your-access-key"
export LAKEFS_SECRET_ACCESS_KEY="your-secret-key"
from lakefs_spec import LakeFSFileSystem
# Automatically uses environment variables
fs = LakeFSFileSystem()
Method 2: lakectl Configuration File¶
lakefs-spec automatically discovers credentials from your lakectl configuration:
# ~/.lakectl.yaml
credentials:
  access_key_id: your-access-key
  secret_access_key: your-secret-key
server:
  endpoint_url: https://your-lakefs-instance.com
from lakefs_spec import LakeFSFileSystem
# Automatically uses ~/.lakectl.yaml
fs = LakeFSFileSystem()
Method 3: Direct Configuration¶
Pass configuration directly to the filesystem:
from lakefs_spec import LakeFSFileSystem
# Direct configuration
fs = LakeFSFileSystem(
    host="https://your-lakefs-instance.com",
    username="your-access-key",
    password="your-secret-key"
)
Method 4: Configuration Dictionary¶
Use a configuration dictionary for complex setups:
from lakefs_spec import LakeFSFileSystem
config = {
    "host": "https://your-lakefs-instance.com",
    "username": "your-access-key",
    "password": "your-secret-key",
    "verify_ssl": True,
    "timeout": 30,
    "retries": 3
}
fs = LakeFSFileSystem(**config)
Advanced Configuration Options¶
lakefs-spec supports additional configuration options for production environments:
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem(
    host="https://your-lakefs-instance.com",
    username="your-access-key",
    password="your-secret-key",
    # SSL configuration
    verify_ssl=True,
    ssl_cert_path="/path/to/cert.pem",
    # Connection settings
    timeout=60,
    retries=5,
    retry_delay=1.0,
    # Performance settings
    cache_size=1000,
    block_size=5 * 1024 * 1024,  # 5 MB blocks
    # Logging
    log_level="INFO"
)
URI Patterns and Path Resolution¶
lakefs-spec uses a structured URI scheme to access lakeFS resources:
URI Structure¶
lakefs://<repository>/<branch-or-ref>/<path/to/object>
Path Components¶
- Repository: The lakeFS repository name
- Branch: The branch or commit reference
- Path: The object path within the repository
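To illustrate, a lakefs:// URI splits cleanly into these three components. The following is illustrative string handling, not the library's internal parser:
# Decompose a lakefs:// URI into repository, ref, and object path
uri = 'lakefs://my-repo/main/data/file.csv'
_, rest = uri.split('://', 1)
repository, ref, path = rest.split('/', 2)
print(repository)  # my-repo
print(ref)         # main
print(path)        # data/file.csv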
URI Examples¶
# Basic file access
'lakefs://my-repo/main/data/file.csv'
# Nested directory structure
'lakefs://analytics/feature-branch/datasets/2024/january/sales.parquet'
# Root directory listing
'lakefs://my-repo/main/'
# Specific commit reference
'lakefs://my-repo/c7a4b2f8d9e1a3b5c6d7e8f9/data/snapshot.json'
Path Resolution Examples¶
from lakefs_spec import LakeFSFileSystem
import pandas as pd
fs = LakeFSFileSystem()
# List repositories
repos = fs.ls('lakefs://')
print("Available repositories:", repos)
# List branches in a repository
branches = fs.ls('lakefs://my-repo/')
print("Available branches:", branches)
# List objects in a branch
objects = fs.ls('lakefs://my-repo/main/')
print("Objects in main branch:", objects)
# Check if path exists
exists = fs.exists('lakefs://my-repo/main/data/file.csv')
print("File exists:", exists)
# Get path information
info = fs.info('lakefs://my-repo/main/data/file.csv')
print("File info:", info)
Working with Different Path Types¶
# Absolute paths
df = pd.read_csv('lakefs://repo/main/data/file.csv')
# Transaction-scoped paths: lakefs-spec transactions run on an ephemeral
# branch created from the base branch; address it explicitly via tx.branch.id
fs = LakeFSFileSystem()
with fs.transaction('repo', 'feature-branch') as tx:
    df.to_csv(f'lakefs://repo/{tx.branch.id}/data/file.csv', index=False)
    tx.commit(message='Add data/file.csv')
Quick Start Example¶
Here's a complete example showing lakefs-spec setup and basic usage:
import pandas as pd
from lakefs_spec import LakeFSFileSystem
# Initialize filesystem (uses environment variables or ~/.lakectl.yaml)
fs = LakeFSFileSystem()
# Create sample data
data = {
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='h'),
    'value': range(100),
    'category': ['A', 'B'] * 50
}
df = pd.DataFrame(data)
# Write data to lakeFS
df.to_parquet('lakefs://my-repo/main/data/sample.parquet', index=False)
# Read data back
df_loaded = pd.read_parquet('lakefs://my-repo/main/data/sample.parquet')
print(f"Loaded {len(df_loaded)} rows")
# List files in the data directory
files = fs.ls('lakefs://my-repo/main/data/')
print("Files in data directory:", files)
# Check file metadata
info = fs.info('lakefs://my-repo/main/data/sample.parquet')
print("File size:", info['size'], "bytes")
Filesystem API Concepts¶
lakefs-spec implements standard filesystem operations that work consistently across different storage backends:
Core Operations¶
- Read Operations: open(), cat(), cat_file()
- Write Operations: open() with write mode, pipe()
- Directory Operations: ls(), mkdir(), rmdir()
- File Operations: rm(), mv(), cp()
- Metadata Operations: info(), exists(), size()
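A brief sketch of a few of these operations, assuming a repository my-repo with existing data on its main branch:
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()
# Read an object's full contents as bytes
data = fs.cat('lakefs://my-repo/main/data/file.csv')
# Stream the first line through a file-like object
with fs.open('lakefs://my-repo/main/data/file.csv', 'rt') as f:
    header = f.readline()
# Copy an object within the branch, then remove the copy
fs.cp('lakefs://my-repo/main/data/file.csv',
      'lakefs://my-repo/main/data/file-copy.csv')
fs.rm('lakefs://my-repo/main/data/file-copy.csv')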
Transaction Support¶
lakefs-spec provides atomic operations through transactions:
# Atomic multi-file operation: the transaction writes to an ephemeral
# branch created from the base branch
with fs.transaction('repo', 'branch') as tx:
    df1.to_parquet(f'lakefs://repo/{tx.branch.id}/data/file1.parquet')
    df2.to_parquet(f'lakefs://repo/{tx.branch.id}/data/file2.parquet')
    tx.commit(message='Add file1 and file2')
# On success the transaction branch is merged into the base branch;
# on any error it is discarded, so partial writes never reach the base branch
Documentation Sections¶
Explore detailed documentation for specific use cases:
- Filesystem API - Complete filesystem operations reference
- Data Science Integrations - pandas, dask, and other library examples
- Transactions - Atomic operation patterns and best practices
Troubleshooting¶
Common Issues¶
Authentication Errors
# Verify your configuration
fs = LakeFSFileSystem()
try:
    repos = fs.ls('lakefs://')
    print("Authentication successful")
except Exception as e:
    print(f"Authentication failed: {e}")
Path Resolution Issues
# Check if repository and branch exist
if fs.exists('lakefs://my-repo/'):
    print("Repository exists")
    if fs.exists('lakefs://my-repo/main/'):
        print("Branch exists")
    else:
        print("Branch not found")
else:
    print("Repository not found")
Connection Issues
# Test connection with timeout
fs = LakeFSFileSystem(timeout=10)
try:
    fs.ls('lakefs://')
    print("Connection successful")
except TimeoutError:
    print("Connection timeout - check network and endpoint")
Next Steps¶
- Learn about filesystem operations for detailed API usage
- Explore data science integrations for pandas, dask, and other libraries
- Understand transaction patterns for atomic operations
- Check the lakefs-spec documentation for advanced features
See Also¶
SDK Selection and Comparison:
- Python SDK Overview - Compare all Python SDK options
- SDK Decision Matrix - Choose the right SDK for your use case
- API Feature Comparison - Detailed feature comparison across SDKs

lakefs-spec Documentation:
- Filesystem API Reference - Complete filesystem operations guide
- Data Science Integrations - pandas, dask, and other library examples
- Transaction Patterns - Atomic operations and rollback scenarios

Alternative SDK Options:
- High-Level SDK - Simplified Python interface with transactions
- High-Level SDK Quickstart - Object-oriented interface
- Generated SDK - Direct API access for custom operations
- Boto3 Integration - S3-compatible operations

Setup and Configuration:
- Installation Guide - Complete setup instructions for all SDKs
- Authentication Methods - All credential configuration options
- Best Practices - Production configuration guidance

Data Science Workflows:
- Data Science Tutorial - End-to-end data analysis workflow
- pandas Integration Examples - Working with DataFrames
- Jupyter Notebook Patterns - Interactive analysis

Learning Resources:
- ETL Pipeline Tutorial - Building data pipelines with filesystem operations
- ML Experiment Tracking - Model versioning workflows
- fsspec Ecosystem Guide - Compatible libraries and tools

Reference Materials:
- URI Pattern Guide - Understanding lakefs:// URIs
- Error Handling - Common issues and solutions
- Performance Optimization - Optimize filesystem operations

External Resources:
- lakefs-spec Documentation - Official lakefs-spec documentation
- fsspec Documentation - Core filesystem specification
- pandas I/O Documentation - pandas file operations
- dask DataFrame Documentation - dask integration patterns