API Comparison
This page compares the three Python SDK options for lakeFS (the High-Level SDK, the Generated SDK, and lakefs-spec) by features, performance characteristics, and trade-offs, so you can choose the one that fits your use case.
Quick Decision Matrix
| Use Case | Recommended SDK | Alternative |
|---|---|---|
| Data Science & Analytics | lakefs-spec | High-Level SDK |
| Production ETL Pipelines | High-Level SDK | Generated SDK |
| Custom API Operations | Generated SDK | High-Level SDK |
| Jupyter Notebooks | lakefs-spec | High-Level SDK |
| ML Experiment Tracking | High-Level SDK | lakefs-spec |
| Large File Processing | lakefs-spec | High-Level SDK |
| Microservices Integration | Generated SDK | High-Level SDK |
Feature Comparison Matrix
Core Repository Operations
| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Repository Management | |||
| Create Repository | ✅ Full | ✅ Full | ❌ None |
| Delete Repository | ✅ Full | ✅ Full | ❌ None |
| List Repositories | ✅ Full | ✅ Full | ❌ None |
| Repository Metadata | ✅ Full | ✅ Full | ❌ None |
| Branch Operations | |||
| Create Branch | ✅ Full | ✅ Full | ✅ Limited |
| Delete Branch | ✅ Full | ✅ Full | ✅ Limited |
| List Branches | ✅ Full | ✅ Full | ✅ Limited |
| Branch Protection | ✅ Full | ✅ Full | ❌ None |
| Commit Operations | |||
| Create Commit | ✅ Full | ✅ Full | ✅ Full |
| List Commits | ✅ Full | ✅ Full | ✅ Limited |
| Commit Metadata | ✅ Full | ✅ Full | ✅ Limited |
| Cherry Pick | ✅ Full | ✅ Full | ❌ None |
Object Operations
| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Basic Operations | |||
| Upload Object | ✅ Full | ✅ Full | ✅ Full |
| Download Object | ✅ Full | ✅ Full | ✅ Full |
| Delete Object | ✅ Full | ✅ Full | ✅ Full |
| List Objects | ✅ Full | ✅ Full | ✅ Full |
| Advanced Operations | |||
| Streaming I/O | ✅ Full | 🔶 Manual | ✅ Full |
| Batch Operations | ✅ Full | 🔶 Manual | ✅ Full |
| Object Metadata | ✅ Full | ✅ Full | ✅ Full |
| Presigned URLs | ✅ Full | ✅ Full | ❌ None |
| Multipart Upload | ✅ Full | ✅ Full | ✅ Full |
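
The table marks batch operations as manual for the Generated SDK. One way to build them is to fan single-object calls out over a thread pool. The sketch below is a generic pattern, not part of any lakeFS SDK: `upload_one` is a hypothetical callable standing in for whatever single-object upload call you use, and `max_workers=8` is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def batch_upload(
    items: Iterable[Tuple[str, bytes]],
    upload_one: Callable[[str, bytes], None],
    max_workers: int = 8,
) -> Dict[str, Exception]:
    """Upload (path, data) pairs concurrently and return failures keyed by path.

    `upload_one` is a hypothetical stand-in for a single-object upload call.
    """
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be reported by name
        futures = {pool.submit(upload_one, path, data): path for path, data in items}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
            except Exception as exc:  # collect the error instead of aborting the batch
                failures[path] = exc
    return failures
```

Collecting failures rather than raising on the first one lets the caller decide whether to retry individual paths or discard the whole batch.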
Data Management Features
| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Transactions | |||
| Atomic Operations | ✅ Full | 🔶 Manual | ✅ Full |
| Rollback Support | ✅ Full | 🔶 Manual | ✅ Full |
| Context Managers | ✅ Full | ❌ None | ✅ Full |
| Import/Export | |||
| Data Import | ✅ Full | ✅ Full | ❌ None |
| Import Status | ✅ Full | ✅ Full | ❌ None |
| Export Operations | ✅ Full | ✅ Full | ❌ None |
| Merge Operations | |||
| Branch Merging | ✅ Full | ✅ Full | ❌ None |
| Conflict Resolution | ✅ Full | ✅ Full | ❌ None |
| Merge Strategies | ✅ Full | ✅ Full | ❌ None |
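
Where the table lists atomic operations and rollback as manual for the Generated SDK, a common pattern is to stage changes on a short-lived branch and merge only on success. The sketch below illustrates that pattern with a hypothetical `BranchOps` interface; `create_branch`, `commit`, `merge`, and `delete_branch` are illustrative names, not SDK calls, and each would need to be mapped to the API calls you actually use.

```python
import uuid
from contextlib import contextmanager
from typing import Iterator, Protocol


class BranchOps(Protocol):
    """Hypothetical minimal interface; adapt it to your SDK of choice."""

    def create_branch(self, name: str, source: str) -> None: ...
    def commit(self, branch: str, message: str) -> None: ...
    def merge(self, source: str, destination: str) -> None: ...
    def delete_branch(self, name: str) -> None: ...


@contextmanager
def manual_transaction(ops: BranchOps, target: str, message: str) -> Iterator[str]:
    """Yield a temporary branch name; merge into `target` only if the body succeeds."""
    tmp = f"tx-{uuid.uuid4().hex[:8]}"
    ops.create_branch(tmp, source=target)
    try:
        yield tmp                  # the caller writes objects to the temporary branch
        ops.commit(tmp, message)   # reached only if the body did not raise
        ops.merge(tmp, target)
    finally:
        ops.delete_branch(tmp)     # always clean up; after a failure this is the rollback
```

If the body raises, nothing is committed or merged, and deleting the temporary branch discards the staged changes.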
Integration Capabilities
| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Data Science Libraries | |||
| Pandas Integration | ✅ Full | 🔶 Manual | ✅ Native |
| Dask Integration | ✅ Full | 🔶 Manual | ✅ Native |
| PyArrow Integration | ✅ Full | 🔶 Manual | ✅ Native |
| File System Interface | |||
| fsspec Compatibility | 🔶 Limited | ❌ None | ✅ Native |
| Path-like Operations | ✅ Full | 🔶 Manual | ✅ Native |
| Glob Patterns | ✅ Full | 🔶 Manual | ✅ Native |
Performance Characteristics
Throughput Comparison
| Operation Type | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Small Files (< 1MB) | |||
| Single Upload | Good | Good | Excellent |
| Batch Upload | Excellent | Good | Excellent |
| Single Download | Good | Good | Excellent |
| Batch Download | Excellent | Good | Excellent |
| Large Files (> 100MB) | |||
| Streaming Upload | Excellent | Good | Excellent |
| Streaming Download | Excellent | Good | Excellent |
| Multipart Upload | Excellent | Good | Excellent |
| Metadata Operations | |||
| List Objects | Good | Good | Excellent |
| Object Stats | Good | Good | Excellent |
| Branch Operations | Excellent | Good | Good |
Memory Usage
| SDK | Memory Efficiency | Notes |
|---|---|---|
| High-Level SDK | Good | Optimized for common patterns, connection pooling |
| Generated SDK | Fair | Direct API access, manual optimization needed |
| lakefs-spec | Excellent | Designed for large datasets, streaming-first |
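
The streaming-first note matters most for large objects: memory use stays flat when data moves in fixed-size chunks instead of being read whole. A minimal, SDK-agnostic sketch of that pattern follows. `stream_copy` and the 8 MiB chunk size are illustrative choices, and the source and destination can be any binary file-like objects.

```python
from typing import BinaryIO

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB; an arbitrary choice, tune it for your workload


def stream_copy(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst in fixed-size chunks and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:          # an empty read signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

At most one chunk is held in memory at a time, so peak usage depends on `chunk_size` rather than on the object size.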
Latency Characteristics
| Operation | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Connection Setup | Fast | Fast | Fast |
| Authentication | Fast | Fast | Fast |
| First Request | Medium | Medium | Fast |
| Subsequent Requests | Fast | Fast | Fast |
| Batch Operations | Fast | Medium | Fast |
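
The ratings in these tables are qualitative, so it may be worth measuring against your own workload. Below is a small timing harness that is not taken from any lakeFS SDK; `operation` is a hypothetical zero-argument callable wrapping whichever call you want to measure.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(operation: Callable[[], object], repeats: int = 10) -> Dict[str, float]:
    """Time `operation` and report first-call, median, and max latency in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return {
        "first": samples[0],  # may include connection setup and cold caches
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Comparing `first` with `median` gives a rough view of the difference between the first request and subsequent requests.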
Trade-offs Analysis
High-Level SDK
Strengths:
- Comprehensive feature set with advanced capabilities
- Built-in transaction support and error handling
- Optimized for common lakeFS workflows
- Excellent documentation and examples
- Connection pooling and performance optimizations

Weaknesses:
- Additional abstraction layer may hide some API details
- Larger dependency footprint
- May not expose all Generated SDK capabilities immediately

Best For:
- Production applications requiring robust error handling
- Complex workflows with transactions
- Teams wanting comprehensive lakeFS integration
- Applications requiring advanced features like imports/exports
Generated SDK
Strengths:
- Direct access to all lakeFS API capabilities
- Minimal abstraction, maximum control
- Automatically updated with API changes
- Smaller dependency footprint
- Full async support where available

Weaknesses:
- Requires more boilerplate code
- Manual error handling and retry logic
- No built-in transaction support
- Less optimized for common patterns

Best For:
- Custom integrations requiring specific API access
- Microservices with minimal dependencies
- Applications needing fine-grained control
- Integration with existing API client patterns
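
The manual retry logic listed under the weaknesses can be as small as one wrapper. This is a generic sketch of exponential backoff, not an SDK feature: `with_retries`, its defaults, and the choice of which exceptions count as retryable are all assumptions to adapt.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on `retryable` errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)   # delay doubles after each failure
    raise AssertionError("unreachable")        # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real delays. Errors outside `retryable` propagate immediately.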
lakefs-spec
Strengths:
- Native fsspec integration for data science workflows
- Excellent performance for file operations
- Seamless integration with pandas, dask, and other libraries
- Optimized for large dataset operations
- Familiar filesystem interface

Weaknesses:
- Limited repository management capabilities
- No direct access to advanced lakeFS features
- Focused primarily on file operations
- Third-party maintenance dependency

Best For:
- Data science and analytics workflows
- Jupyter notebook environments
- Large dataset processing
- Integration with existing fsspec-based tools
- Teams familiar with filesystem interfaces
Decision Guidelines
Choose High-Level SDK When:
- Building production applications with complex lakeFS workflows
- Need transaction support and advanced error handling
- Want comprehensive feature access with minimal code
- Team prefers high-level abstractions
- Building ETL pipelines or data management systems
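```python
# Example: complex workflow with a transaction
import lakefs
from lakefs.client import Client

client = Client()  # with no arguments, credentials come from the environment or lakectl config
repo = lakefs.repository("my-repo", client=client)

# data1 and data2 are placeholders for the file contents
data1 = "id,value\n1,10\n"
data2 = "id,value\n2,20\n"

with repo.branch("feature").transact(commit_message="Add data files") as tx:
    # Both uploads are part of one atomic transaction
    tx.object("data/file1.csv").upload(data=data1)
    tx.object("data/file2.csv").upload(data=data2)
# Commits on a clean exit; discards the changes if the block raises
```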
Choose Generated SDK When:
- Need access to specific API endpoints not covered by High-Level SDK
- Building microservices with minimal dependencies
- Require fine-grained control over API interactions
- Integrating with existing API client patterns
- Need async support for specific operations
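```python
# Example: direct API access for custom operations
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

config = lakefs_sdk.Configuration(
    host="http://localhost:8000",
    username="<access-key-id>",      # placeholder credentials
    password="<secret-access-key>",
)
client = LakeFSClient(config)

# Direct API call with full control over filtering and pagination
response = client.repositories_api.list_repositories(
    prefix="project-",
    amount=100,
    after="cursor",  # placeholder: pagination cursor from the previous page
)
```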
Choose lakefs-spec When:
- Working primarily with data science libraries
- Processing large datasets with streaming requirements
- Using Jupyter notebooks for analysis
- Need filesystem-like interface
- Integrating with existing fsspec-based workflows
```python
# Example: data science workflow
import pandas as pd
import lakefs_spec

# Direct pandas integration
df = pd.read_parquet("lakefs://repo/branch/data/dataset.parquet")
processed_df = df.groupby("category").sum()
processed_df.to_parquet("lakefs://repo/branch/results/summary.parquet")
```
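
The `lakefs://` paths in the example follow a `lakefs://<repository>/<ref>/<path>` layout. When a script needs the parts separately, they can be split with plain string handling. `split_lakefs_uri` and `LakeFSPath` are hypothetical helpers for illustration, not part of lakefs-spec.

```python
from typing import NamedTuple

PREFIX = "lakefs://"


class LakeFSPath(NamedTuple):
    repository: str
    ref: str
    path: str


def split_lakefs_uri(uri: str) -> LakeFSPath:
    """Split lakefs://<repository>/<ref>/<path> into its three parts."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    repository, _, rest = uri[len(PREFIX):].partition("/")
    ref, _, path = rest.partition("/")
    if not repository or not ref:
        raise ValueError(f"repository and ref are required: {uri!r}")
    return LakeFSPath(repository, ref, path)  # path is "" when the URI names only a ref
```

For instance, `split_lakefs_uri("lakefs://repo/branch/data/dataset.parquet").ref` gives the branch name, which is handy when the same script also needs to pass the branch to another SDK.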
Migration Paths
From File Systems to lakeFS
- Start with lakefs-spec: Familiar filesystem interface
- Add High-Level SDK: For repository management and advanced features
- Consider Generated SDK: For custom integrations and specific API needs
Between lakeFS SDKs
- Generated → High-Level: Gradual migration, can access Generated SDK through High-Level
- High-Level → Generated: For specific API access, use the `client.sdk_client` property
- Any SDK → lakefs-spec: For data science workflows; the two can run in parallel
See Also
SDK Selection and Setup:
- Python SDK Overview - Complete SDK overview and selection guide
- SDK Decision Matrix - Interactive decision guide
- Getting Started Guide - Installation and setup for all SDKs
- Authentication Methods - Credential configuration

SDK-Specific Documentation:
- High-Level SDK Overview - Detailed High-Level SDK documentation
- High-Level SDK Quickstart - Basic operations and examples
- Generated SDK Overview - Direct API access patterns
- Generated SDK Examples - Common usage patterns
- lakefs-spec Overview - Filesystem interface documentation
- lakefs-spec Integrations - Data science library examples

Feature-Specific Guides:
- Transaction Patterns - Atomic operations across SDKs
- Object I/O Operations - File handling patterns
- Data Import/Export - Bulk data operations
- Filesystem Operations - File-like operations

Learning Resources:
- Data Science Tutorial - End-to-end workflow examples
- ETL Pipeline Tutorial - Building data pipelines
- ML Experiment Tracking - Model versioning workflows

Reference Materials:
- Best Practices - Production deployment guidelines
- Performance Optimization - SDK performance tuning
- Troubleshooting - Common issues and solutions
- Error Handling Patterns - Exception handling strategies

Migration Guides:
- SDK Migration Strategies - Moving between SDKs
- Legacy Integration - Integrate with existing systems

External Resources:
- High-Level SDK API Reference - Complete API documentation
- Generated SDK API Reference - Auto-generated API docs
- lakefs-spec Documentation - Third-party filesystem interface