API Comparison¶
This page compares the three Python SDK options for lakeFS (the High-Level SDK, the Generated SDK, and lakefs-spec) by feature coverage, performance characteristics, and trade-offs, to help you choose the right one for your use case.
Quick Decision Matrix¶
Use Case | Recommended SDK | Alternative |
---|---|---|
Data Science & Analytics | lakefs-spec | High-Level SDK |
Production ETL Pipelines | High-Level SDK | Generated SDK |
Custom API Operations | Generated SDK | High-Level SDK |
Jupyter Notebooks | lakefs-spec | High-Level SDK |
ML Experiment Tracking | High-Level SDK | lakefs-spec |
Large File Processing | lakefs-spec | High-Level SDK |
Microservices Integration | Generated SDK | High-Level SDK |
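For scripted tooling, such as a project scaffolder that picks a default dependency, the matrix can be encoded as a plain lookup. The sketch below copies the rows of the table verbatim; the names `DECISION_MATRIX` and `recommend_sdk` are hypothetical and not part of any lakeFS package.

```python
from typing import Dict, Tuple

# Rows of the Quick Decision Matrix: use case -> (recommended, alternative).
DECISION_MATRIX: Dict[str, Tuple[str, str]] = {
    "Data Science & Analytics": ("lakefs-spec", "High-Level SDK"),
    "Production ETL Pipelines": ("High-Level SDK", "Generated SDK"),
    "Custom API Operations": ("Generated SDK", "High-Level SDK"),
    "Jupyter Notebooks": ("lakefs-spec", "High-Level SDK"),
    "ML Experiment Tracking": ("High-Level SDK", "lakefs-spec"),
    "Large File Processing": ("lakefs-spec", "High-Level SDK"),
    "Microservices Integration": ("Generated SDK", "High-Level SDK"),
}


def recommend_sdk(use_case: str) -> Tuple[str, str]:
    """Return (recommended, alternative) for a use case from the matrix."""
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        raise ValueError(f"Unknown use case: {use_case!r}") from None
```

Calling `recommend_sdk("Production ETL Pipelines")` returns the recommended SDK first and the alternative second; an unlisted use case raises `ValueError` rather than guessing.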
Feature Comparison Matrix¶
Core Repository Operations¶
Feature | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Repository Management | |||
Create Repository | ✅ Full | ✅ Full | ❌ None |
Delete Repository | ✅ Full | ✅ Full | ❌ None |
List Repositories | ✅ Full | ✅ Full | ❌ None |
Repository Metadata | ✅ Full | ✅ Full | ❌ None |
Branch Operations | |||
Create Branch | ✅ Full | ✅ Full | ✅ Limited |
Delete Branch | ✅ Full | ✅ Full | ✅ Limited |
List Branches | ✅ Full | ✅ Full | ✅ Limited |
Branch Protection | ✅ Full | ✅ Full | ❌ None |
Commit Operations | |||
Create Commit | ✅ Full | ✅ Full | ✅ Full |
List Commits | ✅ Full | ✅ Full | ✅ Limited |
Commit Metadata | ✅ Full | ✅ Full | ✅ Limited |
Cherry Pick | ✅ Full | ✅ Full | ❌ None |
Object Operations¶
Feature | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Basic Operations | |||
Upload Object | ✅ Full | ✅ Full | ✅ Full |
Download Object | ✅ Full | ✅ Full | ✅ Full |
Delete Object | ✅ Full | ✅ Full | ✅ Full |
List Objects | ✅ Full | ✅ Full | ✅ Full |
Advanced Operations | |||
Streaming I/O | ✅ Full | 🔶 Manual | ✅ Full |
Batch Operations | ✅ Full | 🔶 Manual | ✅ Full |
Object Metadata | ✅ Full | ✅ Full | ✅ Full |
Presigned URLs | ✅ Full | ✅ Full | ❌ None |
Multipart Upload | ✅ Full | ✅ Full | ✅ Full |
Data Management Features¶
Feature | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Transactions | |||
Atomic Operations | ✅ Full | 🔶 Manual | ✅ Full |
Rollback Support | ✅ Full | 🔶 Manual | ✅ Full |
Context Managers | ✅ Full | ❌ None | ✅ Full |
Import/Export | |||
Data Import | ✅ Full | ✅ Full | ❌ None |
Import Status | ✅ Full | ✅ Full | ❌ None |
Export Operations | ✅ Full | ✅ Full | ❌ None |
Merge Operations | |||
Branch Merging | ✅ Full | ✅ Full | ❌ None |
Conflict Resolution | ✅ Full | ✅ Full | ❌ None |
Merge Strategies | ✅ Full | ✅ Full | ❌ None |
Integration Capabilities¶
Feature | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Data Science Libraries | |||
Pandas Integration | ✅ Full | 🔶 Manual | ✅ Native |
Dask Integration | ✅ Full | 🔶 Manual | ✅ Native |
PyArrow Integration | ✅ Full | 🔶 Manual | ✅ Native |
File System Interface | |||
fsspec Compatibility | 🔶 Limited | ❌ None | ✅ Native |
Path-like Operations | ✅ Full | 🔶 Manual | ✅ Native |
Glob Patterns | ✅ Full | 🔶 Manual | ✅ Native |
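When a project has a fixed list of required features, the matrices can be filtered mechanically. The following is a minimal sketch that encodes only a handful of rows from the tables in this section; `SUPPORT` and `sdks_supporting` are hypothetical names, and the support levels are copied from the tables rather than probed from the libraries.

```python
from typing import Dict, Iterable, List

SDKS = ["High-Level SDK", "Generated SDK", "lakefs-spec"]

# A subset of the feature matrices: feature -> support level per SDK,
# listed in the same order as SDKS.
SUPPORT: Dict[str, List[str]] = {
    "Create Repository": ["full", "full", "none"],
    "Streaming I/O": ["full", "manual", "full"],
    "Presigned URLs": ["full", "full", "none"],
    "Context Managers": ["full", "none", "full"],
    "Branch Merging": ["full", "full", "none"],
    "fsspec Compatibility": ["limited", "none", "native"],
}


def sdks_supporting(
    features: Iterable[str],
    acceptable: Iterable[str] = ("full", "native"),
) -> List[str]:
    """Return the SDKs whose support level is acceptable for every feature."""
    levels_ok = set(acceptable)
    required = list(features)
    return [
        sdk
        for index, sdk in enumerate(SDKS)
        if all(SUPPORT[feature][index] in levels_ok for feature in required)
    ]
```

Passing a wider `acceptable` set, for example one that includes `"manual"` or `"limited"`, shows which SDKs remain candidates if you are willing to write the missing glue code yourself.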
Performance Characteristics¶
Throughput Comparison¶
Operation Type | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Small Files (< 1MB) | |||
Single Upload | Good | Good | Excellent |
Batch Upload | Excellent | Good | Excellent |
Single Download | Good | Good | Excellent |
Batch Download | Excellent | Good | Excellent |
Large Files (> 100MB) | |||
Streaming Upload | Excellent | Good | Excellent |
Streaming Download | Excellent | Good | Excellent |
Multipart Upload | Excellent | Good | Excellent |
Metadata Operations | |||
List Objects | Good | Good | Excellent |
Object Stats | Good | Good | Excellent |
Branch Operations | Excellent | Good | Good |
Memory Usage¶
SDK | Memory Efficiency | Notes |
---|---|---|
High-Level SDK | Good | Optimized for common patterns, connection pooling |
Generated SDK | Fair | Direct API access, manual optimization needed |
lakefs-spec | Excellent | Designed for large datasets, streaming-first |
Latency Characteristics¶
Operation | High-Level SDK | Generated SDK | lakefs-spec |
---|---|---|---|
Connection Setup | Fast | Fast | Fast |
Authentication | Fast | Fast | Fast |
First Request | Medium | Medium | Fast |
Subsequent Requests | Fast | Fast | Fast |
Batch Operations | Fast | Medium | Fast |
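The ratings in these tables are qualitative, so it is worth measuring your own workload before committing to an SDK. The harness below is a small sketch using only the standard library: it times a zero-argument callable several times and reports the first call separately from the rest, mirroring the "First Request" and "Subsequent Requests" rows. The name `time_calls` is hypothetical; wrap whichever SDK call you want to compare in a lambda.

```python
import statistics
import time
from typing import Callable, Dict


def time_calls(fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Time fn() `repeats` times; report first call and median of the rest."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "first": durations[0],
        "subsequent_median": statistics.median(durations[1:]),
    }
```

Reporting the first call on its own keeps one-off costs such as connection setup from skewing the steady-state figure.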
Trade-offs Analysis¶
High-Level SDK¶
Strengths:
- Comprehensive feature set with advanced capabilities
- Built-in transaction support and error handling
- Optimized for common lakeFS workflows
- Excellent documentation and examples
- Connection pooling and performance optimizations

Weaknesses:
- Additional abstraction layer may hide some API details
- Larger dependency footprint
- May not expose all Generated SDK capabilities immediately

Best For:
- Production applications requiring robust error handling
- Complex workflows with transactions
- Teams wanting comprehensive lakeFS integration
- Applications requiring advanced features like imports/exports
Generated SDK¶
Strengths:
- Direct access to all lakeFS API capabilities
- Minimal abstraction, maximum control
- Automatically updated with API changes
- Smaller dependency footprint
- Full async support where available

Weaknesses:
- Requires more boilerplate code
- Manual error handling and retry logic
- No built-in transaction support
- Less optimized for common patterns

Best For:
- Custom integrations requiring specific API access
- Microservices with minimal dependencies
- Applications needing fine-grained control
- Integration with existing API client patterns
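Because the Generated SDK leaves retry logic to the caller, a small wrapper is a common piece of boilerplate. The sketch below shows one generic approach, exponential backoff around any zero-argument callable. It is not part of either SDK; `call_with_retry` and its parameters are hypothetical, and which exception types are worth retrying depends on your client and deployment.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the wrapper can be exercised in tests without waiting; in production code the default `time.sleep` is used.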
lakefs-spec¶
Strengths:
- Native fsspec integration for data science workflows
- Excellent performance for file operations
- Seamless integration with pandas, dask, and other libraries
- Optimized for large dataset operations
- Familiar filesystem interface

Weaknesses:
- Limited repository management capabilities
- No direct access to advanced lakeFS features
- Focused primarily on file operations
- Third-party maintenance dependency

Best For:
- Data science and analytics workflows
- Jupyter notebook environments
- Large dataset processing
- Integration with existing fsspec-based tools
- Teams familiar with filesystem interfaces
Decision Guidelines¶
Choose High-Level SDK When:¶
- Building production applications with complex lakeFS workflows
- Need transaction support and advanced error handling
- Want comprehensive feature access with minimal code
- Team prefers high-level abstractions
- Building ETL pipelines or data management systems
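```python
# Example: complex workflow with a transaction
import lakefs

# Uses the default client, configured from the environment or ~/.lakectl.yaml
repo = lakefs.repository("my-repo")

# data1 and data2 are placeholders for the content (str or bytes) to upload
with repo.branch("feature").transact(commit_message="Add data files") as tx:
    # Multiple operations in one atomic transaction
    tx.object("data/file1.csv").upload(data=data1)
    tx.object("data/file2.csv").upload(data=data2)
# On success the changes are committed to "feature";
# if the block raises, they are discarded
```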
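The commit-or-rollback behaviour of a transaction block can be illustrated without a lakeFS server. The sketch below uses an in-memory dictionary as a stand-in for a branch and is not how the SDK implements transactions; `InMemoryBranch` and `atomic` are hypothetical names used only to show the context-manager pattern.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class InMemoryBranch:
    """Hypothetical stand-in for a branch: maps object path -> content."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    @contextmanager
    def atomic(self) -> Iterator[Dict[str, bytes]]:
        # Stage changes on a copy so the branch is untouched until success.
        staged = dict(self.objects)
        # If the with-body raises, the exception propagates from this yield
        # and the assignment below never runs, which acts as the rollback.
        yield staged
        self.objects = staged  # commit: publish all staged changes at once
```

With this stand-in, two writes inside `with branch.atomic() as tx:` become visible together when the block exits normally, and neither becomes visible if the block raises.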
Choose Generated SDK When:¶
- Need access to specific API endpoints not covered by High-Level SDK
- Building microservices with minimal dependencies
- Require fine-grained control over API interactions
- Integrating with existing API client patterns
- Need async support for specific operations
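```python
# Example: direct API access for custom operations
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

# Credentials are placeholders; supply your own access key pair
config = lakefs_sdk.Configuration(
    host="http://localhost:8000",
    username="<access-key-id>",
    password="<secret-access-key>",
)
client = LakeFSClient(config)

# Direct API call with full control over filtering and paging
response = client.repositories_api.list_repositories(
    prefix="project-",
    amount=100,
    after="cursor",  # placeholder: listing resumes after this value
)
```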
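The `amount` and `after` parameters imply cursor-based paging, which the Generated SDK leaves to the caller. The loop below sketches that pattern against a fake page source so it runs without a server. `iterate_all` and `fake_fetch` are hypothetical, and the `(results, next_cursor)` tuple is a simplification: check the API reference for how the real response reports whether more results exist.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (results, cursor for the next page or None when exhausted).
Page = Tuple[List[str], Optional[str]]


def iterate_all(fetch_page: Callable[[str], Page]) -> Iterator[str]:
    """Yield every result by following cursors until none is returned."""
    cursor = ""  # an empty cursor requests the first page
    while True:
        results, next_cursor = fetch_page(cursor)
        yield from results
        if not next_cursor:
            return
        cursor = next_cursor


# Fake data source standing in for a list endpoint, two items per page.
_REPOS = ["project-a", "project-b", "project-c"]


def fake_fetch(after: str, amount: int = 2) -> Page:
    remaining = [name for name in _REPOS if name > after]
    page = remaining[:amount]
    has_more = len(remaining) > amount
    return page, (page[-1] if has_more else None)
```

In real use, `fetch_page` would wrap the SDK call and translate its response into the tuple shape expected by the loop.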
Choose lakefs-spec When:¶
- Working primarily with data science libraries
- Processing large datasets with streaming requirements
- Using Jupyter notebooks for analysis
- Need filesystem-like interface
- Integrating with existing fsspec-based workflows
```python
# Example: data science workflow
import pandas as pd
import lakefs_spec

# Direct pandas integration
df = pd.read_parquet("lakefs://repo/branch/data/dataset.parquet")
processed_df = df.groupby("category").sum()
processed_df.to_parquet("lakefs://repo/branch/results/summary.parquet")
```
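The URIs in this example follow the layout `lakefs://<repository>/<ref>/<path>`. When mixing lakefs-spec with one of the other SDKs it can help to split a URI into those parts. The helper below is a sketch built on the standard library only; `parse_lakefs_uri` and `LakeFSPath` are hypothetical names, not functions exported by lakefs-spec.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class LakeFSPath(NamedTuple):
    repository: str
    ref: str   # branch name, tag, or commit ID
    path: str  # object path inside the ref; may be empty


def parse_lakefs_uri(uri: str) -> LakeFSPath:
    """Split lakefs://repo/ref/path into its three components."""
    parts = urlsplit(uri)
    if parts.scheme != "lakefs" or not parts.netloc:
        raise ValueError(f"Not a lakefs:// URI: {uri!r}")
    # parts.path looks like "/branch/data/file"; the first segment is the ref.
    ref, _, path = parts.path.lstrip("/").partition("/")
    if not ref:
        raise ValueError(f"URI has no ref component: {uri!r}")
    return LakeFSPath(parts.netloc, ref, path)
```

Note that `urlsplit` treats `?` and `#` as query and fragment delimiters, so object paths containing those characters would need extra handling.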
Migration Paths¶
From File Systems to lakeFS¶
- Start with lakefs-spec: Familiar filesystem interface
- Add High-Level SDK: For repository management and advanced features
- Consider Generated SDK: For custom integrations and specific API needs
Between lakeFS SDKs¶
- Generated → High-Level: Gradual migration; the Generated SDK client remains accessible through the High-Level SDK
- High-Level → Generated: For specific API access, use the `client.sdk_client` property
- Any SDK → lakefs-spec: For data science workflows; can run in parallel with the other SDKs
See Also¶
SDK Selection and Setup:
- Python SDK Overview - Complete SDK overview and selection guide
- SDK Decision Matrix - Interactive decision guide
- Getting Started Guide - Installation and setup for all SDKs
- Authentication Methods - Credential configuration

SDK-Specific Documentation:
- High-Level SDK Overview - Detailed High-Level SDK documentation
- High-Level SDK Quickstart - Basic operations and examples
- Generated SDK Overview - Direct API access patterns
- Generated SDK Examples - Common usage patterns
- lakefs-spec Overview - Filesystem interface documentation
- lakefs-spec Integrations - Data science library examples

Feature-Specific Guides:
- Transaction Patterns - Atomic operations across SDKs
- Object I/O Operations - File handling patterns
- Data Import/Export - Bulk data operations
- Filesystem Operations - File-like operations

Learning Resources:
- Data Science Tutorial - End-to-end workflow examples
- ETL Pipeline Tutorial - Building data pipelines
- ML Experiment Tracking - Model versioning workflows

Reference Materials:
- Best Practices - Production deployment guidelines
- Performance Optimization - SDK performance tuning
- Troubleshooting - Common issues and solutions
- Error Handling Patterns - Exception handling strategies

Migration Guides:
- SDK Migration Strategies - Moving between SDKs
- Legacy Integration - Integrating with existing systems

External Resources:
- High-Level SDK API Reference - Complete API documentation
- Generated SDK API Reference - Auto-generated API docs
- lakefs-spec Documentation - Third-party filesystem interface