Python Integration with lakeFS

lakeFS offers several Python integration options for different use cases and development patterns. This guide compares them so you can choose the right SDK and get started quickly.

Legacy SDK Deprecation

If your project uses the legacy Python lakefs-client package, note that it is deprecated. As of release v1.44.0, it no longer receives updates or new features.

SDK Architecture and Relationships

Understanding the relationship between different Python SDKs helps you make informed decisions:

graph TD
    A[Your Application] --> B[High-Level SDK]
    A --> C[Generated SDK]
    A --> D[lakefs-spec]

    B --> C
    C --> F[lakeFS API]
    D --> F

    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8

High-Level SDK is built on top of the Generated SDK, providing simplified interfaces while maintaining access to the underlying client
Generated SDK provides direct access to all lakeFS API endpoints based on the OpenAPI specification
lakefs-spec offers a filesystem-like interface compatible with the fsspec ecosystem
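The layering described above can be sketched in a few lines. This is a minimal illustration, assuming the High-Level SDK's `Client` exposes the Generated SDK client through its `sdk_client` property; the helper name `first_page_repository_ids` is hypothetical, and the host and credentials in the comments are placeholders.

```python
def first_page_repository_ids(high_level_client) -> list[str]:
    """Return repository IDs from the first page of results.

    `high_level_client` is expected to be a `lakefs.client.Client`.
    The call goes through the Generated SDK client that the
    High-Level SDK wraps, rather than through the high-level objects.
    """
    generated = high_level_client.sdk_client  # underlying Generated SDK client
    page = generated.repositories_api.list_repositories()
    return [repo.id for repo in page.results]


# Typical wiring (requires `pip install lakefs` and a reachable server;
# all values below are placeholders):
#   from lakefs.client import Client
#   clt = Client(host="http://localhost:8000",
#                username="<access key id>",
#                password="<secret access key>")
#   print(first_page_repository_ids(clt))
```

Passing the client in as an argument keeps the helper independent of how the connection is configured.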

Comprehensive SDK Comparison

| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Installation | `pip install lakefs` | `pip install lakefs-sdk` | `pip install lakefs-spec` |
| API Style | Object-oriented, simplified | Direct API mapping | Filesystem-like |
| Learning Curve | Easy | Moderate | Easy |
| Repository Management | ✅ Full support | ✅ Full support | ❌ Not supported |
| Branch Operations | ✅ Simplified interface | ✅ Full API access | ⚠️ Limited |
| Object Operations | ✅ Streaming I/O | ✅ Manual handling | ✅ File-like operations |
| Transactions | ✅ Built-in support | ⚠️ Manual implementation | ✅ Context managers |
| Data Science Integration | ⚠️ Via file-like objects | ❌ Manual integration | ✅ Native pandas/dask support |
| Async Support | ❌ Sync only | ⚠️ Limited | ❌ Sync only |
| Error Handling | ✅ Pythonic exceptions | ✅ API-level exceptions | ✅ Filesystem exceptions |
| Performance | Good | Best (direct API) | Good |
| Maintenance | lakeFS team | Auto-generated | Third-party |
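To see which of the three packages from the Installation row are available in the current environment, a small standard-library check is enough. This is a sketch: the mapping from pip distribution names to import names (`lakefs`, `lakefs_sdk`, `lakefs_spec`) is stated here as an assumption, and `installed_sdks` is a hypothetical helper name.

```python
from importlib.util import find_spec

# pip distribution name -> importable module name (assumed mapping)
SDK_MODULES = {
    "lakefs": "lakefs",            # High-Level SDK
    "lakefs-sdk": "lakefs_sdk",    # Generated SDK
    "lakefs-spec": "lakefs_spec",  # fsspec implementation
}


def installed_sdks() -> dict[str, bool]:
    """Report which SDK packages can be imported, without importing them."""
    return {
        dist: find_spec(module) is not None
        for dist, module in SDK_MODULES.items()
    }


for dist, present in installed_sdks().items():
    print(f"{dist}: {'installed' if present else 'missing'}")
```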

SDK Strengths and Use Cases

High-Level SDK

Strengths:

- Intuitive, Pythonic API design
- Built-in transaction support with context managers
- Streaming I/O operations for large files
- Automatic connection management and retries
- Access to underlying Generated SDK when needed

Best for:

- Data engineers building ETL pipelines
- Python developers new to lakeFS
- Applications requiring transaction semantics
- Workflows with large file uploads/downloads

Example use cases:

- Data pipeline orchestration
- Batch data processing
- Model training workflows
- Data quality validation
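As an illustration of the transaction support mentioned above, the sketch below writes one object inside a transaction. It assumes the `lakefs` package's `Branch.transact` context manager, which is expected to commit and merge the staged changes when the block exits without error. The helper name `publish_report` and the repository, branch, and path values in the comments are hypothetical.

```python
def publish_report(branch, path: str, text: str) -> None:
    """Upload `text` to `path` on `branch` inside a single transaction.

    `branch` is expected to be a High-Level SDK branch object, for
    example the result of `lakefs.repository(...).branch(...)`.
    """
    with branch.transact(commit_message=f"Add {path}") as tx:
        # Writes go through the transaction object, not the original branch.
        tx.object(path).upload(data=text, mode="w")


# Typical wiring (requires `pip install lakefs`, configured credentials,
# and an existing repository; the names below are placeholders):
#   import lakefs
#   branch = lakefs.repository("example-repo").branch("main")
#   publish_report(branch, "reports/summary.txt", "all checks passed")
```

If the body of the `with` block raises, the changes should not reach the target branch, which is the main reason to prefer this pattern over separate upload and commit calls.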

Generated SDK

Strengths:

- Complete API coverage (all endpoints available)
- Direct mapping to lakeFS REST API
- Fine-grained control over requests
- Auto-generated from OpenAPI specification
- Consistent with other language SDKs

Best for:

- Advanced users needing full API control
- Custom tooling and integrations
- Operations not covered by High-Level SDK
- Performance-critical applications

Example use cases:

- Custom lakeFS management tools
- Advanced metadata operations
- Integration with existing API frameworks
- Performance-optimized data access
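The "manual handling" trade-off of the Generated SDK shows up in tasks such as pagination, which the caller drives explicitly. The sketch below assumes the `lakefs_sdk` client exposes `repositories_api.list_repositories(after=..., amount=...)` and returns a page with `results` and `pagination` (`has_more`, `next_offset`) fields. `list_repository_ids` is a hypothetical helper, and the host and credentials in the comments are placeholders.

```python
def list_repository_ids(client, page_size: int = 100) -> list[str]:
    """Collect all repository IDs by following pagination manually.

    `client` is expected to be a `lakefs_sdk.client.LakeFSClient`.
    """
    ids: list[str] = []
    after = None  # no offset on the first request
    while True:
        page = client.repositories_api.list_repositories(
            after=after, amount=page_size
        )
        ids.extend(repo.id for repo in page.results)
        if not page.pagination.has_more:
            return ids
        after = page.pagination.next_offset  # resume after the last item seen


# Typical wiring (requires `pip install lakefs-sdk`; values are placeholders):
#   import lakefs_sdk
#   from lakefs_sdk.client import LakeFSClient
#   config = lakefs_sdk.Configuration(
#       host="http://localhost:8000/api/v1",
#       username="<access key id>",
#       password="<secret access key>",
#   )
#   print(list_repository_ids(LakeFSClient(config)))
```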

lakefs-spec

Strengths:

- Filesystem-like API familiar to Python developers
- Native integration with pandas, dask, and other fsspec libraries
- Transparent handling of lakeFS URIs
- Built-in transaction support
- Third-party maintained with active community

Best for:

- Data scientists and analysts
- Jupyter notebook workflows
- Existing fsspec-based applications
- Quick prototyping and exploration

Example use cases:

- Interactive data analysis
- Machine learning experimentation
- Data exploration in notebooks
- Integration with existing data science stacks
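With lakefs-spec installed, fsspec-aware libraries such as pandas can read `lakefs://` URIs directly. The sketch below assumes URIs of the form `lakefs://<repository>/<ref>/<path>` and that lakeFS credentials are already configured in the environment. Both helper names are hypothetical, and the repository and file names in the comment are placeholders.

```python
def lakefs_uri(repository: str, ref: str, path: str) -> str:
    """Build a lakefs:// URI from a repository, a ref (branch, tag, or commit), and a path."""
    return f"lakefs://{repository}/{ref}/{path.lstrip('/')}"


def read_csv_from_lakefs(repository: str, ref: str, path: str):
    """Load a CSV object from lakeFS into a pandas DataFrame.

    Requires `pip install pandas lakefs-spec`; pandas resolves the
    `lakefs` protocol through fsspec once lakefs-spec is installed.
    """
    import pandas as pd  # imported lazily so the URI helper works without pandas

    return pd.read_csv(lakefs_uri(repository, ref, path))


# Example (placeholder names):
#   df = read_csv_from_lakefs("example-repo", "main", "data/lakes.csv")
```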

SDK Selection Decision Matrix

Use the questions below to choose the right SDK for your needs:

🤔 What's your primary use case?

📊 Data Science & Analytics

Working with pandas/dask? → lakefs-spec
Need transactions in notebooks? → lakefs-spec or High-Level SDK
Building ML pipelines? → High-Level SDK

🔧 Data Engineering & ETL

Building data pipelines? → High-Level SDK
Need transaction support? → High-Level SDK
Processing large files? → High-Level SDK (streaming I/O)

🏗️ Application Development

Building lakeFS management tools? → Generated SDK
Need full API control? → Generated SDK
Integrating with existing systems? → Generated SDK

🎯 Feature-Based Selection

| If you need... | Choose... | Why? |
|---|---|---|
| Simplest API | High-Level SDK | Pythonic, intuitive interface |
| Complete API access | Generated SDK | All endpoints available |
| Pandas integration | lakefs-spec | Native fsspec support |
| Transaction support | High-Level SDK or lakefs-spec | Built-in context managers |
| Streaming large files | High-Level SDK | Optimized I/O operations |
| Custom tooling | Generated SDK | Full control and flexibility |
| Jupyter notebooks | lakefs-spec | Filesystem-like operations |

🚀 Experience Level Guide

New to lakeFS

Start with Getting Started
Try the High-Level SDK for general use
Use lakefs-spec for data science workflows

Experienced with lakeFS

Use Generated SDK for advanced operations
Combine multiple SDKs as needed
Check best practices for optimization

Migrating from S3

Review S3 Gateway documentation for S3-compatible access
Consider gradual migration strategies
Plan integration with existing S3-based workflows

Quick Start

Getting Started - Installation and setup guide
Choose your SDK - Select the appropriate SDK for your use case
Follow tutorials - Learn with real-world examples

Documentation Sections

Getting Started - Installation, authentication, and basic setup
High-Level SDK - Comprehensive SDK documentation
Generated SDK - Direct API access patterns
lakefs-spec - Filesystem API and data science integrations
Tutorials - Real-world examples and workflows
Reference - API comparison, best practices, and troubleshooting