Python Integration with lakeFS

lakeFS offers three Python integration options: the High-Level SDK, the Generated SDK, and lakefs-spec. This guide helps you choose the right one for your use case and get started quickly.

Legacy SDK Deprecation

If your project uses the legacy Python lakefs-client package, be aware that it has been deprecated. As of release v1.44.0, it no longer receives updates or new features.
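To check whether an environment still carries the deprecated package, one option is to query the installed distribution metadata. The following is a minimal sketch using only the standard library. It assumes the pip distribution names `lakefs-client` (legacy), `lakefs`, `lakefs-sdk`, and `lakefs-spec`; the helper names are hypothetical and not part of any lakeFS package.

```python
from importlib.metadata import PackageNotFoundError, version

# pip distribution names: the legacy client plus the three current options.
LEGACY_PACKAGE = "lakefs-client"
CURRENT_PACKAGES = ("lakefs", "lakefs-sdk", "lakefs-spec")


def installed_lakefs_packages():
    """Map each lakeFS-related distribution name to its version, or None."""
    found = {}
    for name in (LEGACY_PACKAGE, *CURRENT_PACKAGES):
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def uses_legacy_client(packages):
    """Return True if the deprecated lakefs-client distribution is installed."""
    return packages.get(LEGACY_PACKAGE) is not None
```

Calling `uses_legacy_client(installed_lakefs_packages())` in a CI step is one way to flag environments that still need migrating.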

SDK Architecture and Relationships

The following diagram shows how the Python SDKs relate to each other and to the lakeFS API:

graph TD
    A[Your Application] --> B[High-Level SDK]
    A --> C[Generated SDK]
    A --> D[lakefs-spec]

    B --> C
    C --> F[lakeFS API]
    D --> F

    style B fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
  • High-Level SDK is built on top of the Generated SDK, providing simplified interfaces while maintaining access to the underlying client
  • Generated SDK provides direct access to all lakeFS API endpoints based on the OpenAPI specification
  • lakefs-spec offers a filesystem-like interface compatible with the fsspec ecosystem
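The layering described above can be sketched as follows: a High-Level SDK client is created, and the Generated SDK client underneath it is used for a direct API call. This is an illustration under assumptions, not a definitive recipe. It assumes the High-Level SDK exposes the wrapped client through a `sdk_client` property on `lakefs.client.Client`; the helper names, endpoint, and credentials are placeholders.

```python
def repository_ids_via_generated_sdk(client):
    """List repository IDs through the Generated SDK client wrapped by `client`.

    `client` is expected to be a High-Level SDK `lakefs.client.Client`.
    Only the first page of results is read; pagination is left out for brevity.
    """
    generated = client.sdk_client  # underlying Generated SDK client (assumed property)
    page = generated.repositories_api.list_repositories()
    return [repo.id for repo in page.results]


def example_usage():
    # Imported lazily so the helper above loads even without the SDK installed.
    from lakefs.client import Client

    # Placeholder endpoint and credentials; replace with your own.
    client = Client(
        host="http://localhost:8000",
        username="<access-key-id>",
        password="<secret-access-key>",
    )
    return repository_ids_via_generated_sdk(client)
```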

Comprehensive SDK Comparison

| Feature | High-Level SDK | Generated SDK | lakefs-spec |
|---|---|---|---|
| Installation | pip install lakefs | pip install lakefs-sdk | pip install lakefs-spec |
| API Style | Object-oriented, simplified | Direct API mapping | Filesystem-like |
| Learning Curve | Easy | Moderate | Easy |
| Repository Management | ✅ Full support | ✅ Full support | ❌ Not supported |
| Branch Operations | ✅ Simplified interface | ✅ Full API access | ❌ Limited |
| Object Operations | ✅ Streaming I/O | ✅ Manual handling | ✅ File-like operations |
| Transactions | ✅ Built-in support | ⚠️ Manual implementation | ✅ Context managers |
| Data Science Integration | ⚠️ Via file-like objects | ❌ Manual integration | ✅ Native pandas/dask support |
| Async Support | ❌ Sync only | ⚠️ Limited | ❌ Sync only |
| Error Handling | ✅ Pythonic exceptions | ✅ API-level exceptions | ✅ Filesystem exceptions |
| Performance | Good | Best (direct API) | Good |
| Maintenance | lakeFS team | Auto-generated | Third-party |

SDK Strengths and Use Cases

High-Level SDK

Strengths:
  • Intuitive, Pythonic API design
  • Built-in transaction support with context managers
  • Streaming I/O operations for large files
  • Automatic connection management and retries
  • Access to underlying Generated SDK when needed

Best for:
  • Data engineers building ETL pipelines
  • Python developers new to lakeFS
  • Applications requiring transaction semantics
  • Workflows with large file uploads/downloads

Example use cases:
  • Data pipeline orchestration
  • Batch data processing
  • Model training workflows
  • Data quality validation
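The transaction support mentioned above can be sketched as a small helper that uploads several objects and commits them as one unit. This is a minimal illustration, assuming the High-Level SDK's `Branch.transact(commit_message=...)` context manager and `upload(data=...)` on objects; the helper name, repository name, and paths are hypothetical.

```python
def publish_text_files(branch, files, message):
    """Upload several text objects and commit them together.

    `branch` is expected to be a High-Level SDK branch, for example
    lakefs.repository("example-repo").branch("main").
    `files` maps object paths to text content.
    """
    # Changes made through `tx` are committed only if the block exits cleanly.
    with branch.transact(commit_message=message) as tx:
        for path, text in files.items():
            tx.object(path).upload(data=text)
    return sorted(files)


def example_usage():
    # Imported lazily so the helper above loads even without the SDK installed.
    import lakefs

    # "example-repo" is a placeholder repository name.
    branch = lakefs.repository("example-repo").branch("main")
    return publish_text_files(
        branch,
        {"reports/daily.csv": "id,value\n1,42\n"},
        "Add daily report",
    )
```

Passing the branch in as a parameter keeps the helper easy to test with a stand-in object and leaves client configuration to the caller.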

Generated SDK

Strengths:
  • Complete API coverage (all endpoints available)
  • Direct mapping to lakeFS REST API
  • Fine-grained control over requests
  • Auto-generated from OpenAPI specification
  • Consistent with other language SDKs

Best for:
  • Advanced users needing full API control
  • Custom tooling and integrations
  • Operations not covered by High-Level SDK
  • Performance-critical applications

Example use cases:
  • Custom lakeFS management tools
  • Advanced metadata operations
  • Integration with existing API frameworks
  • Performance-optimized data access
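As an illustration of the fine-grained control the Generated SDK offers, the sketch below pages through a repository's branches by hand. It assumes the Generated SDK's `branches_api.list_branches(repository, after=..., amount=...)` call and the `pagination.has_more` / `pagination.next_offset` fields on its result; the helper name, endpoint, credentials, and repository name are placeholders.

```python
def iter_branch_ids(client, repository, page_size=100):
    """Yield every branch ID in `repository`, one API page at a time.

    `client` is expected to be a Generated SDK `LakeFSClient`.
    """
    after = ""
    while True:
        page = client.branches_api.list_branches(
            repository, after=after, amount=page_size
        )
        for ref in page.results:
            yield ref.id
        if not page.pagination.has_more:
            break
        # Continue from where the previous page ended.
        after = page.pagination.next_offset


def example_usage():
    # Imported lazily so the helper above loads even without the SDK installed.
    import lakefs_sdk
    from lakefs_sdk.client import LakeFSClient

    # Placeholder endpoint and credentials; replace with your own.
    configuration = lakefs_sdk.Configuration(
        host="http://localhost:8000",
        username="<access-key-id>",
        password="<secret-access-key>",
    )
    client = LakeFSClient(configuration)
    return list(iter_branch_ids(client, "example-repo"))
```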

lakefs-spec

Strengths:
  • Filesystem-like API familiar to Python developers
  • Native integration with pandas, dask, and other fsspec libraries
  • Transparent handling of lakeFS URIs
  • Built-in transaction support
  • Third-party maintained with active community

Best for:
  • Data scientists and analysts
  • Jupyter notebook workflows
  • Existing fsspec-based applications
  • Quick prototyping and exploration

Example use cases:
  • Interactive data analysis
  • Machine learning experimentation
  • Data exploration in notebooks
  • Integration with existing data science stacks
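With lakefs-spec installed, fsspec-aware libraries such as pandas can read `lakefs://` URIs directly. The sketch below shows the idea, assuming the URI layout `lakefs://<repository>/<ref>/<path>` and that credentials are already configured in the environment; the helper names, repository name, and object path are hypothetical.

```python
def lakefs_uri(repository, ref, path):
    """Build a lakefs://<repository>/<ref>/<path> URI."""
    return f"lakefs://{repository}/{ref}/{path.lstrip('/')}"


def load_table(repository, ref, path):
    """Read a Parquet object from lakeFS into a pandas DataFrame.

    Requires pandas, a Parquet engine, and lakefs-spec to be installed;
    lakefs-spec provides the fsspec handler for the lakefs:// scheme.
    """
    # Imported lazily so lakefs_uri() stays usable without pandas installed.
    import pandas as pd

    return pd.read_parquet(lakefs_uri(repository, ref, path))
```

For example, `load_table("example-repo", "main", "data/lakes.parquet")` would read from the `main` branch, and passing a commit ID or tag as `ref` would read that fixed version instead.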

SDK Selection Decision Matrix

Use this decision tree to choose the right SDK for your needs:

🤔 What's your primary use case?

📊 Data Science & Analytics

🔧 Data Engineering & ETL

🏗️ Application Development

🎯 Feature-Based Selection

| If you need... | Choose... | Why? |
|---|---|---|
| Simplest API | High-Level SDK | Pythonic, intuitive interface |
| Complete API access | Generated SDK | All endpoints available |
| Pandas integration | lakefs-spec | Native fsspec support |
| Transaction support | High-Level SDK or lakefs-spec | Built-in context managers |
| Streaming large files | High-Level SDK | Optimized I/O operations |
| Custom tooling | Generated SDK | Full control and flexibility |
| Jupyter notebooks | lakefs-spec | Filesystem-like operations |
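When several needs apply at once, the table above can be read as a simple vote: each need points at one or two SDKs, and the SDK named most often is the strongest candidate. The sketch below encodes the table that way. The function name and the tie-breaking rule (alphabetical order) are illustrative choices, not part of lakeFS.

```python
from collections import Counter

# Feature-based recommendations, transcribed from the table above.
RECOMMENDATIONS = {
    "simplest api": ("High-Level SDK",),
    "complete api access": ("Generated SDK",),
    "pandas integration": ("lakefs-spec",),
    "transaction support": ("High-Level SDK", "lakefs-spec"),
    "streaming large files": ("High-Level SDK",),
    "custom tooling": ("Generated SDK",),
    "jupyter notebooks": ("lakefs-spec",),
}


def rank_sdks(needs):
    """Rank SDKs by how many of the given needs each one covers."""
    votes = Counter()
    for need in needs:
        key = need.strip().lower()
        if key not in RECOMMENDATIONS:
            raise ValueError(f"unknown need: {need!r}")
        votes.update(RECOMMENDATIONS[key])
    # Most votes first; ties are broken alphabetically for a stable result.
    return sorted(votes, key=lambda sdk: (-votes[sdk], sdk))
```

For instance, `rank_sdks(["Pandas integration", "Transaction support"])` places lakefs-spec ahead of the High-Level SDK, because lakefs-spec covers both needs.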

🚀 Experience Level Guide

New to lakeFS

  1. Start with Getting Started
  2. Try High-Level SDK for general use
  3. Or lakefs-spec for data science workflows

Experienced with lakeFS

Migrating from S3

  1. Review S3 Gateway documentation for S3-compatible access
  2. Consider gradual migration strategies
  3. Plan integration with existing S3-based workflows
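For existing S3-based code, one gradual path is to point the S3 client at the lakeFS S3 gateway and keep the rest of the code unchanged. The sketch below assumes boto3 (which this guide does not otherwise cover) and the gateway convention that the repository acts as the bucket and the key starts with the branch name; the helper names and all argument values are placeholders.

```python
def gateway_location(repository, branch, path):
    """Translate a lakeFS location into the (Bucket, Key) pair used via the S3 gateway."""
    return repository, f"{branch}/{path.lstrip('/')}"


def read_object_via_gateway(endpoint_url, access_key_id, secret_access_key,
                            repository, branch, path):
    """Read one object from lakeFS through its S3-compatible endpoint."""
    # Imported lazily so gateway_location() stays usable without boto3 installed.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # the lakeFS server URL, not an AWS endpoint
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    bucket, key = gateway_location(repository, branch, path)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

Because only the endpoint, bucket, and key prefix change, existing S3 workflows can be moved over one code path at a time.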

Quick Start

  1. Getting Started - Installation and setup guide
  2. Choose your SDK - Select the appropriate SDK for your use case
  3. Follow tutorials - Learn with real-world examples

See Also

Getting Started:
  • Installation and Setup Guide - Complete setup instructions for all SDKs
  • Authentication Methods - Configure credentials and connections

SDK-Specific Documentation:
  • High-Level SDK Overview - Simplified Python interface
  • Generated SDK Overview - Direct API access
  • lakefs-spec Overview - Filesystem operations

Learning Resources:
  • Real-World Tutorials - End-to-end examples and workflows
  • Best Practices Guide - Production deployment guidance
  • API Comparison Matrix - Feature comparison across SDKs

External Resources:
  • High-Level SDK Documentation - Complete API reference
  • Generated SDK Documentation - Auto-generated API docs
  • lakefs-spec Documentation - Third-party filesystem interface