High-Level Python SDK

The High-Level Python SDK provides a simplified, Pythonic interface for working with lakeFS. Built on top of the Generated SDK, it offers advanced features like transactions, streaming I/O, and intuitive object management while maintaining the full power of the underlying API.

Key Concepts

Repository-Centric Design

The High-Level SDK is organized around repositories as the primary entry point. All operations flow from repository objects to branches, commits, and objects, providing a natural hierarchy that mirrors lakeFS's data model.

Lazy Evaluation

Objects are created lazily - creating a Repository, Branch, or StoredObject instance doesn't immediately interact with the server. Operations only execute when you call action methods like create(), upload(), or commit().
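
A minimal sketch of what laziness means in practice. It assumes `repo` is a repository handle such as the one returned by `lakefs.repository("my-repo")`; the helper name `object_exists` is hypothetical, and the `branch()`, `object()` and `exists()` calls should be checked against the API reference for your installed version.

```python
def object_exists(repo, branch_name, path):
    """Return True if `path` exists on `branch_name`.

    `repo` is assumed to be a lakefs Repository handle, for example
    lakefs.repository("my-repo").
    """
    # Building handles is local work: these two calls are not expected
    # to send any request to the server.
    branch = repo.branch(branch_name)
    obj = branch.object(path)
    # exists() is the first call in this helper that queries the server.
    return obj.exists()
```

Because the handles are cheap to build, they can be created up front and passed around; the cost is paid only when an action or query method runs.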

Fluent Interface

The SDK supports method chaining for common workflows:

repo = lakefs.repository("my-repo").create(storage_namespace="s3://bucket/path")
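branch = repo.branch("main")
# upload() returns the object handle, not the branch, so commit on the branch
branch.object("file.txt").upload(data="content")
commit = branch.commit(message="Add file")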

Built-in Error Handling

Failed operations raise specific exception types for different failure scenarios, such as a missing resource or a conflicting change, so applications can handle each case separately instead of parsing error messages.
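
As an illustration, the sketch below reads an object and falls back to a default when it is missing. It assumes that `lakefs.exceptions` exposes `NotFoundException` and that `obj` is a StoredObject whose `reader()` works as a context manager; `read_text_or_default` is a hypothetical helper name, not part of the SDK.

```python
def read_text_or_default(obj, default=""):
    """Read an object as text, falling back to `default` if it is missing.

    `obj` is assumed to be a StoredObject, for example
    repo.branch("main").object("file.txt").
    """
    # Imported inside the function so the helper can be defined even
    # where the lakefs package is not installed.
    from lakefs.exceptions import NotFoundException

    try:
        with obj.reader(mode="r") as reader:
            return reader.read()
    except NotFoundException:
        # A missing repository, branch, or object is treated as "no data yet".
        return default
```

Catching the narrow exception type keeps unrelated failures, such as authorization or server errors, visible to the caller.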

Key Features

  • Simplified API - Pythonic interface that abstracts complex operations
  • Transaction Support - Atomic operations with automatic rollback capabilities
  • Streaming I/O - File-like objects for efficient handling of large datasets
  • Import Management - Sophisticated data import operations with progress tracking
  • Batch Operations - Efficient bulk operations for better performance
  • Generated SDK Access - Direct access to underlying Generated SDK when needed
  • Automatic Authentication - Seamless credential discovery from environment or config files

Architecture Overview

The High-Level SDK is structured in layers:

High-Level SDK (lakefs package)
├── Repository Management
├── Branch & Reference Operations  
├── Object I/O & Streaming
├── Transaction Management
├── Import/Export Operations
└── Generated SDK (lakefs_sdk)
    └── Direct API Access

Core Classes

Repository

The main entry point for all operations. Represents a lakeFS repository and provides access to branches, tags, and metadata.

Branch

Extends Reference with write capabilities. Supports object uploads, commits, merges, and transaction management.
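
A short sketch of the transaction support mentioned above, assuming `branch` is a Branch handle and that `transact()` is a context manager yielding a branch-like transaction object. The helper name `upload_atomically` and the file paths in the usage note are hypothetical.

```python
def upload_atomically(branch, files, message):
    """Upload every entry of `files` (path -> data) as a single commit.

    `branch` is assumed to be a lakefs Branch, for example
    repo.branch("main").
    """
    # The transaction handle behaves like a branch. If the block raises,
    # none of the uploads are expected to land on `branch`.
    with branch.transact(commit_message=message) as tx:
        for path, data in files.items():
            tx.object(path).upload(data=data)
    return len(files)
```

A call such as `upload_atomically(repo.branch("main"), {"a.txt": "1", "b.txt": "2"}, "Add files")` would then produce one commit containing both files, or no change at all.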

Reference

Read-only access to any lakeFS reference (branch, commit, or tag). Provides object listing and reading capabilities.
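
For example, listing can be used to total the size of everything under a prefix. This sketch assumes `ref` is a Reference (a branch, tag, or commit handle) and that `objects()` yields entries with a `size_bytes` attribute when no delimiter is given; `total_size` is a hypothetical helper name.

```python
def total_size(ref, prefix=None):
    """Sum the sizes of all objects under `prefix` at a reference.

    `ref` is assumed to be a lakefs Reference or Branch, for example
    repo.ref("main") or repo.tag("v1").
    """
    # objects() is assumed to be a generator, so large listings are
    # consumed entry by entry rather than loaded at once.
    return sum(info.size_bytes for info in ref.objects(prefix=prefix))
```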

StoredObject & WriteableObject

Represent objects in lakeFS with full I/O capabilities including streaming, metadata management, and batch operations.
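
The streaming interface can be sketched as a chunked copy between two objects. This assumes `src` is a StoredObject with a `reader()` method and `dst` a WriteableObject with a `writer()` method, both returning file-like objects usable in a `with` block; `copy_in_chunks` is a hypothetical helper name.

```python
def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Stream `src` into `dst` without holding the whole object in memory.

    `src` is assumed to be a StoredObject and `dst` a WriteableObject.
    Returns the number of bytes copied.
    """
    copied = 0
    with src.reader(mode="rb") as reader, dst.writer(mode="wb") as writer:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                # An empty read signals end of object.
                break
            writer.write(chunk)
            copied += len(chunk)
    return copied
```

Reading in fixed-size chunks bounds memory use at roughly `chunk_size`, regardless of the size of the object.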

ImportManager

Handles complex data import operations with support for various source types and progress monitoring.
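
A sketch of a prefix import, assuming `branch.import_data()` returns an ImportManager whose `prefix()` adds a source and whose `run()` starts the import and blocks until it completes. The helper name `import_prefix` and the URI in the usage note are hypothetical.

```python
def import_prefix(branch, source_uri, destination, message):
    """Import everything under `source_uri` into `destination` on `branch`.

    `branch` is assumed to be a lakefs Branch; `source_uri` is an object
    store URI in the same storage the repository can read from.
    """
    importer = branch.import_data(commit_message=message)
    importer.prefix(source_uri, destination=destination)
    # run() is assumed to wait for completion and return a status object.
    return importer.run()
```

A call might look like `import_prefix(repo.branch("main"), "s3://my-bucket/raw/", "raw/", "Import raw data")`; inspect the returned status rather than assuming success.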

Quick Example

import lakefs

# Create a repository
repo = lakefs.repository("my-repo").create(
    storage_namespace="s3://my-bucket/repos/my-repo"
)

# Create a branch and upload data
branch = repo.branch("feature-branch").create(source_reference="main")
obj = branch.object("data/file.txt").upload(data="Hello, lakeFS!")

# Commit changes
commit = branch.commit(message="Add new data file")
print(f"Committed: {commit.id}")

Installation

pip install lakefs

Authentication

The SDK automatically discovers credentials from:

  1. Environment variables (LAKECTL_CREDENTIALS_ACCESS_KEY_ID, LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY, LAKECTL_SERVER_ENDPOINT_URL)
  2. Configuration file (~/.lakectl.yaml)
  3. Explicit client configuration
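
Explicit client configuration can be sketched as follows. It assumes `lakefs.client.Client` accepts `host`, `username` and `password` keyword arguments and that `lakefs.repository()` accepts a `client` argument; `open_repository` is a hypothetical helper, and the endpoint and keys in the usage note are placeholders.

```python
def open_repository(repo_id, endpoint, access_key_id, secret_access_key):
    """Return a repository handle bound to explicitly supplied credentials."""
    # Imported inside the function so the sketch can be loaded without
    # the lakefs package installed.
    import lakefs
    from lakefs.client import Client

    # The access key pair is passed as username/password (assumed mapping).
    client = Client(host=endpoint, username=access_key_id, password=secret_access_key)
    return lakefs.repository(repo_id, client=client)
```

For example, `open_repository("my-repo", "http://localhost:8000", "<access-key-id>", "<secret-access-key>")`. Passing a client explicitly is useful when one process talks to more than one lakeFS installation.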

When to Use High-Level SDK

Choose the High-Level SDK when you need:

  • Simplified workflows - Common operations with minimal code
  • Transaction support - Atomic operations across multiple changes
  • Streaming I/O - Efficient handling of large files
  • Import management - Complex data ingestion workflows
  • Python-first experience - Pythonic interfaces and error handling

For direct API control or operations not covered by the high-level interface, you can access the underlying Generated SDK through the client property.

Next Steps

Start with the quickstart guide to learn the basics, then explore specific features in the detailed sections.

See Also

Getting Started:

  • Python SDK Overview - Compare all Python SDK options
  • Installation Guide - Setup and authentication
  • SDK Selection Guide - Choose the right SDK

High-Level SDK Documentation:

  • Quickstart Guide - Basic operations and examples
  • Repository Management - Create, configure, and manage repositories
  • Version Control - Branches, commits, and merging
  • Object Operations - Upload, download, and streaming I/O
  • Data Import/Export - Bulk data operations
  • Transaction Patterns - Atomic operations and rollback
  • Advanced Features - Performance optimization and patterns

Alternative SDK Options:

  • Generated SDK - Direct API access for advanced use cases
  • lakefs-spec - Filesystem interface for data science
  • Boto3 Integration - S3-compatible operations

Learning Resources:

  • Data Science Tutorial - End-to-end data analysis workflow
  • ETL Pipeline Tutorial - Building data pipelines
  • ML Experiment Tracking - Model versioning

Reference Materials:

  • API Comparison - Feature comparison across SDKs
  • Best Practices - Production deployment guidance
  • Troubleshooting - Common issues and solutions

External Resources:

  • High-Level SDK API Reference - Complete API documentation
  • Generated SDK Access - Using Generated SDK from High-Level SDK