Repository Management¶
Learn how to create, configure, and manage lakeFS repositories using the High-Level Python SDK. Repositories are the top-level containers in lakeFS that hold all your data, branches, and version history.
Repository Concepts¶
Repository Structure¶
A lakeFS repository consists of:
- Storage namespace: The underlying storage location (S3 bucket, Azure container, etc.)
- Default branch: The main branch created automatically (typically "main")
- Metadata: Repository-level configuration and properties
- Branches and tags: Version control references within the repository
Lazy Initialization¶
Repository objects are created lazily: instantiating a Repository object doesn't immediately connect to the server. Server calls happen only when you access properties or call action methods.
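The lazy pattern described above can be illustrated with a small stand-alone sketch. `LazyResource` and `fake_fetch` are hypothetical names used only for illustration; this is not the SDK's actual implementation, just one common way to defer and cache a remote lookup.

```python
from typing import Callable, Dict, List, Optional


class LazyResource:
    """Illustrative stand-in for an object that defers server calls."""

    def __init__(self, resource_id: str,
                 fetch_properties: Callable[[str], Dict[str, str]]):
        # Constructing the object only records the ID; nothing is fetched yet.
        self.id = resource_id
        self._fetch = fetch_properties
        self._properties: Optional[Dict[str, str]] = None

    @property
    def properties(self) -> Dict[str, str]:
        # The first access triggers the (simulated) server call;
        # later accesses reuse the cached result.
        if self._properties is None:
            self._properties = self._fetch(self.id)
        return self._properties


calls: List[str] = []


def fake_fetch(resource_id: str) -> Dict[str, str]:
    """Simulated server call that records each invocation."""
    calls.append(resource_id)
    return {"id": resource_id, "default_branch": "main"}


resource = LazyResource("my-repo", fake_fetch)
print(len(calls))                              # 0: no call on construction
print(resource.properties["default_branch"])   # main
print(resource.properties is resource.properties)  # True: cached object
print(len(calls))                              # 1: fetched only once
```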
Creating Repositories¶
Basic Repository Creation¶
import lakefs
# Create a new repository
repo = lakefs.repository("my-repo").create(
    storage_namespace="s3://my-bucket/repos/my-repo"
)
print(f"Repository created: {repo.id}")
print(f"Storage namespace: {repo.properties.storage_namespace}")
print(f"Default branch: {repo.properties.default_branch}")
Repository with Custom Configuration¶
# Create repository with custom settings
repo = lakefs.repository("custom-repo").create(
    storage_namespace="s3://my-bucket/repos/custom",
    default_branch="develop",
    include_samples=True  # Include sample data
)
print(f"Created with default branch: {repo.properties.default_branch}")
Safe Repository Creation¶
from lakefs.exceptions import ConflictException
try:
    # Try to create repository
    repo = lakefs.repository("existing-repo").create(
        storage_namespace="s3://my-bucket/repos/existing",
        exist_ok=False  # Fail if repository exists
    )
    print("Repository created successfully")
except ConflictException:
    print("Repository already exists")
    # Connect to existing repository instead
    repo = lakefs.repository("existing-repo")
Using exist_ok Parameter¶
# Create repository or connect to existing one
repo = lakefs.repository("safe-repo").create(
    storage_namespace="s3://my-bucket/repos/safe",
    exist_ok=True  # Don't fail if repository exists
)
# This will either create a new repository or return the existing one
print(f"Repository ready: {repo.id}")
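When `exist_ok=True` hands back an existing repository, it can be worth checking that its storage namespace is the one you expected, since the value you passed may not be the one the existing repository uses. A minimal sketch of such a check, assuming the properties object exposes `storage_namespace` as in the examples on this page; `namespace_matches` and the `Props` stand-in are hypothetical helpers, not part of the SDK.

```python
from typing import NamedTuple


class Props(NamedTuple):
    """Stand-in for repository properties, used only to make the sketch runnable."""
    id: str
    storage_namespace: str


def namespace_matches(props, expected_namespace: str) -> bool:
    """Compare storage namespaces, ignoring a trailing slash."""
    return props.storage_namespace.rstrip("/") == expected_namespace.rstrip("/")


existing = Props(id="safe-repo", storage_namespace="s3://my-bucket/repos/safe/")
print(namespace_matches(existing, "s3://my-bucket/repos/safe"))     # True
print(namespace_matches(existing, "s3://other-bucket/repos/safe"))  # False
```

With a real repository you would pass `repo.properties` instead of the stand-in.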
Connecting to Existing Repositories¶
Basic Connection¶
# Connect to an existing repository (no server call yet)
repo = lakefs.repository("existing-repo")
print(f"Repository: {repo.id}")
# Accessing properties triggers the server call
print(f"Created: {repo.properties.creation_date}")
print(f"Storage: {repo.properties.storage_namespace}")
With Custom Client¶
from lakefs.client import Client
# Use custom client configuration
client = Client(
    username="custom-access-key",
    password="custom-secret-key",
    host="https://my-lakefs.example.com"
)
repo = lakefs.Repository("my-repo", client=client)
Repository Properties and Metadata¶
Accessing Repository Properties¶
repo = lakefs.repository("my-repo")
# Get repository properties
props = repo.properties
print(f"Repository ID: {props.id}")
print(f"Creation Date: {props.creation_date}")
print(f"Default Branch: {props.default_branch}")
print(f"Storage Namespace: {props.storage_namespace}")
# Properties are cached after first access
print(f"Same properties object: {props is repo.properties}")
Expected Output:
Repository ID: my-repo
Creation Date: 1640995200
Default Branch: main
Storage Namespace: s3://my-bucket/repos/my-repo
Same properties object: True
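The creation date in the output above appears to be a Unix timestamp. A minimal sketch for rendering it in a readable form, assuming the value is seconds since the epoch; `format_creation_date` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timezone


def format_creation_date(creation_date: int) -> str:
    """Render a Unix timestamp (seconds) as a UTC date string."""
    moment = datetime.fromtimestamp(creation_date, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


print(format_creation_date(1640995200))  # 2022-01-01 00:00:00 UTC
```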
Repository Metadata¶
# Access repository metadata
metadata = repo.metadata
print(f"Metadata: {metadata}")
# Metadata is a dictionary of key-value pairs
for key, value in metadata.items():
    print(f"{key}: {value}")
Listing Repositories¶
List All Repositories¶
# List all repositories (using default client)
for repo in lakefs.repositories():
    print(f"Repository: {repo.id}")
    print(f" Storage: {repo.properties.storage_namespace}")
    print(f" Default branch: {repo.properties.default_branch}")
    print(f" Created: {repo.properties.creation_date}")
    print()
Expected Output:
Repository: repo1
 Storage: s3://bucket1/repos/repo1
 Default branch: main
 Created: 1640995200

Repository: repo2
 Storage: s3://bucket2/repos/repo2
 Default branch: develop
 Created: 1641081600
Filtered Repository Listing¶
# List repositories with prefix filter
for repo in lakefs.repositories(prefix="prod-"):
    print(f"Production repo: {repo.id}")
# List repositories with pagination
for repo in lakefs.repositories(after="repo-m", max_amount=10):
    print(f"Repository: {repo.id}")
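To make the listing parameters concrete, the sketch below emulates them locally on a plain list of IDs. It assumes results are ordered lexicographically, that `after` is exclusive, and that `max_amount` caps the total number of results; these are assumptions for illustration, and `filter_ids` is a hypothetical helper that never contacts a server.

```python
from typing import Iterable, List, Optional


def filter_ids(ids: Iterable[str],
               prefix: Optional[str] = None,
               after: Optional[str] = None,
               max_amount: Optional[int] = None) -> List[str]:
    """Emulate prefix / after / max_amount filtering on a list of IDs."""
    result: List[str] = []
    for repo_id in sorted(ids):
        # Stop as soon as the requested number of results is reached.
        if max_amount is not None and len(result) >= max_amount:
            break
        # Keep only IDs that start with the prefix, if one was given.
        if prefix is not None and not repo_id.startswith(prefix):
            continue
        # Skip everything up to and including the "after" marker.
        if after is not None and repo_id <= after:
            continue
        result.append(repo_id)
    return result


ids = ["prod-a", "prod-b", "repo-a", "repo-m", "repo-z", "team-x"]
print(filter_ids(ids, prefix="prod-"))                 # ['prod-a', 'prod-b']
print(filter_ids(ids, after="repo-m", max_amount=10))  # ['repo-z', 'team-x']
print(filter_ids(ids, max_amount=2))                   # ['prod-a', 'prod-b']
```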
Using Custom Client for Listing¶
from lakefs.client import Client
client = Client(host="https://my-lakefs.example.com")
# List repositories using custom client
for repo in lakefs.repositories(client=client, prefix="team-"):
    print(f"Team repository: {repo.id}")
Repository Navigation¶
Accessing Branches¶
repo = lakefs.repository("my-repo")
# Get specific branch
main_branch = repo.branch("main")
dev_branch = repo.branch("development")
# List all branches
print("All branches:")
for branch in repo.branches():
    print(f" {branch.id}")
# List branches with filtering
print("Feature branches:")
for branch in repo.branches(prefix="feature-"):
    print(f" {branch.id}")
Accessing Tags¶
# Get specific tag
v1_tag = repo.tag("v1.0.0")
# List all tags
print("All tags:")
for tag in repo.tags():
    print(f" {tag.id}")
# List at most five tags
print("First five tags:")
for tag in repo.tags(max_amount=5):
    print(f" {tag.id}")
Accessing References¶
# Access any reference (branch, commit, or tag)
main_ref = repo.ref("main")
commit_ref = repo.ref("c7a632d74f46c...")
tag_ref = repo.ref("v1.0.0")
# Using ref expressions
previous_commit = repo.ref("main~1")  # Parent of the latest commit on main
# Access a specific commit by its ID
specific_commit = repo.commit("c7a632d74f46c...")
Repository Operations¶
Repository Information¶
repo = lakefs.repository("my-repo")
# Display comprehensive repository information
def show_repo_info(repo):
    props = repo.properties
    metadata = repo.metadata
    print(f"Repository: {props.id}")
    print(f"Created: {props.creation_date}")
    print(f"Storage: {props.storage_namespace}")
    print(f"Default Branch: {props.default_branch}")
    if metadata:
        print("Metadata:")
        for key, value in metadata.items():
            print(f" {key}: {value}")
    # Count branches and tags (each count is capped at 1000 by max_amount)
    branch_count = len(list(repo.branches(max_amount=1000)))
    tag_count = len(list(repo.tags(max_amount=1000)))
    print(f"Branches: {branch_count}")
    print(f"Tags: {tag_count}")

show_repo_info(repo)
Repository Statistics¶
def get_repo_stats(repo):
    """Get comprehensive repository statistics"""
    stats = {
        'id': repo.id,
        'properties': repo.properties._asdict(),
        'metadata': repo.metadata,
        'branches': [],
        'tags': [],
        'total_objects': 0
    }
    # Collect branch information
    for branch in repo.branches():
        branch_info = {
            'id': branch.id,
            'commit_id': branch.get_commit().id,
            # Capped at 1000 objects per branch by max_amount
            'object_count': len(list(branch.objects(max_amount=1000)))
        }
        stats['branches'].append(branch_info)
        # Objects present on several branches are counted once per branch
        stats['total_objects'] += branch_info['object_count']
    # Collect tag information
    for tag in repo.tags():
        stats['tags'].append({
            'id': tag.id,
            'commit_id': tag.get_commit().id
        })
    return stats

# Get and display stats
stats = get_repo_stats(repo)
print(f"Repository {stats['id']} has {len(stats['branches'])} branches and {len(stats['tags'])} tags")
print(f"Objects counted across all branches (up to 1000 per branch): {stats['total_objects']}")
Repository Deletion¶
Basic Deletion¶
from lakefs.exceptions import NotFoundException
repo = lakefs.repository("repo-to-delete")
try:
    repo.delete()
    print("Repository deleted successfully")
except NotFoundException:
    print("Repository not found")
Safe Deletion with Confirmation¶
def delete_repository_safely(repo_id: str, confirm: bool = False):
    """Safely delete a repository with confirmation"""
    repo = lakefs.repository(repo_id)
    if not confirm:
        print(f"This will permanently delete repository '{repo_id}'")
        print("Set confirm=True to proceed")
        return False
    try:
        # Show repository info before deletion
        props = repo.properties
        print(f"Deleting repository: {props.id}")
        print(f"Storage namespace: {props.storage_namespace}")
        repo.delete()
        print("Repository deleted successfully")
        return True
    except NotFoundException:
        print(f"Repository '{repo_id}' not found")
        return False
    except Exception as e:
        print(f"Error deleting repository: {e}")
        return False

# Usage
delete_repository_safely("test-repo", confirm=True)
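A boolean flag is easy to set by accident. As a stricter variant of the confirmation idea above, the sketch below only proceeds when the caller retypes the repository name. `delete_with_typed_confirmation` is a hypothetical helper; the delete action is passed in as a callable so the sketch runs without a server.

```python
from typing import Callable, List


def delete_with_typed_confirmation(repo_id: str, typed_name: str,
                                   delete_fn: Callable[[], None]) -> bool:
    """Run delete_fn only if typed_name exactly matches repo_id."""
    if typed_name != repo_id:
        print(f"Confirmation '{typed_name}' does not match '{repo_id}'; nothing deleted")
        return False
    delete_fn()
    print(f"Repository '{repo_id}' deleted")
    return True


# Simulated delete action that records what was deleted
deleted: List[str] = []

print(delete_with_typed_confirmation(
    "test-repo", "test-rep", lambda: deleted.append("test-repo")))   # False
print(delete_with_typed_confirmation(
    "test-repo", "test-repo", lambda: deleted.append("test-repo")))  # True
```

With the SDK, `delete_fn` could be the bound method `lakefs.repository(repo_id).delete`.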
Error Handling¶
Common Repository Errors¶
from lakefs.exceptions import (
    NotFoundException,
    ConflictException,
    NotAuthorizedException,
    ServerException
)

def handle_repository_operations():
    try:
        # Try various repository operations
        repo = lakefs.repository("my-repo").create(
            storage_namespace="s3://my-bucket/repos/my-repo"
        )
    except ConflictException:
        print("Repository already exists")
        repo = lakefs.repository("my-repo")
    except NotAuthorizedException:
        print("Not authorized to create repository")
        return None
    except ServerException as e:
        print(f"Server error: {e}")
        return None
    try:
        # Access repository properties
        props = repo.properties
        print(f"Repository: {props.id}")
    except NotFoundException:
        print("Repository not found")
        return None
    return repo
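Some failures are transient, and a caller may prefer to retry rather than give up immediately. The following is a minimal, generic retry sketch with exponential backoff; `retry` and `TransientError` are hypothetical names, and whether a given exception (for example `ServerException`) is actually worth retrying depends on the failure, so treat the choice of `retry_on` as an assumption to validate.

```python
import time
from typing import Callable, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          retry_on: Tuple[Type[BaseException], ...],
          attempts: int = 3,
          base_delay: float = 0.5) -> T:
    """Call operation, retrying on the given exception types with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class TransientError(Exception):
    """Stand-in for a retryable error in this sketch."""


state: Dict[str, int] = {"calls": 0}


def flaky() -> str:
    """Fail twice, then succeed."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("temporary failure")
    return "ok"


print(retry(flaky, (TransientError,), attempts=5, base_delay=0.0))  # ok
print(state["calls"])                                               # 3
```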
Validation and Best Practices¶
import re

# Assumed lakeFS ID rule: 3-63 characters, lowercase letters, digits, and hyphens,
# starting with a letter or digit. Check the rules for your lakeFS version.
REPO_ID_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")

def validate_repository_config(repo_id: str, storage_namespace: str):
    """Validate repository configuration before creation"""
    # Validate repository ID
    if not REPO_ID_PATTERN.fullmatch(repo_id or ""):
        raise ValueError("Repository ID must be 3-63 characters of lowercase letters, digits, and hyphens")
    # Validate storage namespace. Prefixes assumed here: s3:// (S3), gs:// (GCS),
    # https:// (Azure Blob), local:// (local storage); adjust to your installation.
    if not storage_namespace.startswith(('s3://', 'gs://', 'https://', 'local://')):
        raise ValueError("Storage namespace must start with a valid protocol (s3://, gs://, https://, local://)")
    print(f"Configuration valid for repository: {repo_id}")
    return True

# Usage
try:
    validate_repository_config("my-repo", "s3://my-bucket/repos/my-repo")
    repo = lakefs.repository("my-repo").create(
        storage_namespace="s3://my-bucket/repos/my-repo"
    )
except ValueError as e:
    print(f"Configuration error: {e}")
Advanced Repository Patterns¶
Repository Factory Pattern¶
class RepositoryManager:
    """Centralized repository management"""

    def __init__(self, client=None):
        self.client = client or lakefs.Client()
        self._repositories = {}

    def get_or_create_repository(self, repo_id: str, storage_namespace: str, **kwargs):
        """Get existing repository or create new one"""
        if repo_id in self._repositories:
            return self._repositories[repo_id]
        try:
            repo = lakefs.Repository(repo_id, client=self.client).create(
                storage_namespace=storage_namespace,
                exist_ok=True,
                **kwargs
            )
            self._repositories[repo_id] = repo
            return repo
        except Exception as e:
            print(f"Failed to get/create repository {repo_id}: {e}")
            return None

    def list_managed_repositories(self):
        """List all managed repositories"""
        return list(self._repositories.keys())

# Usage
manager = RepositoryManager()
repo1 = manager.get_or_create_repository("repo1", "s3://bucket/repo1")
repo2 = manager.get_or_create_repository("repo2", "s3://bucket/repo2")
Repository Cloning Pattern¶
from lakefs.exceptions import ConflictException, NotFoundException

def clone_repository_structure(source_repo_id: str, target_repo_id: str,
                               target_storage: str):
    """Recreate the branch and tag names of one repository in another.

    Only names are recreated; commits and data are not copied.
    """
    source = lakefs.repository(source_repo_id)
    target = lakefs.repository(target_repo_id).create(
        storage_namespace=target_storage,
        exist_ok=True
    )
    source_default = source.properties.default_branch
    target_default = target.properties.default_branch
    # Clone branches: each new branch starts from the target's default branch
    for branch in source.branches():
        if branch.id != source_default:
            try:
                target.branch(branch.id).create(source_reference=target_default)
                print(f"Cloned branch: {branch.id}")
            except ConflictException:
                print(f"Branch {branch.id} already exists")
    # Clone tags: a tag can only point at a commit that exists in the target,
    # so tags whose commit is missing there are skipped
    for tag in source.tags():
        commit_id = tag.get_commit().id
        try:
            target.tag(tag.id).create(commit_id)
            print(f"Cloned tag: {tag.id}")
        except ConflictException:
            print(f"Tag {tag.id} already exists")
        except NotFoundException:
            print(f"Skipped tag {tag.id}: commit {commit_id} not found in target")
    return target

# Usage
cloned_repo = clone_repository_structure("source-repo", "target-repo", "s3://bucket/target")
Key Points¶
- Lazy evaluation: Repository objects don't connect to the server until you access properties or call methods
- Caching: Repository properties are cached after first access for performance
- Error handling: Use specific exception types for robust error handling
- Navigation: Use repository objects as entry points to access branches, tags, and references
- Metadata: Repository metadata provides additional configuration and information
See Also¶
High-Level SDK Workflow:
- Quickstart Guide - Basic repository operations and setup
- Branch Operations - Working with branches and commits
- Object Management - Managing objects within repositories
- Transaction Patterns - Atomic operations across repositories
- Import/Export Operations - Bulk data operations

Repository Management:
- High-Level SDK Overview - Architecture and key concepts
- Advanced Features - Performance optimization and patterns
- Generated SDK Access - Direct API access for advanced operations

Alternative Approaches:
- Generated SDK Repository API - Direct API access
- lakefs-spec Limitations - Why lakefs-spec doesn't support repository management
- Boto3 Limitations - S3 compatibility doesn't include repository operations

Learning Resources:
- Data Science Tutorial - Repository setup for data science workflows
- ETL Pipeline Tutorial - Repository management in data pipelines
- ML Experiment Tracking - Repository organization for ML projects

Reference Materials:
- API Comparison - Repository features across SDKs
- Best Practices - Production deployment guidance
- Troubleshooting - Common repository issues and solutions

External Resources:
- lakeFS Repository Concepts - Core lakeFS repository concepts
- High-Level SDK API Reference - Complete repository API documentation