Key Facts
- ✓ Dmitry Listvin manages analytical data storage at Avito
- ✓ Real analytical workloads turn standard S3-compatible storage into the most unpredictable component of the architecture
- ✓ The article focuses on extracting maximum performance from Ceph for heavy analytical queries
- ✓ Achieving high HDD throughput is critical when running heavy analytical queries directly on stored data
Quick Summary
Dmitry Listvin manages analytical data storage at Avito and has shared the company's experience with building Lakehouse architectures on object storage systems.
The core challenge discussed is how real-world analytical workloads rapidly transform standard S3-like storage from a simple solution into the most unpredictable component of the entire architecture. The article specifically focuses on extracting maximum performance from Ceph storage systems.
Key technical considerations include achieving high HDD throughput when users need to run heavy analytical queries directly on stored data. The experience demonstrates that while object storage provides scalable foundations for Lakehouse implementations, the performance characteristics require careful optimization to handle analytical processing demands effectively.
Building Lakehouse on Object Storage
Organizations implementing Lakehouse architectures face unique challenges when building on top of object storage systems. The approach requires balancing scalability with performance, particularly when analytical workloads demand rapid data access and processing capabilities.
Standard object storage implementations, while providing reliable data persistence, often struggle with the performance characteristics needed for heavy analytical queries. This creates a critical bottleneck that can impact the entire data pipeline's effectiveness.
The architecture must account for:
- Consistent performance under varying query loads
- Efficient data retrieval patterns for analytical processing
- Scalability without sacrificing query responsiveness
- Cost-effective storage that supports active analytical workloads
"Hi everyone! My name is Dmitry Listvin, and I work on the analytical data warehouse at Avito"
— Dmitry Listvin, Data Storage Specialist at Avito
Ceph Performance Optimization
Extracting maximum performance from Ceph requires understanding how HDD-based storage behaves under analytical query loads. The system must efficiently handle high-throughput demands while maintaining reliable data access patterns.
Heavy analytical queries impose significant stress on storage infrastructure, particularly when accessing large datasets stored across distributed object storage nodes. Achieving optimal HDD throughput becomes critical for maintaining query performance and overall system responsiveness.
Performance optimization strategies focus on:
- Maximizing sequential read operations for large dataset scans
- Reducing latency through intelligent data placement
- Managing concurrent access patterns from multiple analytical queries
- Balancing storage node utilization across the cluster
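The first point above can be sketched in code: reading a large object in a few big, ordered byte ranges keeps HDD-backed OSDs doing sequential I/O, where spinning disks deliver their best throughput. This is a minimal illustration, not Avito's implementation; the bucket/key names and the 64 MiB chunk size are assumptions.

```python
def byte_ranges(object_size: int, chunk_size: int = 64 * 1024 * 1024):
    """Yield HTTP Range headers that scan an object in large, sequential chunks.

    Big chunks (here 64 MiB) favor sequential reads on HDD-backed Ceph OSDs
    over the small random reads that kill spinning-disk throughput.
    """
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1
        yield f"bytes={start}-{end}"
        start = end + 1

# With an S3 client such as boto3, the ranges would drive ranged GETs:
#   for rng in byte_ranges(size):
#       s3.get_object(Bucket="analytics", Key="events.parquet", Range=rng)
ranges = list(byte_ranges(200 * 1024 * 1024))  # a 200 MiB object -> 4 ranges
```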
Managing Analytical Workload Challenges
Real analytical workloads expose the limitations of treating object storage as a simple S3-compatible solution. The unpredictable nature of query patterns transforms storage into the most variable component of the Lakehouse architecture.
When users run heavy analytical queries directly on stored data, the storage layer must support:
- High-bandwidth data streaming for complex aggregations
- Random access patterns for exploratory analysis
- Consistent performance during peak usage periods
- Efficient metadata operations for query planning
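One common way to meet the high-bandwidth streaming requirement above is to fan ranged reads out across threads, so several storage nodes serve parts of the same object in parallel. The sketch below is generic and hedged: `fetch` stands in for a real S3 ranged GET (e.g. boto3's `get_object` with a `Range` header) and is an assumption, not the article's code.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_read(fetch, size: int, chunk: int, workers: int = 8) -> bytes:
    """Read an object of `size` bytes via concurrent ranged reads.

    `fetch(start, end)` must return the bytes of the inclusive range
    [start, end]; in production it would wrap an S3 ranged GET against
    the Ceph RADOS Gateway. `map` preserves order, so parts join cleanly.
    """
    offsets = range(0, size, chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda off: fetch(off, min(off + chunk, size) - 1),
                         offsets)
    return b"".join(parts)

# Demo with an in-memory blob standing in for the remote object.
blob = bytes(range(256)) * 4  # a 1024-byte "object"
data = parallel_read(lambda s, e: blob[s:e + 1], len(blob), chunk=100)
```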
These requirements make the storage subsystem the critical factor in overall analytical platform performance, requiring specialized optimization approaches beyond standard object storage configurations.
Key Insights and Best Practices
The experience shared by Avito demonstrates that successful Lakehouse implementations require treating storage performance as a primary architectural concern rather than an afterthought. Organizations must proactively address throughput and latency requirements.
Critical success factors include:
- Understanding the specific performance characteristics of the underlying storage technology
- Designing data layouts that optimize for analytical query patterns
- Implementing monitoring and tuning processes for continuous performance optimization
- Balancing cost considerations with performance requirements
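As a starting point for the monitoring and tuning practice above, tracking per-request throughput and tail latency makes storage regressions visible early. A minimal sketch, assuming illustrative sample values and helper names of our own:

```python
def throughput_mib_s(bytes_read: int, seconds: float) -> float:
    """Throughput of a single read in MiB/s, a useful per-request metric."""
    return bytes_read / seconds / (1024 * 1024)

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=0.99 for p99 latency."""
    ordered = sorted(samples)
    idx = max(0, int(round(p * len(ordered))) - 1)
    return ordered[idx]

# Illustrative latency samples (ms); one slow outlier dominates the tail.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 14]
p99 = percentile(latencies_ms, 0.99)            # tail latency
rate = throughput_mib_s(64 * 1024 * 1024, 0.5)  # 64 MiB read in 0.5 s
```

Alerting on p99 rather than the mean catches exactly the intermittent HDD stalls that average-based dashboards hide.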
By focusing on these areas, organizations can build Lakehouse architectures that deliver consistent analytical performance while maintaining the scalability and cost benefits of object storage foundations.