🔗 GitHub link with full content (code, documents, design diagrams, video, etc.)
▶️ Video explanation

✍️ Intro: Enterprise-Grade Offline Data Warehouse Solution for E-Commerce
This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it implements a complete workflow covering ETL processing, data warehouse modeling, OLAP analysis, and data visualization.

The core value of this project lies in its enterprise-grade data warehouse modeling: it integrates e-commerce order data with the relevant business themes through standardized dimensional modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, deploying the big data cluster via Docker containers simplifies environment management and reduces operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iteration while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining the critical features described below, this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.
Core Feature 1: Data Warehouse Modeling and Documentation
- 🔥 Core Highlights:
- Full dimensional modeling process (Star Schema / Snowflake Schema)
- Standardized development conventions (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling)
- Business (bus) matrix: defining & managing dimensions & fact tables
- 📦 Deliverables:
- Data warehouse design document (Markdown)
- Hive SQL modeling code (a minimal DDL sketch follows this list)
- DWH Dimensional Modeling Architecture Diagram
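
To illustrate the kind of star-schema DDL this deliverable contains, here is a minimal sketch of Hive SQL run through PySpark. The database and table names (`dwd`, `dim_user`, `fact_order`) are illustrative, not the project's actual model.

```python
# A minimal star-schema DDL sketch in Hive SQL, run through PySpark.
# The database and table names (dwd, dim_user, fact_order) are
# illustrative, not the project's actual model.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("dwh-modeling-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS dwd")

# Dimension table: one row per user; attributes used to slice facts.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dwd.dim_user (
        user_id   BIGINT,
        user_name STRING,
        city      STRING
    )
    STORED AS PARQUET
""")

# Fact table: one row per order, keyed to dimensions and
# partitioned by business date for partition pruning.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dwd.fact_order (
        order_id BIGINT,
        user_id  BIGINT,
        amount   DECIMAL(16, 2)
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")
```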


Core Feature 2: A Self-Built Distributed Big Data Platform
- 🔥 Core Highlights:
- Fully containerized deployment with Docker for quick replication
- High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse
- 📦 Deliverables:
- Docker images (open-sourced on GitHub Container Registry)
- `docker-compose.yml` (one-click cluster startup; see the startup sketch after this list)
- Infra Configuration Files for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
- Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
- Commonly Used Snippets for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
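
For context, a minimal one-click startup sketch, assuming the repo's `docker-compose.yml` is in the working directory and the Hadoop NameNode web UI is published on localhost:9870 (the Hadoop 3 default); adjust both to the actual setup.

```python
# A one-click startup sketch: bring the cluster up with docker compose,
# then poll the NameNode web UI (assumed at localhost:9870, the Hadoop 3
# default) until HDFS answers, so downstream jobs can safely start.
import subprocess
import time
import urllib.request

# Bring the whole cluster up in the background.
subprocess.run(["docker", "compose", "up", "-d"], check=True)

# Poll the NameNode web UI until it responds or we give up.
for _ in range(30):
    try:
        with urllib.request.urlopen("http://localhost:9870", timeout=5):
            print("HDFS NameNode is up")
            break
    except OSError:
        time.sleep(10)
else:
    raise RuntimeError("cluster did not become healthy in time")
```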

Core Feature 3: Distributed Batch Processing
- 🔥 Core Highlights:
- ETL processing using PySpark
- Data ETL jobs: OLTP → DWH and DWH → OLAP
- Data warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS
- Batch processing job scheduling using Airflow
- 📦 Deliverables:
- PySpark and Spark SQL code (an ingestion sketch follows this list)
- Code - Data Pipeline (OLTP → DWH, DWH → OLAP)
- Code - Batch Processing (DWH Internal Transform)
- Code - Scheduling based on Airflow (DAGs)
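
A minimal sketch of the OLTP → DWH ingestion step using PySpark's JDBC reader. The host, database, table, and credentials below are placeholders, and the job assumes the MySQL JDBC driver is already on Spark's classpath.

```python
# An OLTP -> ODS ingestion sketch using PySpark's JDBC reader. The host,
# database, table, and credentials are placeholders, not the project's
# real connection details.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("oltp-to-ods")
    .enableHiveSupport()
    .getOrCreate()
)

# Pull one OLTP table from MySQL over JDBC (full load for simplicity;
# a production job would extract incrementally by an update timestamp).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Stamp the load date and land the data in the ODS layer as a
# date-partitioned, Parquet-backed Hive table.
(
    orders.withColumn("dt", F.current_date().cast("string"))
    .write.mode("overwrite")
    .partitionBy("dt")
    .format("parquet")
    .saveAsTable("ods.ods_orders")
)
```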


Core Feature 4: CI/CD Automation
- 🔥 Core Highlights:
- Automated Airflow DAG deployment (auto-sync with code updates)
- Automated Spark job submission (eliminates manual `spark-submit`)
- Hive table schema change detection (automatic alerts; see the detection sketch after the deliverables below)
- 📦 Deliverables:
- GitHub Actions workflow pipeline (`.yaml`)
- CI/CD code and documentation
- Sample log screenshots
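
A minimal sketch of how Hive schema change detection could work: compare the live table schema against a baseline committed next to the code and alert on any difference. The table name and baseline path are illustrative, not the project's actual CI step.

```python
# A schema-drift detection sketch: compare a live Hive table's schema with
# a committed baseline and alert on any difference. The table name and
# baseline path are hypothetical.
import json
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("schema-drift-check")
    .enableHiveSupport()
    .getOrCreate()
)

TABLE = "dwd.fact_order"             # hypothetical table
BASELINE = "fact_order_schema.json"  # e.g. [["order_id", "bigint"], ...]

# Read the live schema as (column, type) pairs.
live = [(f.name, f.dataType.simpleString()) for f in spark.table(TABLE).schema]

with open(BASELINE) as fh:
    baseline = [tuple(col) for col in json.load(fh)]

if live != baseline:
    # In CI this would fail the pipeline or page the team instead of printing.
    print(f"ALERT: schema change detected in {TABLE}")
    print(f"  baseline: {baseline}")
    print(f"  live:     {live}")
else:
    print(f"{TABLE}: schema unchanged")
```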

Core Feature 5: Storage & Computation Optimization
- 🔥 Core Highlights:
- SQL optimization (dynamic partitioning, indexing, storage partitioning)
- Spark tuning: Salting, Skew Join Hint, Broadcast Join, `reduceByKey` vs. `groupByKey` (see the salting sketch after the deliverables below)
- Hive tuning: Z-Order sorting (to boost ClickHouse queries), Parquet + Snappy compression
- 📦 Deliverables:
- Pre & post optimization performance comparison
- Spark optimization code
- SQL execution plan screenshots
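
A minimal salting sketch for a skewed aggregation (data and names are illustrative): the hot key is spread across random buckets, partially aggregated per bucket, then combined, so no single task owns the entire hot key.

```python
# A salting sketch for a skewed aggregation: salt the hot key, pre-aggregate
# per (key, salt) bucket, then merge the partial sums. Data and names are
# illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# u1 is a "hot" key that would normally funnel into one reducer task.
orders = spark.createDataFrame(
    [("u1", 10.0)] * 1000 + [("u2", 5.0)] * 3,
    ["user_id", "amount"],
)

SALT_BUCKETS = 8

# Stage 1: salt, then partially aggregate per (key, salt) bucket.
partial = (
    orders.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    .groupBy("user_id", "salt")
    .agg(F.sum("amount").alias("partial_sum"))
)

# Stage 2: drop the salt and merge the partial sums into final totals.
totals = partial.groupBy("user_id").agg(
    F.sum("partial_sum").alias("total_amount")
)
totals.show()
```
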
Core Feature 6: DevOps - Monitoring and Alerting
- 🔥 Core Highlights:
- Prometheus + Grafana for performance monitoring of the Hadoop cluster and MySQL (see the metric-query sketch after the deliverables below)
- AlertManager for alerting and email notifications
- 📦 Deliverables:
- Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager
- Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager
- Code - Container Metrics Exporter Start & Stop Scripts: `my-start-node-exporter.sh` & `my-stop-node-exporter.sh`
- Key Screenshots
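
A minimal sketch of pulling one metric from the Prometheus HTTP API. The `/api/v1/query` endpoint and the `up` metric are standard Prometheus; localhost:9090 is the default port and an assumption about this deployment.

```python
# A sketch that queries the Prometheus HTTP API for the `up` metric, which
# is 1 for every target Prometheus can currently scrape and 0 otherwise.
# localhost:9090 is the Prometheus default and assumed here.
import json
import urllib.request

PROMETHEUS = "http://localhost:9090/api/v1/query"

with urllib.request.urlopen(f"{PROMETHEUS}?query=up") as resp:
    payload = json.load(resp)

for result in payload["data"]["result"]:
    instance = result["metric"].get("instance", "unknown")
    value = result["value"][1]  # value is a [timestamp, value] pair
    print(f"{instance}: up={value}")
```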

Core Feature 7: Business Intelligence & Visualization
- 🔥 Core Highlights:
- PowerBI dashboards for data analysis
- Real business-driven visualizations
- Providing actionable business insights
- 📦 Deliverables:
- PowerBI visualization screenshots
- PowerBI `.pbix` file
- Key business metric explanations (BI Insights)

License
This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.