Portfolios I : Enterprise-Grade Offline Data Warehouse Solution for E-Commerce

This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data.
Tech Stack
Spark
Hadoop
ZooKeeper
Airflow
PySpark
SQL
Data Warehousing
Dimensional Modelling
Big Data Engineering
E-Commerce DE
Prometheus
Grafana
Docker
MySQL
Oracle
Hive

🔗 GitHub Link with full content (code, documents, design diagrams, video, etc.)

 

▶️ Video explanation


✍️ Intro: Enterprise-Grade Offline Data Warehouse Solution for E-Commerce

This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of enterprise-grade data warehouse modeling, integrating e-commerce order data with the relevant business themes through standardized dimensional modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, deploying the big data cluster via Docker containers simplifies environment management and reduces operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iteration while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining these critical features, including dimensional data warehouse modeling, a containerized big data platform, distributed batch processing, CI/CD automation, storage and computation optimization, cluster monitoring, and BI visualization, this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.

Core Feature 1: Data Warehouse Modeling and Documentation

  • 🔥 Core Highlights:
    • Full dimensional modeling process (Star Schema / Snowflake Schema)
    • Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling; a minimal layering sketch follows after this list)
    • Business Matrix: defining & managing dimensions & fact tables
  • 📦 Deliverables:
    • Data warehouse design document (Markdown)
    • Hive SQL modeling code
    • DWH Dimensional Modelling Architecture Diagram
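As a hedged illustration of the layered modeling flow (a sketch, not code from the repository), the snippet below builds a simplified DWD-layer order fact table from an ODS source with PySpark; the database, table, and column names (ods.ods_order_info, dwd.dwd_fact_order) are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch: ODS -> DWD star-schema fact table.
# Database/table/column names are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("dwd_fact_order_build")
    .enableHiveSupport()  # read/write Hive-managed tables
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS dwd.dwd_fact_order (
        order_id      BIGINT,
        user_id       BIGINT,
        product_id    BIGINT,
        order_amount  DECIMAL(16, 2),
        order_status  STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

# Cleanse the raw ODS rows and load one daily partition of the fact table.
spark.sql("""
    INSERT OVERWRITE TABLE dwd.dwd_fact_order PARTITION (dt = '2024-01-01')
    SELECT order_id, user_id, product_id,
           CAST(order_amount AS DECIMAL(16, 2)),
           order_status
    FROM ods.ods_order_info
    WHERE dt = '2024-01-01' AND order_id IS NOT NULL
""")
```

Partitioning the fact table by dt keeps daily reloads idempotent: re-running the job simply overwrites that day's partition.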

Core Feature 2: A Self-Built Distributed Big Data Platform

  • 🔥 Core Highlights:
    • Fully containerized deployment with Docker for quick replication
    • High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse
  • 📦 Deliverables:
    • Docker images (Open sourced on GitHub Container Registry)
    • docker-compose.yml (one-click cluster startup)
    • Infra Configuration Files for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
    • Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
    • Commonly Used Snippets for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow (a cluster-connection sketch follows after this list)
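As a hedged example of the kind of commonly used snippet referenced above, the following shows how a PySpark client might attach to the containerized cluster; the service hostnames and ports (namenode:8020, hive-metastore:9083) are assumptions about the docker-compose network, not the project's actual names.

```python
from pyspark.sql import SparkSession

# Minimal sketch: attach a PySpark client to the Dockerized cluster.
# "namenode" and "hive-metastore" are assumed docker-compose service names;
# replace them with the real hostnames/ports used in the project.
spark = (
    SparkSession.builder
    .appName("cluster_smoke_test")
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Cheap smoke test: confirm the Hive metastore is reachable.
spark.sql("SHOW DATABASES").show()
```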

Core Feature 3: Distributed Batch Processing

  • 🔥 Core Highlights:
    • ETL processing using PySpark
    • Data ETL jobs: OLTP to DWH and DWH to OLAP (see the PySpark sketch after this list)
    • Data Warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS
    • Batch-processing job scheduling using Airflow (DAG sketch after this list)
  • 📦 Deliverables:
    • PySpark and Spark SQL Code
    • Code - Data Pipeline (OLTP -> DWH, DWH -> OLAP)
    • Code - Batch Processing (DWH Internal Transform)
    • Code - Scheduling based on Airflow (DAGs)
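To make the OLTP-to-DWH leg concrete, here is a minimal PySpark sketch (an illustration, not the repository's actual job) that pulls an order table from MySQL over JDBC and lands it in a dated ODS partition; the JDBC URL, credentials, and table names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Minimal sketch: OLTP (MySQL) -> DWH (Hive ODS layer) ingestion.
# JDBC URL, credentials, and table names are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("ods_order_ingest")
    .enableHiveSupport()
    .getOrCreate()
)

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/ecommerce")
    .option("dbtable", "order_info")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Land the raw extract, partitioned by load date
# (a production job would overwrite only the new partition).
(
    orders.withColumn("dt", lit("2024-01-01"))
    .write.mode("overwrite")
    .partitionBy("dt")
    .saveAsTable("ods.ods_order_info")
)
```

And a hedged sketch of how Airflow might chain the layer-by-layer jobs into a daily batch; the DAG id, schedule, and spark-submit paths are assumptions, not the project's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a daily batch DAG: ODS ingest, then DWD and DWS builds.
# The dag_id and script paths are illustrative placeholders.
with DAG(
    dag_id="dwh_daily_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_ods = BashOperator(
        task_id="ingest_ods",
        bash_command="spark-submit /opt/jobs/ods_order_ingest.py",
    )
    build_dwd = BashOperator(
        task_id="build_dwd",
        bash_command="spark-submit /opt/jobs/dwd_fact_order_build.py",
    )
    build_dws = BashOperator(
        task_id="build_dws",
        bash_command="spark-submit /opt/jobs/dws_order_summary.py",
    )

    # Run the layers strictly in order: ODS -> DWD -> DWS.
    ingest_ods >> build_dwd >> build_dws
```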

Core Feature 4: CI/CD Automation

  • 🔥 Core Highlights:
    • Automated Airflow DAG deployment (auto-sync with code updates)
    • Automated Spark job submission (eliminates manual spark-submit)
    • Hive table schema change detection (automatic alerts; a detection sketch follows after this list)
  • 📦 Deliverables:
    • GitHub Actions workflow pipeline .yaml
    • CI/CD code and documentation
    • Sample log screenshots
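One way to realize the schema-change detection mentioned above (a sketch under assumptions, not necessarily the approach used in this repository) is a small check that runs in CI and fails the pipeline when the live Hive schema drifts from a baseline committed to the repo; the table name and baseline path are hypothetical.

```python
import json

from pyspark.sql import SparkSession

# Minimal sketch: fail CI when a Hive table's schema drifts from a committed baseline.
# TABLE and BASELINE_PATH are illustrative placeholders.
TABLE = "dwd.dwd_fact_order"
BASELINE_PATH = "schemas/dwd_fact_order.json"

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Current column -> type mapping as seen by the metastore.
live = {f.name: f.dataType.simpleString() for f in spark.table(TABLE).schema.fields}

with open(BASELINE_PATH) as fh:
    baseline = json.load(fh)

if live != baseline:
    added = sorted(set(live) - set(baseline))
    removed = sorted(set(baseline) - set(live))
    changed = sorted(k for k in set(live) & set(baseline) if live[k] != baseline[k])
    raise SystemExit(
        f"Schema drift on {TABLE}: added={added}, removed={removed}, changed={changed}"
    )

print(f"{TABLE}: schema matches baseline")
```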

Core Feature 5: Storage & Computation Optimization

  • 🔥 Core Highlights:
    • SQL optimization (dynamic partitioning, indexing, storage partitioning)
    • Spark tuning: Salting, Skew Join Hint, Broadcast Join, reduceByKey vs. groupByKey (see the sketch after this list)
    • Hive tuning: Z-Order sorting (to speed up ClickHouse queries), Parquet + Snappy compression
  • 📦 Deliverables:
    • Pre & post optimization performance comparison
    • Spark optimization code
    • SQL execution plan screenshots
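To make two of the listed Spark tunings concrete, the sketch below shows a broadcast join for a small dimension table and key salting for a skewed join; the table names are hypothetical and the salting factor is arbitrary.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of two tuning techniques; table names are illustrative placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

orders = spark.table("dwd.dwd_fact_order")      # large fact table
products = spark.table("dim.dim_product")       # small dimension table
user_stats = spark.table("dws.dws_user_stats")  # table joined on a skewed key

# 1) Broadcast join: ship the small dimension to every executor, avoiding a shuffle.
enriched = orders.join(F.broadcast(products), on="product_id", how="left")

# 2) Salting: split each hot join key across N buckets so no single task
#    ends up with all of that key's rows.
N = 8
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("int"))
salted_stats = user_stats.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate each row N times
)

joined = (
    salted_orders.join(salted_stats, on=["user_id", "salt"], how="left")
    .drop("salt")
)
```

Spark 3's adaptive query execution can also rebalance skewed joins automatically; explicit salting remains a portable fallback.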

Core Feature 6: DevOps - Monitoring and Alerting

  • 🔥 Core Highlights:
    • Prometheus + Grafana for performance monitoring of the Hadoop cluster and MySQL (a query sketch follows after this list)
    • AlertManager for alerting and email notifications
  • 📦 Deliverables:
    • Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager
    • Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager
    • Code - Container Metrics Exporter Start & Stop Scripts: my-start-node-exporter.sh & my-stop-node-exporter.sh
    • Key Screenshots
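As a small hedged example of consuming the monitoring stack programmatically (an illustration, not part of the repository's tooling), the snippet below asks the Prometheus HTTP API which node-exporter targets are up; the Prometheus address and the job label are assumptions about a default local deployment.

```python
import json
import urllib.parse
import urllib.request

# Minimal sketch: list which node-exporter targets Prometheus currently sees as up.
# PROM_URL and the job label are assumptions about a default local deployment.
PROM_URL = "http://localhost:9090/api/v1/query"
query = 'up{job="node_exporter"}'

url = f"{PROM_URL}?{urllib.parse.urlencode({'query': query})}"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

for series in payload["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    value = series["value"][1]  # "1" means the target is up, "0" means it is down
    print(f"{instance}: {'UP' if value == '1' else 'DOWN'}")
```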

Core Feature 7: Business Intelligence & Visualization

  • 🔥 Core Highlights:
    • PowerBI dashboards for data analysis
    • Real business-driven visualizations
    • Providing actionable business insights
  • 📦 Deliverables:
    • PowerBI visualization screenshots
    • PowerBI .pbix file
    • Key business metric explanations (BI Insights)

License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.