Portfolios I : Enterprise-Grade Offline Data Warehouse Solution for E-Commerce

This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data.
Tech Stack
Spark
Hadoop
ZooKeeper
Airflow
PySpark
SQL
Data Warehousing
Dimensional Modelling
Big Data Engineering
E-Commerce DE
Prometheus
Grafana
Docker
MySQL
Oracle
Hive

🔗 GitHub Link with full content (code, documents, design diagrams, video, etc.)

 

▶️ Video explanation


✍️ Intro: Enterprise-Grade Offline Data Warehouse Solution for E-Commerce

This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of enterprise-grade data warehouse modeling, integrating e-commerce order data with the relevant business themes through standardized dimensional modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, deploying the big data cluster via Docker containers simplifies environment management and reduces operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iteration while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining these critical features, including dimensional data warehouse modeling, a containerized big data platform, distributed batch processing, CI/CD automation, storage and computation optimization, cluster monitoring, and BI visualization, this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.

Core Feature 1: Data Warehouse Modeling and Documentation

  • 🔥 Core Highlights:
    • Full dimensional modeling process (Star Schema / Snowflake Schema)
    • Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling; a minimal layering sketch follows after this list)
    • Business Matrix: defining & managing dimensions & fact tables
  • 📦 Deliverables:
    • Data warehouse design document (Markdown)
    • Hive SQL modeling code
    • DWH Dimensional Modelling Architecture Diagram
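As a hedged illustration of the layered modeling flow (a sketch, not code from the repository), the snippet below builds a simplified DWD-layer order fact table from an ODS source with PySpark; the database, table, and column names (ods.ods_order_info, dwd.dwd_fact_order) are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch: ODS -> DWD star-schema fact table.
# Database/table/column names are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("dwd_fact_order_build")
    .enableHiveSupport()  # read/write Hive-managed tables
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS dwd.dwd_fact_order (
        order_id      BIGINT,
        user_id       BIGINT,
        product_id    BIGINT,
        order_amount  DECIMAL(16, 2),
        order_status  STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")

# Cleanse the raw ODS rows and load one daily partition of the fact table.
spark.sql("""
    INSERT OVERWRITE TABLE dwd.dwd_fact_order PARTITION (dt = '2024-01-01')
    SELECT order_id, user_id, product_id,
           CAST(order_amount AS DECIMAL(16, 2)),
           order_status
    FROM ods.ods_order_info
    WHERE dt = '2024-01-01' AND order_id IS NOT NULL
""")
```

Partitioning the fact table by dt keeps daily reloads idempotent: re-running the job simply overwrites that day's partition.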

Core Feature 2: A Self-Built Distributed Big Data Platform

  • 🔥 Core Highlights:
    • Fully containerized deployment with Docker for quick replication
    • High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse
  • 📦 Deliverables:
    • Docker images (Open sourced on GitHub Container Registry)
    • docker-compose.yml (one-click cluster startup)
    • Infra Configuration Files for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
    • Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow
    • Commonly Used Snippets for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow (a cluster-connection sketch follows after this list)
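As a hedged example of the kind of commonly used snippet referenced above, the following shows how a PySpark client might attach to the containerized cluster; the service hostnames and ports (namenode:8020, hive-metastore:9083) are assumptions about the docker-compose network, not the project's actual names.

```python
from pyspark.sql import SparkSession

# Minimal sketch: attach a PySpark client to the Dockerized cluster.
# "namenode" and "hive-metastore" are assumed docker-compose service names;
# replace them with the real hostnames/ports used in the project.
spark = (
    SparkSession.builder
    .appName("cluster_smoke_test")
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Cheap smoke test: confirm the Hive metastore is reachable.
spark.sql("SHOW DATABASES").show()
```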

Core Feature 3: Distributed Batch Processing

  • 🔥 Core Highlights:
    • ETL processing using PySpark
    • Data ETL jobs: OLTP to DWH and DWH to OLAP (see the PySpark sketch after this list)
    • Data Warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS
    • Batch-processing job scheduling using Airflow (DAG sketch after this list)
  • 📦 Deliverables:
    • PySpark and Spark SQL Code
    • Code - Data Pipeline (OLTP -> DWH, DWH -> OLAP)
    • Code - Batch Processing (DWH Internal Transform)
    • Code - Scheduling based on Airflow (DAGs)
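To make the OLTP-to-DWH leg concrete, here is a minimal PySpark sketch (an illustration, not the repository's actual job) that pulls an order table from MySQL over JDBC and lands it in a dated ODS partition; the JDBC URL, credentials, and table names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Minimal sketch: OLTP (MySQL) -> DWH (Hive ODS layer) ingestion.
# JDBC URL, credentials, and table names are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("ods_order_ingest")
    .enableHiveSupport()
    .getOrCreate()
)

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/ecommerce")
    .option("dbtable", "order_info")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Land the raw extract, partitioned by load date
# (a production job would overwrite only the new partition).
(
    orders.withColumn("dt", lit("2024-01-01"))
    .write.mode("overwrite")
    .partitionBy("dt")
    .saveAsTable("ods.ods_order_info")
)
```

And a hedged sketch of how Airflow might chain the layer-by-layer jobs into a daily batch; the DAG id, schedule, and spark-submit paths are assumptions, not the project's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a daily batch DAG: ODS ingest, then DWD and DWS builds.
# The dag_id and script paths are illustrative placeholders.
with DAG(
    dag_id="dwh_daily_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_ods = BashOperator(
        task_id="ingest_ods",
        bash_command="spark-submit /opt/jobs/ods_order_ingest.py",
    )
    build_dwd = BashOperator(
        task_id="build_dwd",
        bash_command="spark-submit /opt/jobs/dwd_fact_order_build.py",
    )
    build_dws = BashOperator(
        task_id="build_dws",
        bash_command="spark-submit /opt/jobs/dws_order_summary.py",
    )

    # Run the layers strictly in order: ODS -> DWD -> DWS.
    ingest_ods >> build_dwd >> build_dws
```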

Core Feature 4: CI/CD Automation

  • 🔥 Core Highlights:
    • Automated Airflow DAG deployment (auto-sync with code updates)
    • Automated Spark job submission (eliminates manual spark-submit)
    • Hive table schema change detection (automatic alerts; a detection sketch follows after this list)
  • 📦 Deliverables:
    • GitHub Actions workflow pipeline .yaml
    • CI/CD code and documentation
    • Sample log screenshots
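One way to realize the schema-change detection mentioned above (a sketch under assumptions, not necessarily the approach used in this repository) is a small check that runs in CI and fails the pipeline when the live Hive schema drifts from a baseline committed to the repo; the table name and baseline path are hypothetical.

```python
import json

from pyspark.sql import SparkSession

# Minimal sketch: fail CI when a Hive table's schema drifts from a committed baseline.
# TABLE and BASELINE_PATH are illustrative placeholders.
TABLE = "dwd.dwd_fact_order"
BASELINE_PATH = "schemas/dwd_fact_order.json"

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Current column -> type mapping as seen by the metastore.
live = {f.name: f.dataType.simpleString() for f in spark.table(TABLE).schema.fields}

with open(BASELINE_PATH) as fh:
    baseline = json.load(fh)

if live != baseline:
    added = sorted(set(live) - set(baseline))
    removed = sorted(set(baseline) - set(live))
    changed = sorted(k for k in set(live) & set(baseline) if live[k] != baseline[k])
    raise SystemExit(
        f"Schema drift on {TABLE}: added={added}, removed={removed}, changed={changed}"
    )

print(f"{TABLE}: schema matches baseline")
```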

Core Feature 5: Storage & Computation Optimization

  • 🔥 Core Highlights:
    • SQL optimization (dynamic partitioning, indexing, storage partitioning)
    • Spark tuning: Salting, Skew Join Hint, Broadcast Join, reduceByKey vs. groupByKey (see the sketch after this list)
    • Hive tuning: Z-Order sorting (to speed up ClickHouse queries), Parquet + Snappy compression
  • 📦 Deliverables:
    • Pre & post optimization performance comparison
    • Spark optimization code
    • SQL execution plan screenshots
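To make two of the listed Spark tunings concrete, the sketch below shows a broadcast join for a small dimension table and key salting for a skewed join; the table names are hypothetical and the salting factor is arbitrary.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch of two tuning techniques; table names are illustrative placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

orders = spark.table("dwd.dwd_fact_order")      # large fact table
products = spark.table("dim.dim_product")       # small dimension table
user_stats = spark.table("dws.dws_user_stats")  # table joined on a skewed key

# 1) Broadcast join: ship the small dimension to every executor, avoiding a shuffle.
enriched = orders.join(F.broadcast(products), on="product_id", how="left")

# 2) Salting: split each hot join key across N buckets so no single task
#    ends up with all of that key's rows.
N = 8
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("int"))
salted_stats = user_stats.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate each row N times
)

joined = (
    salted_orders.join(salted_stats, on=["user_id", "salt"], how="left")
    .drop("salt")
)
```

Spark 3's adaptive query execution can also rebalance skewed joins automatically; explicit salting remains a portable fallback.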

Core Feature 6: DevOps - Monitoring and Alerting

  • 🔥 Core Highlights:
    • Prometheus + Grafana for performance monitoring of the Hadoop cluster and MySQL (a query sketch follows after this list)
    • AlertManager for alerting and email notifications
  • 📦 Deliverables:
    • Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager
    • Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager
    • Code - Container Metrics Exporter Start & Stop Scripts: my-start-node-exporter.sh & my-stop-node-exporter.sh
    • Key Screenshots
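As a small hedged example of consuming the monitoring stack programmatically (an illustration, not part of the repository's tooling), the snippet below asks the Prometheus HTTP API which node-exporter targets are up; the Prometheus address and the job label are assumptions about a default local deployment.

```python
import json
import urllib.parse
import urllib.request

# Minimal sketch: list which node-exporter targets Prometheus currently sees as up.
# PROM_URL and the job label are assumptions about a default local deployment.
PROM_URL = "http://localhost:9090/api/v1/query"
query = 'up{job="node_exporter"}'

url = f"{PROM_URL}?{urllib.parse.urlencode({'query': query})}"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

for series in payload["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    value = series["value"][1]  # "1" means the target is up, "0" means it is down
    print(f"{instance}: {'UP' if value == '1' else 'DOWN'}")
```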

Core Feature 7: Business Intelligence & Visualization

  • 🔥 Core Highlights:
    • PowerBI dashboards for data analysis
    • Real business-driven visualizations
    • Providing actionable business insights
  • 📦 Deliverables:
    • PowerBI visualization screenshots
    • PowerBI .pbix file
    • Key business metric explanations (BI Insights)

License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.