Background

In the early days of business, there is only one kind of database, transactional database. As the business grows, the requirements to support data analysis emerge. To do data analysis, we can make use of data warehousing architecture. Data Warehouse is a collection of integrated databases designed to support the DSS (decision support system) function1 and are considered as a core component of BI (business intelligence)2.

In this blog post, I will present my architecture designed to be used for mobile app review analysis requirements.

Data Layers

Data in the data warehouse comes from at least one source, whether operational/transactional database, flat files, or any other data repository. The figure below shows the basic data warehouse architecture.

Basic Data Warehouse Architecture
Figure 2.1 Basic Data Warehouse Architecture

As can be seen from the previous image, there are several layers of repository. While there might be different warehousing architecture, the principles of data layer still persist. Some might skips warehouse layer entirely and directly connects staging to the data marts.

The first layer is of course the data source layer. The source layer can be made up of several operational systems, flat files, or even web API. The sources in this layer may come from internal or external third-party system. Due to the difference in data types, schemas, and semantics, the data from this source layer is considered “dirty”. By dirty we means that this data is not ready for analysis, especially aggregated analysis.

So the next layer, staging, is used as temporary repository for data cleaning and processing. This staging area collects data from source layer and prepare it for analysis.

The third layer is warehouse. While the data is already treated and prepared on the previous layer, this layer is used to construct a new OLAP specific schema called star schema. Basically we convert the structure of the data from the transactional model with focus on data normalization, streamlined storage, and write/update speed to the denormalized analytical model with focus on optimizing data reading.

A star schema is a collection of dimension tables and one or more fact tables2,3. The fact table contains transactional information while dimension table contains only master data2,3,4,5.

In the warehouse layer, we use the star schema to create a uniform and centralized data across the organization.

The data mart layer is optional in some cases. This layer provides easier access to data stored in data warehouse from different business areas. Data in this layer are in a ready-to-use format rather than the facts and dimensions table format in warehouse that needs to be joined together.

The last layer is presentation layer, where the consumer of the data is located. This consumer can be analytical tools like dashboard or BI systems, mining systems, or even regular user that access the data marts using query.

For the app review analysis, we will use all layers explained previously and are to be detailed in the next several subsections.

Data Source and Staging

The review data comes from two different sources, and both are external to the organization. To learn more about where the data comes from and what their schema looks like please visit my previous blog post, Scraping and Preparing App Review Data, that explain it in detail. The cleaning and treatment of the data is also explained in the post.

After preparing the data, now we need to change the schema to star schema which will be explained in the next subsection.

Warehouse Dimensional Model

The figure below shows the dimensional model of the warehouse layer for app review analysis.

App Review Dimensional Model
Figure 2.2.1 App Review Dimensional Model

The dimension tables are suffixed with the _dim keyword while fact tables didn’t have any prefix or suffix.

Data Marts

Data in warehouse layer are then moved to data marts for specific use cases such as analysis and data mining. The figure below shows the data marts for app review analysis.

App Review Data Marts
Figure 2.3.1 App Review Data Marts

Table app_reviews and 2021_app_reviews contains the reviews data and are used in dashboarding tools. Table sentiment_analysis_dataset and sampled_sentiment_analysis are used to train sentiment analysis while table sentiment_analysis is used to store the sentiment predicted by the ML model. Table topic_modelling contains training data for topic modelling.

Moving Data Between Layers

The movement path of data from source to the warehouse is called data pipeline. The act of moving data from one layer to another is referred to as Extract, Transform and Load, or ETL for short.

To configure the pipeline and run the ETL jobs, we use Airflow to monitor and execute our workflows. For more detailed information on the installation of Airflow please refer to my blog post, Running Airflow in Docker.

Here’s a snippet that shows how we move the review data from source layer to staging layer.

Summary

For conducting analysis on app reviews data, we construct data warehouse environment with 4 data repository layer: source, staging, warehouse, and data marts. The schema used in warehouse layer is star schema, it has two fact tables and five dimension tables. Data marts is also used here for analysis or mining with specific purpose and requirements.

The data is moved between layers periodically with an automated system, Airflow, that runs the workflows defined as DAG.

References

[1] Inmon, W. H. (2005). Building the data warehouse. John wiley & sons.

[2] Dedić, N., & Stanier, C. (2016). An evaluation of the challenges of multilingualism in data warehouse development.

[3] Cios, K. J., Swiniarski, R. W., Pedrycz, W., & Kurgan, L. A. (2007). The knowledge discovery process. In Data Mining (pp. 9-24). Springer, Boston, MA.

[4] Kimball, R., Ross, M., Thorthwaite, W., Becker, B., & Mundy, J. (2008). The data warehouse lifecycle toolkit. John Wiley & Sons.

[5] Jensen, C. S., Pedersen, T. B., & Thomsen, C. (2010). Multidimensional databases and data warehousing. Synthesis Lectures on Data Management, 2(1), 1-111.