Background

Mobile or portable devices are the most widely used technology of this era. With the increase of internet coverage and the pandemic situation, people are getting used to staying indoors and limiting their outdoor activities. This results in increased usage of mobile devices in addition to increased online commerce transactions. Users are free to download and install any e-commerce app to their liking. Their choice, though personal and subjective, were influenced by the players of the e-commerce sector. Businesses are competing to provide better, cheaper, and safer shopping experiences. However, that competition produces an astronomically long list of options for users to choose from. While that’s just a natural implication of competition, users are overwhelmed by those choices. They need a comparison, a simple score, or others’ experience1 to quickly decide which one is the most suitable for their needs.

Users will look at the app reviews to compare and get general understanding of each app strengths and weaknesses. Businesses can also analyze their reviews to make improvement on their services and resolve problems on their internal business processes. Therefore, review data is very important.

In this post, I will explore the collection and cleaning process of App Reviews data. We will start from the data collection step, followed by the preparation and lastly we will show how to automate this data pipeline using Airflow.

Data Collection

There are two main sources of app reviews, Google Play Store and Apple Store. We will refer to Google Play Store as Play Store from now on, since it is a bit too much to repeat it again and again.

For scraping Play Store reviews, I use google-play-scraper library by JoMingyu. It’s a great scraper in general. It supports pausing using continuation, supports scraping app details and even app permissions, also the author is still actively maintaining it so it’s another plus for this library.

For Apple Store however, things are a bit bleak. Since my pipeline uses Python, I’m forced to look for scraping library that are written in Python. At the time I work on this project, I couldn’t find any decent scraper. The best one that I’ve found is written in JavaScript. Since I’m running out of time, I decided to just bite the bullet. I constructed an automated daily scraper using app-store-scraper library by facundoolano. The automated scraper is running on a Cloud Function instance in Google Cloud Platform. The construction of this function however, is not included in this post.

Play Store Review

Here is an example on how to scrape review data from a game called Penguin Isle.

from google_play_scraper import Sort, reviews

result, continuation_token = reviews(
    'com.fantome.penguinisle',
    lang='en',
    country='us',
    sort=Sort.MOST_RELEVANT,
    count=3,
    filter_score_with=5
)

The reviews from google-play-scraper library contains a lot of information. Let’s see what the output of the previous command is.

[
    {
        "userName": "Alyssa Williams",
        "userImage": "https://lh3.googleusercontent.com/-cVEHKr7mzv8/AAAAAAAAAAI/AAAAAAAAAAA/AKF05nB2r3GUkji31m0tC4ylFNiVMpmNWA/photo.jpg",
        "content": "This is literally the best idle game I have ever played. The penguins waddle around and live their best lives in the cutest little outfits. I just unlocked the little penguins and I have been sobbing uncontrollably for ten minutes because they are so adorable. There are only two suggestions I have for this game: more of the penguin info ads. I love them. I have learned so much about all the teeny fellas. Secondly, I would like to be able to name my 'guins so I can tell them apart.",
        "score": 5,
        "thumbsUpCount": 54,
        "reviewCreatedVersion": "1.16",
        "at": datetime.datetime(2020, 2, 24, 17, 19, 34),
        "replyContent": "Hello, We will gradually improve the various systems in the game to enhance the player's game experience. We have recorded your suggestions and feedback to the planner. If you have any other suggestions and ideas, please feel free to contact us at penguinisle@habby.com.Thank you for playing!",
        "repliedAt": datetime.datetime(2020, 2, 24, 18, 30, 42),
        "reviewId": "gp:AOqpTOE0Iy5S9Je1F8W1BgCl6l_TCFP_QN4qGtRATX3PeB5VV9aZu6UHfMWdYFF1at4qZ59xxLNHFqYLql5SL-k"
    },
    ...
]

As you can see, straight from the output the data is pretty detailed already. I’m pretty satisfied with it, so we will continue to the App Store scraping section next.

App Store Review

As I’ve explained before, the construction of the automated scraper function is not going to be explained in this post. Therefore, I can only show you the scraping result. Here’s what it looks like.

[
    {
        "id": "8301510944",
        "userName": "Sonyabelle",
        "userUrl": "https://itunes.apple.com/id/reviews/id200106552",
        "version": "2.165.0",
        "score": 4,
        "title": "Tokped terbaik",
        "text": "Tokped keren! Beli yoga matt gratis krn dpt point!",
        "url": "https://itunes.apple.com/id/review?id=1001394201&type=Purple%20Software",
        "updated": "2022-01-30T16:55:08-07:00"
    },
    ...
]

Not much information is captured by this scraper, but I’ll need to make do with it somehow. Now that we have data from two sources, we can’t possibly just merge them as once because of data type and schema issues. So, in the next section I will explain the preparation process for the review data.

Data Preparation

Since the review data are scraped from different source, aside from cleaning it we need to transform it as well. We will make sure the schema is uniform and are fit for those data. Here’s the schema for the staging data layer.

column_name             data_type        mode
=============================================
app_name                   STRING    NULLABLE
_____________________________________________
app_id                     STRING    NULLABLE
_____________________________________________
review_id                  STRING    NULLABLE
_____________________________________________
user_name                  STRING    NULLABLE
_____________________________________________
user_image                 STRING    NULLABLE
_____________________________________________
review                     STRING    NULLABLE
_____________________________________________
rating                      INT64    NULLABLE
_____________________________________________
thumbs_up_count             INT64    NULLABLE
_____________________________________________
app_version                STRING    NULLABLE
_____________________________________________
created_date            TIMESTAMP    NULLABLE
_____________________________________________
reply                      STRING    NULLABLE
_____________________________________________
replied_at              TIMESTAMP    NULLABLE
_____________________________________________
platform                   STRING    NULLABLE
_____________________________________________
dag_execution_date      TIMESTAMP    NULLABLE
_____________________________________________

For the cleaning part, we need to make sure that the data are not duplicate. To prevent duplicate data, we can do a deduplication by the review id. Next, we need to coalesce null data with empty state of the data type. For example, null string is replaced with empty string, null integer is replaced with 0, and so on.

App Details Data

Aside from review data, I also collect the app details containing their version, recent changes, description, etc. They’re scraped and processed using the same method as review data. However, since their columns is too long to show, I will only put the staging layer schema here.

Here’s how their cleaned structure look like.

column_name             data_type        mode
=============================================
app_id                     STRING    NULLABLE
_____________________________________________
alt_app_id                 STRING    NULLABLE
_____________________________________________
app_name                   STRING    NULLABLE
_____________________________________________
url                        STRING    NULLABLE
_____________________________________________
version                    STRING    NULLABLE
_____________________________________________
description                STRING    NULLABLE
_____________________________________________
platform                   STRING    NULLABLE
_____________________________________________
genre_id                   STRING    NULLABLE
_____________________________________________
genre                      STRING    NULLABLE
_____________________________________________
rating                    FLOAT64    NULLABLE
_____________________________________________
rating_count                INT64    NULLABLE
_____________________________________________
review_count                INT64    NULLABLE
_____________________________________________
original_price            FLOAT64    NULLABLE
_____________________________________________
is_on_sale                BOOLEAN    NULLABLE
_____________________________________________
is_free                   BOOLEAN    NULLABLE
_____________________________________________
currency                   STRING    NULLABLE
_____________________________________________
size_in_byte              FLOAT64    NULLABLE
_____________________________________________
min_os_version             STRING    NULLABLE
_____________________________________________
released_date           TIMESTAMP    NULLABLE
_____________________________________________
updated_date            TIMESTAMP    NULLABLE
_____________________________________________
screenshots                STRING       ARRAY
_____________________________________________
recent_changes             STRING    NULLABLE
_____________________________________________
developer                  STRING    NULLABLE
_____________________________________________
developer_id               STRING    NULLABLE
_____________________________________________
developer_email            STRING    NULLABLE
_____________________________________________
developer_website          STRING    NULLABLE
_____________________________________________
content_rating             STRING    NULLABLE
_____________________________________________
dag_execution_date      TIMESTAMP    NULLABLE
_____________________________________________

Summary

The steps of collection app review data are scraping and cleaning. App reviews data are scraped from Google Play Store using google-play-scraper and Apple Store using app-store-scraper. Aside from reviews, I also scraped app details data that contains the description, app version, recent changes, etc; to enrich the review data for analysis.

Huge thanks to the author of the scraper library, JoMingyu and facundoolano.

References

[1] Genc-Nayebi, N., & Abran, A. (2017). A systematic literature review: Opinion mining studies from mobile app store user reviews. Journal of Systems and Software, 125, 207–219. https://doi.org/10.1016/j.jss.2016.11.027