Exploratory Data Analysis on E-Commerce Data

  • Time: 2018-09-23 06:22:34

Broadly speaking, data science means using advanced statistical and machine learning techniques to solve problems with data. Yet it’s often easier to dive straight into applying some fancy machine learning algorithm (and voilà, you have a prediction) without first understanding the data.

This is exactly where Exploratory Data Analysis (EDA) (as defined by Jaideep Khare) comes in. Unfortunately, it is a commonly undervalued step in the data science process.

EDA is important for at least 3 reasons:

  1. Make sure business stakeholders ask the right questions, often by exploring and visualizing data, and validate their business assumptions through thorough investigation
  2. Spot any potential anomalies in data to avoid feeding wrong data to a machine learning model
  3. Interpret the model output and test its assumptions
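As a small illustration of the second reason, here is a minimal sketch of what spotting anomalies can look like in practice. It is not taken from the notebook: the `orders` records, their field names, and the `find_anomalies` helper are all hypothetical, and the 1.5 × IQR fence is just one common rule of thumb for flagging outliers.

```python
from statistics import quantiles

# Hypothetical order records; field names and values are made up for illustration.
orders = [
    {"invoice": "A1", "quantity": 6, "unit_price": 2.55},
    {"invoice": "A2", "quantity": 8, "unit_price": 3.39},
    {"invoice": "A3", "quantity": -4, "unit_price": 2.75},   # negative quantity
    {"invoice": "A4", "quantity": 5, "unit_price": None},    # missing price
    {"invoice": "A5", "quantity": 7, "unit_price": 3.10},
    {"invoice": "A6", "quantity": 900, "unit_price": 2.95},  # suspiciously large
    {"invoice": "A7", "quantity": 6, "unit_price": 1.25},
    {"invoice": "A8", "quantity": 9, "unit_price": 4.15},
]


def find_anomalies(rows):
    """Return a dict mapping invoice id to a list of detected issues."""
    issues = {}

    # Rule-based checks: missing values and impossible quantities.
    for row in rows:
        if row["unit_price"] is None:
            issues.setdefault(row["invoice"], []).append("missing unit_price")
        if row["quantity"] <= 0:
            issues.setdefault(row["invoice"], []).append("non-positive quantity")

    # Statistical check: flag quantities above Q3 + 1.5 * IQR,
    # computed on the positive quantities only.
    positive = [row["quantity"] for row in rows if row["quantity"] > 0]
    q1, _, q3 = quantiles(positive, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)
    for row in rows:
        if row["quantity"] > upper_fence:
            issues.setdefault(row["invoice"], []).append("quantity above IQR upper fence")

    return issues


for invoice, problems in find_anomalies(orders).items():
    print(invoice, problems)
```

Rows flagged this way are candidates for a closer look, not automatic deletions; whether a negative quantity is an error or a legitimate return is exactly the kind of question EDA is meant to raise.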

There you have it. Now that we understand the “WHAT and WHY” aspects of EDA, let’s examine a dataset together and go through the “HOW” that will eventually lead us to discover some interesting patterns, as we’ll see in the next section.

We’ll focus on the overall EDA workflow, the visualizations, and their results. For technical details, please refer to my notebook on Kaggle whenever you want a more detailed understanding of the code.

To give a brief overview, this post is organized into 5 sections as follows:

  1. Context of Data
  2. Data Cleaning (a.k.a. data preprocessing)
  3. Exploratory Data Analysis
  4. Results
  5. Conclusion

Let’s get started and have fun!