Data Science Process

The Data Science Process describes the phases of a data science project. It consists of seven phases:

  1. Problem Definition
  2. Data Mining (Data Collection)
  3. Data Preparation (Data Wrangling)
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering
  6. Model Building
  7. Model Evaluation

Let's go through each phase briefly:

1. Problem Definition 👨‍🏫

In this phase, we define the business objective and clarify the problem to be solved. The objective must be stated as clearly as possible.

2. Data Mining (Data Collection) 🕵

We need data in order to solve the problem. Data can be queried from a database, scraped from the web, or collected from an online service via an API. There are also public repositories such as Kaggle and the UCI Machine Learning Repository.
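
As a minimal sketch, assuming pandas and requests are available and using placeholder URLs (not real sources), data collection might look like this:

```python
import pandas as pd
import requests

# Load a CSV file hosted online (placeholder URL) into a DataFrame.
df = pd.read_csv("https://example.com/data/customers.csv")

# Collect records from a hypothetical REST API endpoint.
response = requests.get("https://api.example.com/v1/orders", params={"limit": 100})
response.raise_for_status()             # fail fast on HTTP errors
orders = pd.DataFrame(response.json())  # assumes the API returns a JSON list of records

print(df.shape, orders.shape)
```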

3. Data Preparation (Data Wrangling) 🤹

Data needs to be prepared properly before it can be analyzed. The following steps are performed during this phase (a short code sketch follows the list).

  • Data Discovery: In this first step, you get familiar with your data and try to understand it.
  • Data Structuring: Raw data can come in any shape and size, so it must be organized into a consistent structure.
  • Data Cleaning: Raw data often contains distortions such as missing values, outliers, and inconsistent formatting. These must be cleaned or fixed.
  • Data Enriching: A related third-party data set is integrated with the existing data to make it more useful. Demographic and geographic enrichment are common examples.
  • Data Validating: Check that your data set contains all the necessary data and that there are no incomplete, duplicate, or missing records, incorrect formats, or null fields.
  • Data Publishing: Once you have completed the steps above, your data is ready to be published for analysis.
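
As an invented illustration of the structuring and cleaning steps (the toy data and column names below are assumptions, not from the article), a pandas sketch might look like this:

```python
import numpy as np
import pandas as pd

# Toy raw data with the distortions described above (invented for illustration).
raw = pd.DataFrame({
    "age":   [25, 25, np.nan, 41, 250],               # a missing value and an implausible outlier
    "city":  ["paris", "paris", "London", "LONDON", "London"],
    "spend": ["10.5", "10.5", "7.0", "3.2", "99.0"],   # numbers stored as strings
})

clean = (
    raw.drop_duplicates()                              # remove exact duplicate rows
       .assign(
           city=lambda d: d["city"].str.title(),       # fix inconsistent formatting
           spend=lambda d: d["spend"].astype(float),   # correct the data type
           age=lambda d: d["age"].fillna(d["age"].median()),  # impute missing values
       )
       .query("age < 120")                             # drop the obvious outlier
)
print(clean)
```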

4. Exploratory Data Analysis (EDA) 👨‍💻

EDA is used to analyze data sets and summarize their main characteristics. It helps identify errors, discover patterns and relationships within the data, and detect outliers or anomalous events. Once this step is completed, the insights gained can inform more sophisticated analysis or modeling.
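
A few one-line pandas summaries cover the basics; the toy DataFrame below is an assumption used only for illustration:

```python
import pandas as pd

# Stand-in for the cleaned data from the previous phase (invented values).
df = pd.DataFrame({
    "age":   [25, 32, 41, 41, 58],
    "spend": [10.5, 7.0, 3.2, 4.1, 99.0],
    "city":  ["Paris", "London", "London", "Paris", "London"],
})

print(df.describe())                # summary statistics for the numeric columns
print(df.isna().sum())              # remaining missing values per column
print(df["city"].value_counts())    # distribution of a categorical feature
print(df.corr(numeric_only=True))   # pairwise correlations between numeric features
```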

5. Feature Engineering 👨‍🔧

Feature engineering is the process of applying domain knowledge to extract analytical insights from the data and make it ready for model building. It consists of the creation, transformation, extraction, and selection of features (see the sketch after the list below).

  • Feature Creation: Existing features are combined, for example through addition, subtraction, multiplication, or ratios, to create new features with greater predictive power.
  • Feature Transformation: Manipulating a feature in some way to improve its performance in the predictive model.
  • Feature Extraction: Creating new features from existing features, typically with the goal of reducing the dimensions of the features.
  • Feature Selection: Filtering out irrelevant or redundant features from the data set, usually by applying variance or correlation thresholds to decide which features to remove.
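
A minimal sketch of all four steps, assuming pandas and scikit-learn and using invented columns:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Invented data set: two informative columns and one useless constant column.
df = pd.DataFrame({
    "income":   [30_000, 52_000, 61_000, 45_000, 80_000],
    "debt":     [5_000, 12_000, 9_000, 20_000, 7_000],
    "constant": [1, 1, 1, 1, 1],
})

# Feature creation: combine existing features into a ratio.
df["debt_to_income"] = df["debt"] / df["income"]

# Feature transformation: tame a skewed feature with a log transform.
df["log_income"] = np.log1p(df["income"])

# Feature selection: drop features whose variance falls below a threshold.
selector = VarianceThreshold(threshold=0.0)       # removes the zero-variance "constant" column
selected = selector.fit_transform(df)
kept_columns = df.columns[selector.get_support()]

# Feature extraction: project the remaining features onto two principal components.
components = PCA(n_components=2).fit_transform(selected)
print(list(kept_columns), components.shape)
```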

6. Model Building 🏡

This phase involves developing separate data sets for training, testing, and production purposes. The model is then built and tuned on the training data. Finally, the resulting machine learning models are integrated into business processes and applications.
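
For example, a minimal scikit-learn sketch (the data set and model choice are placeholders, not prescribed by the article) might split the data and train a model like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A built-in data set stands in for the project's prepared data.
X, y = load_iris(return_X_y=True)

# Develop separate data sets for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build (train) the model on the training set only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```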

7. Model Evaluation 👨‍💻

This phase evaluates the quality of the model and also determines whether it meets the initial requirements of the problem.
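
Continuing the placeholder setup from the previous phase, evaluation with scikit-learn metrics might look like this (self-contained so it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Score the model on the held-out test set.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Cross-validation gives a more robust estimate of how well the model generalizes.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```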
