• HURU School

Understanding Data Cleaning in Data Science

Data cleaning is the process of detecting and removing errors in order to improve the quality of your data. Without it, you end up with misleading analytics and visualizations that can lead to unreliable decisions. These errors can be found in files such as Excel spreadsheets and in databases. Data cleaning is crucial in every organization, because the best business decisions depend on correct data. However, data cleaning can be time-consuming and expensive, and it comes with its own challenges that one has to learn to deal with. Here are some of those challenges and their solutions.

Typical data analysis: Create a data cleansing pipeline in advance, using a scripting language such as R or Python. If the pipeline is not created in advance, the same cleaning steps have to be repeated by hand for every new dataset, which quickly becomes tiresome. Once the data is cleaned, it should be stored in a secure location.
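As a rough illustration of what a reusable pipeline could look like, here is a minimal Python sketch that uses only the standard library. The function names (`strip_whitespace`, `drop_duplicates`, `run_pipeline`) and the sample rows are made up for this example; a real pipeline would likely have more steps and might be built on a data library instead.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types for this sketch: a row is a dict of column name -> text.
Row = Dict[str, Optional[str]]
Step = Callable[[List[Row]], List[Row]]


def strip_whitespace(rows: List[Row]) -> List[Row]:
    """Trim leading and trailing whitespace from every text value."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each cleaning step in order and return the result."""
    for step in steps:
        rows = step(rows)
    return rows


# The order matters: whitespace is trimmed first so that rows differing
# only by stray spaces are recognised as duplicates.
CLEANING_STEPS: List[Step] = [strip_whitespace, drop_duplicates]

# Illustrative sample data, not from any real dataset.
raw_rows = [
    {"name": " Alice ", "city": "Nairobi"},
    {"name": "Alice", "city": "Nairobi "},
    {"name": "Bob", "city": "Mombasa"},
]

clean_rows = run_pipeline(raw_rows, CLEANING_STEPS)
print(clean_rows)
```

Because the steps are kept in one list, the same pipeline can be rerun on every new file without rewriting the cleaning logic.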

Big data analysis: Bigger data brings bigger problems. Big data requires complex computational analysis and regular data cleaning in order to maintain the accuracy of the data.

Data cleaning involves:

  • Removing irrelevant values and outliers

  • Getting rid of duplicate values

  • Correcting typos (and similar errors)

  • Converting data into the correct data types

  • Taking care of missing values

  • Converting special characters (e.g. commas in numeric values)

  • Converting numeric values stored as text/character data types

  • Deleting white space

  • Adding missing data

  • Dealing with zeros recorded instead of null values
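A few of the tasks listed above can be sketched in Python. The example below assumes the pandas library is installed, and the column names and values are invented for illustration. Filling the missing amount with the median is only one possible choice; the right treatment of missing values depends on the data.

```python
import pandas as pd

# Illustrative raw data: stray whitespace, a duplicate row, numbers stored
# as text with thousands separators, and one missing value.
raw = pd.DataFrame({
    "customer": [" Alice", "Bob ", "Bob ", "Carol"],
    "amount": ["1,200", "350", "350", None],
})

cleaned = raw.copy()

# Delete leading and trailing white space.
cleaned["customer"] = cleaned["customer"].str.strip()

# Remove commas from numeric text, then convert the text to numbers.
# Values that cannot be parsed become NaN instead of raising an error.
cleaned["amount"] = cleaned["amount"].str.replace(",", "", regex=False)
cleaned["amount"] = pd.to_numeric(cleaned["amount"], errors="coerce")

# Get rid of duplicate rows, keeping the first occurrence.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# Take care of missing values (assumption: median imputation is acceptable).
cleaned["amount"] = cleaned["amount"].fillna(cleaned["amount"].median())

print(cleaned)
```

The duplicates are dropped before the median is computed so that the repeated row does not skew the fill value.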

Below are some handy data cleaning tools in Python:

  • Dora

  • datacleaner

  • Prettypandas

  • Tabulate

  • Scrubadub (for removing personally identifiable information from text)

  • Arrow

  • Beautifier

  • ftfy

Below are some handy data cleaning tools in R:

  • dplyr

  • data.table

  • ggplot2

  • reshape2

  • readr

  • tidyr

  • lubridate
