Understanding Data Cleaning in Data Science
- HURU School
- Dec 14, 2020
Data cleaning is the process of detecting and correcting or removing errors in order to improve the quality of your data. Without it, you end up with misleading analytics and visualizations that can lead to unreliable decisions. These errors can be found in files such as Excel spreadsheets and in databases. Data cleaning is crucial in every organization, because good business decisions depend on correct data. However, data cleaning can be time consuming and expensive, and it comes with its own challenges that one has to learn to deal with. Here are some of the data cleaning challenges and their solutions.
Typical data analysis: Create a data cleaning pipeline in advance, using a scripting language such as R or Python. If the pipeline is not created in advance, cleaning becomes repetitive and tiresome. Once the data is cleaned, it should be stored in a secure location.
Big data analysis: Bigger data leads to bigger problems. Big data requires more complex analysis and regular data cleaning in order to keep the data accurate.
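As a rough illustration of what "a pipeline created in advance" could look like, here is a minimal sketch in plain Python with no third-party libraries. The step functions, the `run_pipeline` helper, and the sample data are all hypothetical names and values chosen for this example. The idea is only that each cleaning step is a small function, and the list of steps is defined once and reused.

```python
import csv
import io


def strip_whitespace(rows):
    """Trim leading and trailing white space from every text value."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


def drop_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def run_pipeline(rows, steps):
    """Apply each cleaning step in order and return the cleaned rows."""
    for step in steps:
        rows = step(rows)
    return rows


# The pipeline is defined once, in advance, and reused for every new file.
CLEANING_STEPS = [strip_whitespace, drop_duplicates]

# Hypothetical raw data; in practice this would be read from a file.
raw_csv = "name,city\n Alice ,Nairobi\nAlice,Nairobi\nBob, Mombasa\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

cleaned = run_pipeline(rows, CLEANING_STEPS)
print(cleaned)
```

The order of the steps matters here: white space is stripped first so that " Alice " and "Alice" are recognised as duplicates in the next step.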
Data cleaning involves:
Removing irrelevant values and outliers
Removing duplicate values
Correcting typos and similar errors
Converting data to the correct data types
Handling missing values
Converting special characters (e.g. commas in numeric values)
Converting numeric values stored as text/character data types
Deleting white space
Adding missing data
Handling zeros recorded in place of null values
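Several of the steps above can be sketched with pandas. This is only an illustrative example. The column names and values are made up. Treating a zero as "not recorded" and filling gaps with the median are assumptions that may not suit your data.

```python
import pandas as pd

# Hypothetical raw data showing several of the problems listed above.
raw = pd.DataFrame({
    "customer": [" Alice", "Bob ", "Bob ", "Carol", "Dan"],
    "amount": ["1,200", "350", "350", None, "0"],
})

df = raw.copy()

# Delete leading and trailing white space in the text column.
df["customer"] = df["customer"].str.strip()

# Remove commas from numeric values stored as text, then convert the
# column to a numeric type. Values that cannot be parsed become NaN.
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(",", "", regex=False), errors="coerce"
)

# Remove duplicate rows (done after stripping so "Bob " matches "Bob").
df = df.drop_duplicates().reset_index(drop=True)

# Assumption: a zero here means "not recorded", so treat it as missing.
df["amount"] = df["amount"].replace(0, float("nan"))

# Assumption: fill missing amounts with the median of the known amounts.
df["amount"] = df["amount"].fillna(df["amount"].median())

print(df)
```

Whether a zero is a real measurement or a placeholder for a missing value depends on the dataset, so that step should be checked against how the data was collected before it is applied.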
Below are some handy data cleaning tools in Python:
Dora
datacleaner
Prettypandas
Tabulate
scrubadub (for removing personally identifiable information from text)
Arrow
Beautifier
ftfy
Below are some handy data cleaning tools in R:
dplyr
data.table
ggplot2
reshape2
readr
tidyr
lubridate