Data Cleaning Techniques for Common Data Types Using Excel

Data analysts spend most of their time cleaning data. In real life, data looks like this:

Bad data includes inconsistent text and formatting, missing values, strange characters, etc. This article will discuss general Excel data cleaning functions and showcase examples for three common data types: Text, Number, Date.

Filter on the Age column

Excel How-To: Data → Filter

2. Use Pivot Tables to understand more about the data values.

Now we want to know how many “blank” and “NA” values there are. This count is important for deciding how to deal with the missing data.

Excel How-To: Insert → Pivot Table
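For readers who also work outside Excel, the tally a pivot table gives for one column can be sketched in Python. This is only an illustration under assumptions: the `ages` sample and the `count_values` helper are hypothetical and not taken from the article's dataset.

```python
from collections import Counter

def count_values(column):
    """Count how often each distinct value appears, like a one-column pivot table."""
    # Treat empty or whitespace-only cells as "blank" so they are tallied together.
    normalized = ["blank" if str(v).strip() == "" else str(v).strip() for v in column]
    return Counter(normalized)

# Hypothetical Age column containing blanks and "NA" markers.
ages = ["34", "", "NA", "51", "NA", " ", "34"]
counts = count_values(ages)
print(counts["blank"], counts["NA"])
```

Comparing the "blank" and "NA" counts with the total number of rows shows how much of the column is missing.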

3. Handle Missing Values

Handling missing values requires judgment about the data. Here are three options, from the least recommended to the most recommended.

3.1 If there are relatively few rows/columns with missing values, delete them.

3.2 Replace missing values with the mean

3.3 Replace missing values with a constant like -999 to group them for further analysis
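The three options above can be sketched in Python as follows. This is a minimal sketch, assuming missing cells appear as empty strings or "NA"; the helper names and the sample `ages` list are hypothetical.

```python
from statistics import mean

MISSING = {"", "NA"}  # assumed markers for missing values

def is_missing(value):
    """True when a cell is empty, None, or one of the assumed missing markers."""
    return value is None or str(value).strip() in MISSING

def drop_missing(values):
    """Option 3.1: delete the entries that have missing values."""
    return [float(v) for v in values if not is_missing(v)]

def fill_with_mean(values):
    """Option 3.2: replace missing entries with the mean of the present ones."""
    avg = mean(drop_missing(values))
    return [avg if is_missing(v) else float(v) for v in values]

def fill_with_constant(values, constant=-999.0):
    """Option 3.3: replace missing entries with a sentinel constant."""
    return [constant if is_missing(v) else float(v) for v in values]

ages = ["30", "", "NA", "50", "40"]  # hypothetical sample
print(drop_missing(ages))
print(fill_with_mean(ages))
print(fill_with_constant(ages))
```

With the sentinel approach, the -999 rows stay in the dataset and can be filtered or grouped later. They should be excluded before computing averages.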

The key to data cleaning is consistency. Once the data is consistent across rows, it is a pretty good indication it is ready for analysis.

1.1 The location column is inconsistent, and we are interested in the first word only. To extract it, we will use the Text to Columns feature.

Inconsistent data in the location column

Excel How-To: Data → Text to Columns.
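As a rough Python equivalent of keeping only the first column after a Text to Columns split, the sketch below assumes the words are separated by spaces. The `first_word` helper and the sample values are hypothetical.

```python
def first_word(cell):
    """Return the first whitespace-delimited word of a cell, or "" if the cell is empty."""
    parts = cell.split()
    return parts[0] if parts else ""

# Hypothetical location values with extra words and irregular spacing.
locations = ["Toronto ON", "Boston   MA USA", ""]
print([first_word(c) for c in locations])
```

If the real data uses another delimiter, such as a comma, the split would need to use that delimiter instead.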

1.2 The country column has inconsistent capitalization. We will standardize it by converting every value to upper case, lower case, or proper case using the =upper(), =lower(), or =proper() functions, respectively.

Inconsistent casing in the country column

Excel How-To: =upper(), =lower(), =proper()
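The same standardization can be illustrated in Python, where `str.upper()`, `str.lower()`, and `str.title()` are roughly analogous to the three Excel functions. The sample `countries` list is hypothetical, and edge cases may not behave identically in both tools.

```python
# Hypothetical country values that differ only in capitalization.
countries = ["canada", "CANADA", "cAnAdA"]

upper = [c.upper() for c in countries]   # like =upper()
lower = [c.lower() for c in countries]   # like =lower()
proper = [c.title() for c in countries]  # roughly like =proper()

print(upper, lower, proper)
```

Whichever case is chosen, applying it to the whole column collapses the variants into a single value.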

2. Cleaning Number Data

Numbers are easy to deal with. First, get an idea of the distribution to identify any outliers. A common rule of thumb treats values more than three standard deviations away from the mean as outliers. In this case, with a mean of 49.46 and a standard deviation of 18.27, find any number greater than 49.46+(3*18.27) = 104.27 or less than 49.46-(3*18.27) = -5.35, and deal with them using the same techniques you used for missing values.

Descriptive Stats for the Age Column

Excel How-To: Install the Analysis ToolPak add-in. Then, Data → Data Analysis → Descriptive Statistics
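The three-standard-deviation rule can be sketched in Python. The mean 49.46 and standard deviation 18.27 are the values quoted for the Age column; the helper names are hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption.

```python
from statistics import mean, stdev

def three_sigma_bounds(avg, sd):
    """Return the (low, high) cutoffs at three standard deviations from the mean."""
    return avg - 3 * sd, avg + 3 * sd

def find_outliers(values):
    """Return the values that fall outside the three-sigma bounds of the data itself."""
    low, high = three_sigma_bounds(mean(values), stdev(values))
    return [v for v in values if v < low or v > high]

# Cutoffs for the Age column, using the mean and standard deviation quoted above.
low, high = three_sigma_bounds(49.46, 18.27)
print(round(low, 2), round(high, 2))
```

For ages, the negative lower bound means that in practice only the upper cutoff flags anything, since a negative age is invalid regardless.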

3. Cleaning Date Data

Dates can be the trickiest to deal with. Let’s look at this example:

Use a pivot table to understand the different values available and their distribution. It seems that the correct data follows this format: Y-M-D

Filtering the good data on the date column

The remaining bad data looks like this: M/D/Y

Filtering the bad data on the date column

We will need to use Text to Columns to split the bad dates into their parts and then concatenate the parts in the order of the correct format, Y-M-D.

Text to Columns then Concatenate
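The split-and-reorder step can also be expressed in Python. This is a sketch under assumptions: the bad values are M/D/Y with a four-digit year, and the good values are already in Y-M-D form. The helper names are hypothetical.

```python
from datetime import date

def mdy_to_ymd(text):
    """Convert "M/D/Y" text to "YYYY-MM-DD"; raises ValueError for impossible dates."""
    month, day, year = (int(part) for part in text.strip().split("/"))
    # Building a date object validates the parts before formatting them.
    return date(year, month, day).isoformat()

def normalize(text):
    """Reorder slash-separated dates and leave values without slashes untouched."""
    return mdy_to_ymd(text) if "/" in text else text.strip()

print(normalize("3/7/2019"))
print(normalize("2019-03-07"))
```

Building a real `date` object rejects invalid values such as a month of 13, which plain string concatenation would not catch.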

Keep on examining the data using the filter function until it looks good.

Ensure data looks good using the filter

Finally, make sure you change the format of the column to Date.
