top of page

Data Cleaning Project with MS SQL SERVER

  • Writer: vardan
    vardan
  • Sep 9, 2024
  • 2 min read

In this project we walk through the process used for cleaning the raw real estate sales data.


BACKGROUND: Received raw real estate sales data from a client and needed to transform and clean the data to be used in a Web Application.


PROCESS: Use MS SQL SERVER to ingest the data, identified data inconsistencies, and normalized the data using processes shown below.


First let's take a look at the data:

First of all I noticed SaleDate is not consistency it has datetime format and need to change applying converting


Alter Table dbo.NashvilleHousing
   ADD SaleDateConverted Date;
update dbo.NashvilleHousing
   set SaleDateConverted=CONVERT(Date,SaleDate);

Doing this code we get following output .As you can see saledate converted to date format

While investigating our data I noticed the PropertyAddress columns have NULL values. I have decided to fill null values with compatible addresses using self join by ParcelID and UniqueID columns as the same ParcelID has the same PropretyAddress:Also I used ISNULL function which fill null values with given values:

update a
  Set PropertyAddress=ISNULL(a.PropertyAddress,b.PropertyAddress)
   from dbo.NashvilleHousing a
    join dbo.NashvilleHousing b
   on a.ParcelID=b.ParcelID
   and a.[UniqueID ]<>b.[UniqueID ]
   where a.PropertyAddress is null;

Using this code we can see our propertyaddress don't have any null values:


Breaking out Address into individual Columns (Address,City,State)

Alter Table dbo.NashvilleHousing
 ADD PropertySplitAddress Nvarchar(255);

   update dbo.NashvilleHousing
   set PropertySplitAddress= SUBSTRING(PropertyAddress,1,CHARINDEX(',',PropertyAddress)-1);

   Alter Table dbo.NashvilleHousing
   ADD PropertySplitCity Nvarchar(255);

   update dbo.NashvilleHousing
     set PropertySplitCity= SUBSTRING(PropertyAddress,CHARINDEX(',',PropertyAddress)+1,LEN(PropertyAddress));
 

Using above code i add columns and write my modified strings in the new columns respectively.

After running code look at output:

I have splited OwnerAddress columns for 3 columns. You can see it before and i show it after running code:

After running this code we get following output

Alter Table dbo.NashvilleHousing
ADD OwnerSplitAddress Nvarchar(255);

   update dbo.NashvilleHousing
   set OwnerSplitAddress= PARSENAME(REPLACE(OwnerAddress,',','.'),3);

    Alter Table dbo.NashvilleHousing
   ADD OwnerSplitCity Nvarchar(255);

   update dbo.NashvilleHousing
   set OwnerSplitCity= PARSENAME(REPLACE(OwnerAddress,',','.'),2);

    Alter Table dbo.NashvilleHousing
   ADD OwnerSplitState Nvarchar(255);

   update dbo.NashvilleHousing
   set OwnerSplitState= PARSENAME(REPLACE(OwnerAddress,',','.'),1)

Change Y and N to Yes and No in "SolidasVacant" field: This field looks in following format

We need to change Y and N to Yes and No using following code:

update dbo.NashvilleHousing
 set SoldAsVacant=case when SoldAsVacant='N' then 'No'
		              when SoldAsVacant='Y' then 'Yes'
		              else SoldAsVacant
		              END

We got perfect result)


Investigating our data I found out duplicate rows. We can see such rows in following output:

I have used window function as well as CTE:

with RowNumCTE as (Select *,
	  ROW_NUMBER() Over(
	  Partition by ParcelID,
					PropertyAddress,
					SalePrice,
					SaleDate,
					LegalReference
					Order by UniqueID
	  ) RowNumber
	   from dbo.NashvilleHousing
	  -- order by ParcelID
	)
	Delete
	--  Select *
	   from  RowNumCTE
	   where RowNumber>1
	--   order by PropertyAddress

After running above code using delete clause we removed our duplicate rows, which was validated looking this output:

At the end of project we deleted unused columns and got following output:

  Alter TAble dbo.NashvilleHousing
   drop column OwnerAddress,PropertyAddress,TaxDistrict

    Alter TAble dbo.NashvilleHousing
   drop column SaleDate

So as summarize, I want to mention I have used CTE, Window Function, Self join technic,

usefull text functions, Case When statement, as well as Alter and Update clauses for saving our modifications on the data.

 
 
 

Comments


bottom of page