Data Cleaning Project with MS SQL SERVER
- vardan
- Sep 9, 2024
- 2 min read
In this project we walk through the process used for cleaning the raw real estate sales data.
Link to Code: https://github.com/vardan74/vardan_portfolio/blob/main/Data%20Cleaning%20portfolio%20project.sql
BACKGROUND: Received raw real estate sales data from a client and needed to transform and clean the data to be used in a Web Application.
PROCESS: Use MS SQL SERVER to ingest the data, identified data inconsistencies, and normalized the data using processes shown below.
First let's take a look at the data:
First of all I noticed SaleDate is not consistency it has datetime format and need to change applying converting
Alter Table dbo.NashvilleHousing
ADD SaleDateConverted Date;
update dbo.NashvilleHousing
set SaleDateConverted=CONVERT(Date,SaleDate);
Doing this code we get following output .As you can see saledate converted to date format
While investigating our data I noticed the PropertyAddress columns have NULL values. I have decided to fill null values with compatible addresses using self join by ParcelID and UniqueID columns as the same ParcelID has the same PropretyAddress:Also I used ISNULL function which fill null values with given values:
update a
Set PropertyAddress=ISNULL(a.PropertyAddress,b.PropertyAddress)
from dbo.NashvilleHousing a
join dbo.NashvilleHousing b
on a.ParcelID=b.ParcelID
and a.[UniqueID ]<>b.[UniqueID ]
where a.PropertyAddress is null;
Using this code we can see our propertyaddress don't have any null values:
Breaking out Address into individual Columns (Address,City,State)
Alter Table dbo.NashvilleHousing
ADD PropertySplitAddress Nvarchar(255);
update dbo.NashvilleHousing
set PropertySplitAddress= SUBSTRING(PropertyAddress,1,CHARINDEX(',',PropertyAddress)-1);
Alter Table dbo.NashvilleHousing
ADD PropertySplitCity Nvarchar(255);
update dbo.NashvilleHousing
set PropertySplitCity= SUBSTRING(PropertyAddress,CHARINDEX(',',PropertyAddress)+1,LEN(PropertyAddress));
Using above code i add columns and write my modified strings in the new columns respectively.
After running code look at output:
I have splited OwnerAddress columns for 3 columns. You can see it before and i show it after running code:
After running this code we get following output
Alter Table dbo.NashvilleHousing
ADD OwnerSplitAddress Nvarchar(255);
update dbo.NashvilleHousing
set OwnerSplitAddress= PARSENAME(REPLACE(OwnerAddress,',','.'),3);
Alter Table dbo.NashvilleHousing
ADD OwnerSplitCity Nvarchar(255);
update dbo.NashvilleHousing
set OwnerSplitCity= PARSENAME(REPLACE(OwnerAddress,',','.'),2);
Alter Table dbo.NashvilleHousing
ADD OwnerSplitState Nvarchar(255);
update dbo.NashvilleHousing
set OwnerSplitState= PARSENAME(REPLACE(OwnerAddress,',','.'),1)
Change Y and N to Yes and No in "SolidasVacant" field: This field looks in following format
We need to change Y and N to Yes and No using following code:
update dbo.NashvilleHousing
set SoldAsVacant=case when SoldAsVacant='N' then 'No'
when SoldAsVacant='Y' then 'Yes'
else SoldAsVacant
END
We got perfect result)
Investigating our data I found out duplicate rows. We can see such rows in following output:
I have used window function as well as CTE:
with RowNumCTE as (Select *,
ROW_NUMBER() Over(
Partition by ParcelID,
PropertyAddress,
SalePrice,
SaleDate,
LegalReference
Order by UniqueID
) RowNumber
from dbo.NashvilleHousing
-- order by ParcelID
)
Delete
-- Select *
from RowNumCTE
where RowNumber>1
-- order by PropertyAddress
After running above code using delete clause we removed our duplicate rows, which was validated looking this output:
At the end of project we deleted unused columns and got following output:
Alter TAble dbo.NashvilleHousing
drop column OwnerAddress,PropertyAddress,TaxDistrict
Alter TAble dbo.NashvilleHousing
drop column SaleDate
So as summarize, I want to mention I have used CTE, Window Function, Self join technic,
usefull text functions, Case When statement, as well as Alter and Update clauses for saving our modifications on the data.
Comments