Warning: This blog entry was written two or more years ago. Therefore, it may contain broken links, out-dated or misleading content, or information that is just plain wrong. Please read on with caution.
Last week I posted about getting the most recent entry in each group of records. This week I want to expand on that by showing how you can use the same technique to remove duplicate entries but saving the first entry.
Again our sample data:
This time instead of returning the most recent entry for each group as a resultset we want to actually delete from the table any records that are duplicated (in this case have columns the same name and color entries) but keeping the oldest entry so that our database table becomes this:
To get the desired result we need to identify all records that are duplicated which in this case is all duplicate combinations of the name and favorite_color columns. We partition and number the rows sorting by date_entered to place the oldest one first and then use the CTE to delete those rows numbered higher than 1.
We could also use the Id column the same way since in this case it is an identity column but the Id could just as easily be a UUID so I wanted to show using an actual date column.
WITH CTE AS (
ROW_NUMBER() OVER(PARTITION BY name, favorite_color ORDER BY Date_Entered ASC) AS DuplicateCount
DELETE FROM CTE WHERE DuplicateCount > 1