Pythian Blog: Technical Track

Adding a Unique Constraint on a Table with Duplicate Data

After I moved back to Europe and Malta in order to set up our operations here, I was approached by a old friend of mine who wanted to know how to add a UNIQUE constraint and remove duplicates on a table, while keeping the newest records. He had been trying with ALTER TABLE but ran into problems as the older values were taken.

So, to help him out, I first solved it based on his original idea, and then figured I would post a small note about the solution here.

So, let’s say we have the following structure . . .

sql01 blogs> SHOW CREATE TABLE post1164\G
*************************** 1. row ***************************
       Table: post1164
Create Table: CREATE TABLE `post1164` (
  `a` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `b` varchar(5) DEFAULT NULL,
  `c` varchar(5) DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

. . . with a small amount of data in it:

sql01 blogs> SELECT * FROM post1164;
+---+------+------+
| a | b    | c    |
+---+------+------+
| 1 | a    | a1   | 
| 2 | a    | a2   | 
| 3 | b    | b1   | 
| 4 | c    | c2   | 
| 5 | b    | b2   | 
| 6 | c    | c1   | 
+---+------+------+
6 rows in set (0.00 sec)

Now, if I were to use his original SQL, I would get a result similar to this:

sql01 blogs> ALTER IGNORE TABLE post1164 ADD UNIQUE KEY idx1(b);
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 3  Warnings: 0

sql01 blogs> SELECT * FROM post1164;
+---+------+------+
| a | b    | c    |
+---+------+------+
| 1 | a    | a1   | 
| 3 | b    | b1   | 
| 4 | c    | c2   | 
+---+------+------+
3 rows in set (0.00 sec)

This is obviously not what I want, as I didn’t get the newest entries. So, let’s revert and solve it the correct way when using ALTER.

sql01 blogs> TRUNCATE TABLE post1164; ALTER TABLE post1164 DROP KEY idx1; INSERT INTO post1164 SELECT * FROM data1164; SELECT * FROM post1164; ALTER IGNORE TABLE post1164 ADD UNIQUE KEY idx1(b), ORDER BY a DESC; SELECT * FROM post1164;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

+---+------+------+
| a | b    | c    |
+---+------+------+
| 1 | a    | a1   | 
| 2 | a    | a2   | 
| 3 | b    | b1   | 
| 4 | c    | c2   | 
| 5 | b    | b2   | 
| 6 | c    | c1   | 
+---+------+------+
6 rows in set (0.00 sec)

Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 3  Warnings: 0

+---+------+------+
| a | b    | c    |
+---+------+------+
| 6 | c    | c1   | 
| 5 | b    | b2   | 
| 2 | a    | a2   | 
+---+------+------+
3 rows in set (0.00 sec)

Ok, so I cleared out the table, dropped the constraint and filled it up again. Then, I used the same ALTER statement, but added a ORDER BY clause, which solves the issue here, as I will only insert the first match, which will be the newest one in this specific case.

However, since I don’t like the fact that the data is now displayed in reverse, I’ll fix that too.

sql01 blogs> ALTER TABLE post1164 ENGINE=MyISAM, ORDER BY a ASC; SELECT * FROM post1164;
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

+---+------+------+
| a | b    | c    |
+---+------+------+
| 2 | a    | a2   | 
| 5 | b    | b2   | 
| 6 | c    | c1   | 
+---+------+------+
3 rows in set (0.00 sec)

Okay, so I basically did the same, but set the order back to how it should be.
Now, let’s look at another approach to solving his original problem, with another solution not based on ALTER.

sql01 blogs> CREATE TABLE tmp1164 LIKE post1164; 
Query OK, 0 rows affected (0.01 sec)

sql01 blogs> ALTER TABLE tmp1164 ADD UNIQUE KEY idx1(a);
Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

sql01 blogs> INSERT INTO tmp1164 SELECT t2.a, t2.b, t2.c  FROM (SELECT MAX(a) AS a FROM post1164 GROUP BY b) t1 INNER JOIN post1164 t2 ON t1.a = t2.a ORDER BY t2.a; RENAME TABLE post1164 TO old1164, tmp1164 TO post1164; SELECT * FROM post1164;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)

+---+------+------+
| a | b    | c    |
+---+------+------+
| 2 | a    | a2   | 
| 5 | b    | b2   | 
| 6 | c    | c1   | 
+---+------+------+
3 rows in set (0.01 sec)

So, here I did the same thing, but in a different way. Instead of using ALTER IGNORE, I chose to create another table, adding the unique constraint to it, and then inserting the cleaned-up data to it and renaming the table to put it into production.

Which one you go for will depend upon preference and the size of the table, and if, for example, you need all data right away, or if you can insert the last two hours data, then switch the tables around and start backfilling it in the background. Both ways work, but which way is your favorite? Or do you prefer solving this problem in yet another way?

No Comments Yet

Let us know what you think

Subscribe by email