In 2009, RockYou, a social game developer, experienced a data breach that resulted in 14 million plaintext passwords being leaked. This constitutes one of the biggest leaks of plaintext passwords ever, and is therefore a valuable resource for password researchers.
In an effort to analyze different dictionaries I wanted to import these millions of passwords (rockyou.txt) into MySQL. That turned out to be a bit more difficult that I first imagined. First I tried:
SET NAMES utf8; CREATE TABLE `rockyou` ( `password` varchar(999) CHARACTER SET utf8 DEFAULT NULL, KEY `password` (`password`(128)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; LOAD DATA INFILE 'C:/rockyou.txt' INTO TABLE `rockyou` CHARACTER SET utf8;
Error Code: 1366. Incorrect string value: '\xF1a' for column 'password' at row 602043
Ouch. I removed the offending line from rockyou.txt and tried again.
Error Code: 1366. Incorrect string value: '\xF1a' for column 'password' at row 675998
I removed the offending line from rockyou.txt and tried again.
Error Code: 1366. Incorrect string value: '\xC5repod' for column 'password' at row 746300
Ok, so this could take a while. It became obvious my rockyou.txt had some invalid UTF-8, and that it would be tedious to clean it up manually. I wrote a PHP script, called dictclean, that can verify that a text file has a certain encoding. Using the –cleanfile and –dirtyfile options I split the original rockyou.txt into cleanrockyou.txt (lines that are 100% valid UTF-8) and dirtyrockyou.txt (lines that are not 100% valid UTF-8).
php -f dictclean.php -- --dictfile rockyou.txt --cleanfile cleanrockyou.txt --dirtyfile dirtyrockyou.txt
All the lines that contain invalid UTF-8 show up in the terminal are written to dirtyrockyou.txt. It was 218/14344392 (0.0015 %) lines that had invalid UTF-8. Not exactly a big deal that will dramatically skew your password research - but still annoying. All the lines that were valid UTF-8 was written to cleanrockyou.txt, so let's try to import that to MySQL. Now it'll certainly work, right?
LOAD DATA INFILE 'C:/cleanrockyou.txt' INTO TABLE `rockyou` CHARACTER SET utf8;
Error Code: 1366. Incorrect string value: '\xF1\xA5\xAF\xAEci...' for column 'password' at row 2459744
It turns out the utf8 charset in MySQL only supports BMP (Basic Multilingual Plane) characters (0x0000-0xFFFF). We have to use utf8mb4, which has been supported since MySQL 5.5.3.
SET NAMES utf8mb4; CREATE TABLE `rockyou2` ( `password` varchar(999) CHARACTER SET utf8mb4 DEFAULT NULL, KEY `password` (`password`(128)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; LOAD DATA INFILE 'C:/cleanrockyou.txt' INTO TABLE `rockyou2` CHARACTER SET utf8mb4;
14344173 row(s) affected Records: 14344173 Deleted: 0 Skipped: 0 Warnings: 0
Sweet, sweet success. Now, I got my copy of rockyou.txt from the SkullSecurity site. There may be a better version/source out there that does not have encoding problems. If that is the case, please let me know.