Legal disclaimer >>> The information on this site is intended to be used for legal and ethical purposes like research, education, journalism and educating the public. Our intention is to comply with any and all applicable laws. If you can provide legal advice, please let us know.

Contribute >>> Have new or missing information? See something wrong? Use the comment section at the bottom of all pages, email or Twitter.

Stay up to date >>> Follow us on Twitter.

How I found encoding errors in rockyou.txt trying to import dictionaries into MySQL

In 2009, RockYou, a social game developer, experienced a data breach that resulted in 14 million plaintext passwords being leaked. This constitutes one of the biggest leaks of plaintext passwords ever, and is therefore a valuable resource for password researchers.

In an effort to analyze different dictionaries I wanted to import these millions of passwords (rockyou.txt) into MySQL. That turned out to be a bit more difficult that I first imagined. First I tried:

SET NAMES utf8;
 
CREATE TABLE `rockyou` (
  `password` varchar(999) CHARACTER SET utf8 DEFAULT NULL,
  KEY `password` (`password`(128))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
LOAD DATA INFILE 'C:/rockyou.txt' INTO TABLE `rockyou` CHARACTER SET utf8;
Error Code: 1366. Incorrect string value: '\xF1a' for column 'password' at row 602043

Ouch. I removed the offending line from rockyou.txt and tried again.

Error Code: 1366. Incorrect string value: '\xF1a' for column 'password' at row 675998

I removed the offending line from rockyou.txt and tried again.

Error Code: 1366. Incorrect string value: '\xC5repod' for column 'password' at row 746300

Ok, so this could take a while. It became obvious my rockyou.txt had some invalid UTF-8, and that it would be tedious to clean it up manually. I wrote a PHP script, called dictclean, that can verify that a text file has a certain encoding. Using the –cleanfile and –dirtyfile options I split the original rockyou.txt into cleanrockyou.txt (lines that are 100% valid UTF-8) and dirtyrockyou.txt (lines that are not 100% valid UTF-8).

php -f dictclean.php -- --dictfile rockyou.txt --cleanfile cleanrockyou.txt --dirtyfile dirtyrockyou.txt

All the lines that contain invalid UTF-8 show up in the terminal are written to dirtyrockyou.txt. It was 218/14344392 (0.0015 %) lines that had invalid UTF-8. Not exactly a big deal that will dramatically skew your password research - but still annoying. All the lines that were valid UTF-8 was written to cleanrockyou.txt, so let's try to import that to MySQL. Now it'll certainly work, right?

LOAD DATA INFILE 'C:/cleanrockyou.txt' INTO TABLE `rockyou` CHARACTER SET utf8;

Wrong!

Error Code: 1366. Incorrect string value: '\xF1\xA5\xAF\xAEci...' for column 'password' at row 2459744

It turns out the utf8 charset in MySQL only supports BMP (Basic Multilingual Plane) characters (0x0000-0xFFFF). We have to use utf8mb4, which has been supported since MySQL 5.5.3.

SET NAMES utf8mb4;
 
CREATE TABLE `rockyou2` (
  `password` varchar(999) CHARACTER SET utf8mb4 DEFAULT NULL,
  KEY `password` (`password`(128))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
 
LOAD DATA INFILE 'C:/cleanrockyou.txt' INTO TABLE `rockyou2` CHARACTER SET utf8mb4;
14344173 row(s) affected
Records: 14344173  Deleted: 0  Skipped: 0  Warnings: 0

Sweet, sweet success. Now, I got my copy of rockyou.txt from the SkullSecurity site. There may be a better version/source out there that does not have encoding problems. If that is the case, please let me know.

Print/export