I wrote dictclean after running into encoding problems while handling various dictionaries and password leaks. If you want to read more about that, check out this article. dictclean 0.1 is a very early version, but feedback is welcome all the same: use the comment section below, email or Twitter.

Download

Features

  • Verifies that your text file has the encoding you want it to have
  • Reports which lines have incorrect encoding and tries to detect what the actual encoding is
  • Can generate a clean version of your text file (only the lines with correct encoding)
  • Can generate a dirty version of your text file (only the lines with incorrect encoding)
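
To make the clean/dirty split concrete, here is a minimal Python sketch of the idea. It is not dictclean's actual code (dictclean is a PHP script whose source is not shown here). The function name `split_by_encoding` and the three-line sample are hypothetical, and the sketch assumes an ASCII-compatible target encoding such as UTF-8, so that splitting on newline bytes is safe.

```python
import io


def split_by_encoding(stream, encoding="utf-8"):
    """Split a binary stream into clean and dirty lines.

    A line is clean when its bytes decode without error in `encoding`.
    Clean lines are returned as raw bytes; dirty lines are returned as
    (line_number, raw_bytes) tuples with 1-based line numbers.
    """
    clean, dirty = [], []
    for number, raw in enumerate(stream, start=1):
        line = raw.rstrip(b"\r\n")  # drop the line terminator only
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            dirty.append((number, line))
        else:
            clean.append(line)
    return clean, dirty


# Hypothetical sample: line 2 stores 'pequeña' as a single-byte (Latin-1) ñ,
# line 3 stores 'mañana' as proper UTF-8.
sample = io.BytesIO(b"password\npeque\xf1a\nma\xc3\xb1ana\n")
clean, dirty = split_by_encoding(sample)

for number, line in dirty:
    print(f"Invalid UTF-8 at line {number}: {line!r}")
print(f"Lines with invalid UTF-8: {len(dirty)}/{len(clean) + len(dirty)}")
```

Writing `clean` to one file and the byte strings in `dirty` to another would roughly mirror what the clean and dirty versions described above contain.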

Help output

C:\dictclean>php -f dictclean.php -- --help
dictclean 0.1, T. Alexander Lystad <tal@lystadonline.no> (www.thepasswordproject.com)

Usage on Windows: php -f dictclean.php -- [switches]
Usage on Linux: ./dictclean.php -- [switches]

Example use on Windows: php -f dictclean.php -- --dictfile rockyou.txt --cleanfile rockyou.clean.txt
Example use on Linux: ./dictclean.php -- --dictfile rockyou.txt --cleanfile rockyou.clean.txt

Switches:
--help                   Show help
--list-encodings         List available encodings
--encoding               The encoding you want to check for. Must be listed in --list-encodings. Defaults to UTF-8. Example: --encoding ISO-8859-1
--dictfile               The file to analyze. Example: --dictfile dictfile.txt
--cleanfile              Generate cleaned up dictfile. All lines from dictfile with valid encoding will be written to this file. Example: --cleanfile cleandict.txt
--dirtyfile              Generate dirty dictfile. All lines from dictfile with invalid encoding will be written to this file. Example: --dirtyfile dirtydict.txt

Example output

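The output below is shown in ANSI (the Windows code page), which is why the offending bytes appear as readable characters.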

C:\dictclean>php -f dictclean.php -- --dictfile rockyou.txt
dictclean 0.1 report (www.thepasswordproject.com)

Invalid UTF-8 at line 602043: 'pequeña' (Encoding could not be detected)
Invalid UTF-8 at line 675999: 'contraseña' (Encoding could not be detected)
Invalid UTF-8 at line 746302: 'Årepod' (Encoding could not be detected)
Invalid UTF-8 at line 774276: 'teextraño' (Encoding could not be detected)
Invalid UTF-8 at line 847642: 'muñeca' (Encoding could not be detected)
Invalid UTF-8 at line 861790: 'mañana' (Encoding could not be detected)
Invalid UTF-8 at line 995207: 'cariño' (Encoding could not be detected)
Invalid UTF-8 at line 1136146: 'åæø23' (Encoding could not be detected)
Invalid UTF-8 at line 1136243: '¢¸¢·¢µ¢¶¢¶¢µ¢·' (Encoding could not be detected)
Invalid UTF-8 at line 1205981: 'toñito' (Encoding could not be detected)
Invalid UTF-8 at line 1438213: 'niño26' (Encoding could not be detected)
Invalid UTF-8 at line 1497153: 'midulceniña' (Encoding could not be detected)
Invalid UTF-8 at line 1573395: 'limeño' (Encoding could not be detected)
Invalid UTF-8 at line 1601611: 'kærlighed' (Encoding could not be detected)
Invalid UTF-8 at line 1761224: 'grévistes' (Encoding could not be detected)
Invalid UTF-8 at line 2025830: 'asdfghjklñ' (Encoding could not be detected)
Invalid UTF-8 at line 2459760: 'útvarp' (Encoding could not be detected)
Invalid UTF-8 at line 2459761: 'ó96691' (Encoding could not be detected)
Invalid UTF-8 at line 2459762: 'ñññ111' (Encoding could not be detected)
Invalid UTF-8 at line 2459764: 'ñeña010307' (Encoding could not be detected)
Invalid UTF-8 at line 2459765: 'ñep123' (Encoding could not be detected)
Invalid UTF-8 at line 2459766: 'ñañoñaña' (Encoding could not be detected)
Invalid UTF-8 at line 2459767: 'ñañelito' (Encoding could not be detected)
Invalid UTF-8 at line 2459768: 'ñañel' (Encoding could not be detected)
Invalid UTF-8 at line 2459769: 'ñañassiempre' (Encoding could not be detected)
Invalid UTF-8 at line 2459779: 'í210131í' (Encoding could not be detected)
Invalid UTF-8 at line 2459780: 'ì789op' (Encoding could not be detected)
Invalid UTF-8 at line 2459793: 'ãlexandra' (Encoding could not be detected)
Invalid UTF-8 at line 2459794: 'ãdrtrato' (Encoding could not be detected)
Invalid UTF-8 at line 2459882: 'áúàìåññ' (Encoding could not be detected)
Invalid UTF-8 at line 2459883: 'á÷áå÷ñâåø' (Encoding could not be detected)
Invalid UTF-8 at line 2459884: 'áñå÷é' (Encoding could not be detected)
Invalid UTF-8 at line 2459906: 'átoso' (Encoding could not be detected)
Invalid UTF-8 at line 2459907: 'àãíðùîúé' (Encoding could not be detected)
Invalid UTF-8 at line 2459908: 'àÃÕ¹·Õèä˹ÍèÐ' (Encoding could not be detected)
Invalid UTF-8 at line 2464160: 'à¨É®Ò¾Ã' (Encoding could not be detected)
Invalid UTF-8 at line 2464457: 'ÒѹҹËй¾Ñ¼' (Encoding could not be detected)
Invalid UTF-8 at line 2464581: 'ËÃÐ2518' (Encoding could not be detected)
Invalid UTF-8 at line 2464582: 'Ë¿×ÐÃÂ' (Encoding could not be detected)
Invalid UTF-8 at line 2464584: '˹×à·Â' (Encoding could not be detected)
Invalid UTF-8 at line 2465714: 'Àµ¨Øå¨---¨' (Encoding could not be detected)
Invalid UTF-8 at line 2465715: '·¿Ã2206' (Encoding could not be detected)
Invalid UTF-8 at line 2465716: '´millencolin' (Encoding could not be detected)
Invalid UTF-8 at line 2465717: '´laurinha' (Encoding could not be detected)
Invalid UTF-8 at line 2465718: '´gerardo' (Encoding could not be detected)
Invalid UTF-8 at line 2465719: '´capitulo' (Encoding could not be detected)
Invalid UTF-8 at line 2465720: '´PIERINA' (Encoding could not be detected)
Invalid UTF-8 at line 2465721: '°pjakkur9' (Encoding could not be detected)
Invalid UTF-8 at line 2465722: '°hugo°°' (Encoding could not be detected)
Invalid UTF-8 at line 2465723: '¨¤--¨/--µå' (Encoding could not be detected)
Invalid UTF-8 at line 2465724: '¨ske0109' (Encoding could not be detected)
Invalid UTF-8 at line 2465725: '§uper!' (Encoding could not be detected)
Invalid UTF-8 at line 2465726: '¤¨¨/ÖÀÀ' (Encoding could not be detected)
Invalid UTF-8 at line 2465727: '¢¾/cair' (Encoding could not be detected)
Invalid UTF-8 at line 2465728: '¢±¢·¢¸¢·¢²¢°¢°¢·' (Encoding could not be detected)
Invalid UTF-8 at line 2465729: '€aæm§y‡!' (Encoding could not be detected)
Invalid UTF-8 at line 2465730: '€07251981' (Encoding could not be detected)
Invalid UTF-8 at line 2468221: '|gurilça2' (Encoding could not be detected)
Invalid UTF-8 at line 2468733: 'z€12345' (Encoding could not be detected)
Invalid UTF-8 at line 2474468: 'zuñiga' (Encoding could not be detected)
Invalid UTF-8 at line 2693758: 'wænnah' (Encoding could not be detected)
Invalid UTF-8 at line 2866700: 'viñaviña12' (Encoding could not be detected)
Invalid UTF-8 at line 2876205: 'virgilio ' (Encoding could not be detected)
Invalid UTF-8 at line 2934880: 'vanessiña' (Encoding could not be detected)
Invalid UTF-8 at line 2972031: 'ureña12345' (Encoding could not be detected)
Invalid UTF-8 at line 3020228: 'txtraño' (Encoding could not be detected)
Invalid UTF-8 at line 3084340: 'treyüp611' (Encoding could not be detected)
Invalid UTF-8 at line 3123134: 'topherà_' (Encoding could not be detected)
Invalid UTF-8 at line 3211336: 'tigretoño' (Encoding could not be detected)
Invalid UTF-8 at line 3267028: 'tharisiña' (Encoding could not be detected)
Invalid UTF-8 at line 3267795: 'thankyouñine' (Encoding could not be detected)
Invalid UTF-8 at line 3308434: 'tefiña' (Encoding could not be detected)
Invalid UTF-8 at line 3320375: 'teamoperú' (Encoding could not be detected)
Invalid UTF-8 at line 3325906: 'teamobbitaç' (Encoding could not be detected)
Invalid UTF-8 at line 3467193: 'sureño.' (Encoding could not be detected)
Invalid UTF-8 at line 3497538: 'sueños' (Encoding could not be detected)
Invalid UTF-8 at line 3510634: 'strømmen' (Encoding could not be detected)
Invalid UTF-8 at line 3604538: 'soñador' (Encoding could not be detected)
Invalid UTF-8 at line 3657123: 'snælda' (Encoding could not be detected)
Invalid UTF-8 at line 3755322: 'silviña' (Encoding could not be detected)
Invalid UTF-8 at line 3953866: 'scheißer' (Encoding could not be detected)
Invalid UTF-8 at line 3960133: 'sc1234§' (Encoding could not be detected)
Invalid UTF-8 at line 3964193: 'sañaverry' (Encoding could not be detected)
Invalid UTF-8 at line 4012584: 'sandiolña' (Encoding could not be detected)
Invalid UTF-8 at line 4039682: 'salobreña' (Encoding could not be detected)
Invalid UTF-8 at line 4097463: 'rímekben' (Encoding could not be detected)
Invalid UTF-8 at line 4217899: 'robertiña' (Encoding could not be detected)
Invalid UTF-8 at line 4338160: 'rebelñdeau' (Encoding could not be detected)
Invalid UTF-8 at line 4354413: 'raööoe' (Encoding could not be detected)
Invalid UTF-8 at line 4357075: 'rayssiñau' (Encoding could not be detected)
Invalid UTF-8 at line 4441175: 'qvivalñajuerga' (Encoding could not be detected)
Invalid UTF-8 at line 4491810: 'punkbñast' (Encoding could not be detected)
Invalid UTF-8 at line 4502325: 'puds1983•' (Encoding could not be detected)
Invalid UTF-8 at line 4588221: 'polöä' (Encoding could not be detected)
Invalid UTF-8 at line 4630743: 'piñuelas' (Encoding could not be detected)
Invalid UTF-8 at line 4630744: 'piñijajaasisoy' (Encoding could not be detected)
Invalid UTF-8 at line 4630745: 'piñapotoroto' (Encoding could not be detected)
Invalid UTF-8 at line 4634041: 'pitupitumpà' (Encoding could not be detected)
Invalid UTF-8 at line 4640872: 'piraña' (Encoding could not be detected)
Invalid UTF-8 at line 4642921: 'pipoka´' (Encoding could not be detected)
Invalid UTF-8 at line 4700315: 'peña55' (Encoding could not be detected)
Invalid UTF-8 at line 4717289: 'perdiña' (Encoding could not be detected)
Invalid UTF-8 at line 4718585: 'pequeña12' (Encoding could not be detected)
Invalid UTF-8 at line 4734111: 'pekeña' (Encoding could not be detected)
Invalid UTF-8 at line 4812097: 'pangäa' (Encoding could not be detected)
Invalid UTF-8 at line 4819858: 'pamelña' (Encoding could not be detected)
Invalid UTF-8 at line 4874828: 'oskuridad1ç' (Encoding could not be detected)
Invalid UTF-8 at line 4883024: 'ormeño' (Encoding could not be detected)
Invalid UTF-8 at line 4928567: 'olaniña' (Encoding could not be detected)
Invalid UTF-8 at line 5031717: 'nojokl´p' (Encoding could not be detected)
Invalid UTF-8 at line 5052634: 'niñacute' (Encoding could not be detected)
Invalid UTF-8 at line 5052635: 'niñabella' (Encoding could not be detected)
Invalid UTF-8 at line 5116085: 'neña1' (Encoding could not be detected)
Invalid UTF-8 at line 5141667: 'nenalñindahottie' (Encoding could not be detected)
Invalid UTF-8 at line 5158987: 'neciosupmnuñez' (Encoding could not be detected)
Invalid UTF-8 at line 5171543: 'nayibaç' (Encoding could not be detected)
Invalid UTF-8 at line 5246549: 'mússa' (Encoding could not be detected)
Invalid UTF-8 at line 5296554: 'muñerita' (Encoding could not be detected)
Invalid UTF-8 at line 5337473: 'msbehnjbmgl,çwsafgthbnujmfe\\'w' (Encoding could not be detected)
Invalid UTF-8 at line 5349805: 'mrjmhbntgrfmlgi,tgmñhju,.yoiyt' (Encoding could not be detected)
Invalid UTF-8 at line 5363707: 'mourão' (Encoding could not be detected)
Invalid UTF-8 at line 5393582: 'montaña' (Encoding could not be detected)
Invalid UTF-8 at line 5430411: 'mohamed     î' (Encoding could not be detected)
Invalid UTF-8 at line 5599854: 'meu.espaço' (Encoding could not be detected)
Invalid UTF-8 at line 5646644: 'megzuow£4' (Encoding could not be detected)
Invalid UTF-8 at line 5658721: 'mediokilodecarneç' (Encoding could not be detected)
Invalid UTF-8 at line 5692728: 'mañita' (Encoding could not be detected)
Invalid UTF-8 at line 5808302: 'maraña' (Encoding could not be detected)
Invalid UTF-8 at line 5815430: 'manóka' (Encoding could not be detected)
Invalid UTF-8 at line 5849600: 'malöuco' (Encoding could not be detected)
Invalid UTF-8 at line 5856686: 'malenbjö' (Encoding could not be detected)
Invalid UTF-8 at line 5888796: 'mageña' (Encoding could not be detected)
Invalid UTF-8 at line 5949679: 'lüvevn183' (Encoding could not be detected)
Invalid UTF-8 at line 5949680: 'løkløkløk' (Encoding could not be detected)
Invalid UTF-8 at line 5949681: 'lösenord' (Encoding could not be detected)
Invalid UTF-8 at line 5949682: 'läser' (Encoding could not be detected)
Invalid UTF-8 at line 5958016: 'lykilorð' (Encoding could not be detected)
Invalid UTF-8 at line 5996758: 'luisiño' (Encoding could not be detected)
Invalid UTF-8 at line 6135870: 'llavemariños' (Encoding could not be detected)
Invalid UTF-8 at line 6177235: 'lindaotoyamuñoz' (Encoding could not be detected)
Invalid UTF-8 at line 6319075: 'lauriña' (Encoding could not be detected)
Invalid UTF-8 at line 6410867: 'l0$†w1†h4l0ñ³lýh³å®†' (Encoding could not be detected)
Invalid UTF-8 at line 6450922: 'kröte31' (Encoding could not be detected)
Invalid UTF-8 at line 6530951: 'kjhasbkscabjklfsakhlafskdhlñ' (Encoding could not be detected)
Invalid UTF-8 at line 6792104: 'jüliet1991' (Encoding could not be detected)
Invalid UTF-8 at line 6792105: 'jérémie' (Encoding could not be detected)
Invalid UTF-8 at line 6899295: 'jordbær' (Encoding could not be detected)
Invalid UTF-8 at line 6925629: 'johnny£' (Encoding could not be detected)
Invalid UTF-8 at line 7164317: 'jansen-preiß' (Encoding could not be detected)
Invalid UTF-8 at line 7329740: 'ingisæti' (Encoding could not be detected)
Invalid UTF-8 at line 7491827: 'hæhæhæ' (Encoding could not be detected)
Invalid UTF-8 at line 7620989: 'herpåberget' (Encoding could not be detected)
Invalid UTF-8 at line 7708336: 'hallöchen' (Encoding could not be detected)
Invalid UTF-8 at line 7742447: 'guðmundur' (Encoding could not be detected)
Invalid UTF-8 at line 7763156: 'guapoako§' (Encoding could not be detected)
Invalid UTF-8 at line 7779861: 'gretarsdottir°1' (Encoding could not be detected)
Invalid UTF-8 at line 7789388: 'graça' (Encoding could not be detected)
Invalid UTF-8 at line 7820597: 'gonçalo' (Encoding could not be detected)
Invalid UTF-8 at line 7903670: 'gestört' (Encoding could not be detected)
Invalid UTF-8 at line 7908812: 'gerard§101' (Encoding could not be detected)
Invalid UTF-8 at line 8028604: 'friðrik ingi' (Encoding could not be detected)
Invalid UTF-8 at line 8046918: 'frança' (Encoding could not be detected)
Invalid UTF-8 at line 8136286: 'feñita' (Encoding could not be detected)
Invalid UTF-8 at line 8219347: 'eyähm.' (Encoding could not be detected)
Invalid UTF-8 at line 8385640: 'el Señor es mi Salavador' (Encoding could not be detected)
Invalid UTF-8 at line 8600262: 'diseño' (Encoding could not be detected)
Invalid UTF-8 at line 8621896: 'dieärzte' (Encoding could not be detected)
Invalid UTF-8 at line 8860515: 'cê3digo' (Encoding could not be detected)
Invalid UTF-8 at line 8888535: 'cumpleaños' (Encoding could not be detected)
Invalid UTF-8 at line 8943078: 'coño140' (Encoding could not be detected)
Invalid UTF-8 at line 8964524: 'corazonç' (Encoding could not be detected)
Invalid UTF-8 at line 8983457: 'conmuxokriño' (Encoding could not be detected)
Invalid UTF-8 at line 9264639: 'castañuelas' (Encoding could not be detected)
Invalid UTF-8 at line 9347478: 'caetano£carol' (Encoding could not be detected)
Invalid UTF-8 at line 9375082: 'bärtram' (Encoding could not be detected)
Invalid UTF-8 at line 9421415: 'bubiña' (Encoding could not be detected)
Invalid UTF-8 at line 9456854: 'briseño' (Encoding could not be detected)
Invalid UTF-8 at line 9557180: 'bn&çsf' (Encoding could not be detected)
Invalid UTF-8 at line 9562414: 'blümchen' (Encoding could not be detected)
Invalid UTF-8 at line 9577835: 'blondie!¬' (Encoding could not be detected)
Invalid UTF-8 at line 9656822: 'bhd701£' (Encoding could not be detected)
Invalid UTF-8 at line 9876186: 'añigdhkwbnwñro' (Encoding could not be detected)
Invalid UTF-8 at line 9949359: 'aslýhan123' (Encoding could not be detected)
Invalid UTF-8 at line 9950059: 'aským' (Encoding could not be detected)
Invalid UTF-8 at line 9968618: 'asdfñlkj' (Encoding could not be detected)
Invalid UTF-8 at line 9968945: 'asdfghjklöä' (Encoding could not be detected)
Invalid UTF-8 at line 10109420: 'andrás' (Encoding could not be detected)
Invalid UTF-8 at line 10196641: 'altýntepe' (Encoding could not be detected)
Invalid UTF-8 at line 10203124: 'alonsoymelñisa' (Encoding could not be detected)
Invalid UTF-8 at line 10223021: 'alitahç' (Encoding could not be detected)
Invalid UTF-8 at line 10358764: 'adjokè' (Encoding could not be detected)
Invalid UTF-8 at line 10380588: 'acompañame' (Encoding could not be detected)
Invalid UTF-8 at line 10540516: 'Teñefono' (Encoding could not be detected)
Invalid UTF-8 at line 10553786: 'TRIGEÑA' (Encoding could not be detected)
Invalid UTF-8 at line 10617302: 'Sa190Fgö' (Encoding could not be detected)
Invalid UTF-8 at line 10715225: 'PÉNISCILINAREAL' (Encoding could not be detected)
Invalid UTF-8 at line 10717396: 'Prüfung1' (Encoding could not be detected)
Invalid UTF-8 at line 10787734: 'Nadamás' (Encoding could not be detected)
Invalid UTF-8 at line 10805952: 'NALIÑA' (Encoding could not be detected)
Invalid UTF-8 at line 10808651: 'Mücke' (Encoding could not be detected)
Invalid UTF-8 at line 10808652: 'Mädchen14' (Encoding could not be detected)
Invalid UTF-8 at line 10898336: 'Lächle...' (Encoding could not be detected)
Invalid UTF-8 at line 10963152: 'Krätschi' (Encoding could not be detected)
Invalid UTF-8 at line 11036903: 'JOSUÉ' (Encoding could not be detected)
Invalid UTF-8 at line 11069574: 'Isol2113ù' (Encoding could not be detected)
Invalid UTF-8 at line 11096463: 'Hülsta23' (Encoding could not be detected)
Invalid UTF-8 at line 11128417: 'Guðný041085' (Encoding could not be detected)
Invalid UTF-8 at line 11210730: 'ELÝF1234' (Encoding could not be detected)
Invalid UTF-8 at line 11280474: 'CoñoE-' (Encoding could not be detected)
Invalid UTF-8 at line 11396591: 'BENILD§E' (Encoding could not be detected)
Invalid UTF-8 at line 11956685: '6hundrað' (Encoding could not be detected)
Invalid UTF-8 at line 11992533: '676767º' (Encoding could not be detected)
Invalid UTF-8 at line 12088787: '5¢ripkilla' (Encoding could not be detected)
Invalid UTF-8 at line 13012467: '1ilovemeño' (Encoding could not be detected)
Invalid UTF-8 at line 14287423: '-ÀÀ¨À-¤' (Encoding could not be detected)
Invalid UTF-8 at line 14288497: '-kem-¡' (Encoding could not be detected)
Invalid UTF-8 at line 14322100: '&ç&à&ç_"' (Encoding could not be detected)
Invalid UTF-8 at line 14344109: '“R3CKL3$$”' (Encoding could not be detected)
Lines with invalid UTF-8: 218/14344392 (0.0015 %)
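
As a rough illustration of why such lines are flagged, the Python sketch below checks one password from the report. It assumes that line 602043 is stored with a single byte 0xF1 for 'ñ', which is how Windows-1252 and ISO-8859-1 encode that character. The raw bytes of rockyou.txt are not shown on this page, so treat that as a guess. The helper name `is_valid` is hypothetical.

```python
def is_valid(data, encoding):
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumed raw bytes for 'pequeña' with a single-byte ñ (0xF1).
raw = b"peque\xf1a"

# 0xF1 starts a multi-byte sequence in UTF-8, but the next byte ('a')
# is not a continuation byte, so the line is not valid UTF-8.
print(is_valid(raw, "utf-8"))

# The same bytes are readable when interpreted as Windows-1252.
print(raw.decode("cp1252"))

# The percentage in the summary line, recomputed from its two counts.
invalid, total = 218, 14344392
print(f"{100 * invalid / total:.4f} %")
```

Single-byte encodings such as ISO-8859-1 accept nearly any byte sequence, so a successful decode alone says little about which encoding was really used. That may be part of what makes automatic detection hard, though the page does not say how dictclean attempts it.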