I found a dataset of passwords on DataScienceCentral: “Password and hijacked email dataset for you to test your data science skills”. For fun, I played with the dataset for an hour or so:
1) Password Length vs Frequency
2) Percentage of passwords having at least one special character vs passwords having no special character
3) Percentage of passwords that have at least one number, one letter and one special character, AND a length of 8 or more.
Let’s compare passwords of length 8 or more (69.302%) with passwords of length 8 or more that combine letters, numbers and special characters (1.485%).
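For anyone who wants to reproduce numbers like the ones above, here is a minimal Python sketch (not the TSQL and Excel used for this post). The function names and the tiny sample list are made up for illustration, and “special character” is assumed to mean anything that is neither a letter nor a digit.

```python
from collections import Counter


def length_frequency(passwords):
    """Count how many passwords there are of each length."""
    return Counter(len(p) for p in passwords)


def percent_matching(passwords, predicate):
    """Percentage of passwords for which predicate(password) is true."""
    if not passwords:
        return 0.0
    matches = sum(1 for p in passwords if predicate(p))
    return 100.0 * matches / len(passwords)


def is_long(p):
    """Length of 8 or more."""
    return len(p) >= 8


def is_long_and_mixed(p):
    """Length of 8 or more AND at least one letter, one digit, one special character."""
    return (
        len(p) >= 8
        and any(c.isalpha() for c in p)
        and any(c.isdigit() for c in p)
        # Assumption: "special" = not a letter and not a digit.
        and any(not c.isalnum() for c in p)
    )


# Hypothetical sample; the real file has 2 million+ lines.
sample = ["123456", "password", "Passw0rd!", "qwerty12", "a1!"]

lengths = length_frequency(sample)
pct_long = percent_matching(sample, is_long)
pct_long_and_mixed = percent_matching(sample, is_long_and_mixed)
```

On the real file, the list would be read with something like `[line.rstrip("\n") for line in open(path, encoding="utf-8", errors="replace")]`, where `path` points at the downloaded .txt file.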
That’s about it for now – it was fun!
And for those interested, here are a few behind-the-scenes technical details:
Tools I used:
1. Excel
2. SQL Server
Note: I first tried using Google Refine to augment the data, but it crashed on me, so I decided to use SQL Server and TSQL. If Excel 2010 supported 2+ million rows, I would not have needed SQL Server. Anyhow, the tool used is not important here.
The dataset: 2 million+ passwords in a .txt file.
Information I appended to the data-set using TSQL:
1. Length of password
2. Has Alphabets?
3. Has Numbers?
4. Has special Characters?
Plus a few others derived from #2, #3 & #4, like “has alphabets + numbers + special characters?”
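The appended columns listed above can be sketched in Python as well. This is only an illustration of the idea, not the TSQL actually used; the function name and dictionary keys are hypothetical, and a “special character” is assumed to be any character that is neither a letter nor a digit (so a space would count as special).

```python
def password_features(password: str) -> dict:
    """Derive the per-password columns: length plus three yes/no flags."""
    has_alpha = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    # Assumption: "special" = not a letter and not a digit.
    has_special = any(not c.isalnum() for c in password)
    return {
        "length": len(password),
        "has_alpha": has_alpha,
        "has_digit": has_digit,
        "has_special": has_special,
        # Derived from the three flags above.
        "has_all_three": has_alpha and has_digit and has_special,
    }


# Example rows for a few made-up passwords.
rows = [password_features(p) for p in ["password", "abc123", "abc123!"]]
```

Each dictionary corresponds to one row of the augmented table, so the flags can be summed or averaged to get the percentages shown in the charts.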
That’s about it for the technical details. Ping me if interested!
5 thoughts on “Visualizing dataset of 2 million+ passwords:”
Nice findings Paras!!
Were these hacked passwords or just passwords in general? If hacked, then avoiding the common characteristics in these strings is a good idea. But if these are passwords that served well, then we should emulate them.
I believed it was a “hacked passwords” dataset. I downloaded it from here: http://dazzlepod.com/site_media/txt/passwords.txt and I just emailed them to verify whether it’s a general password list or a list of hacked passwords. I’ll update you if I get a response from them.
Update: I have not received a response from them. But I investigated on my own, and it seems it is not a dataset of “hacked” passwords. I updated the blog post.
@All: sorry about the confusion.