I found a data-set of passwords on DataScienceCentral, “Password and hijacked email dataset for you to test your data science skills”, and for fun I played with it for an hour or so. Here is what I looked at:
1) Password Length vs Frequency
2) Percentage of passwords having at least one special character vs passwords having no special character:
3) Percentage of passwords that have at least one number, one letter, and one special character, AND a length of 8 or more.
Answer: 1.4856%
For comparison: passwords of length 8 or more make up 69.302% of the data-set, while passwords of length 8 or more that also combine letters, numbers, and special characters make up only 1.4856%.
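For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch. It is not the TSQL I actually ran. The function name `share_meeting_criteria` and the sample list are made up for illustration, and the character classes are the same ones listed in the technical details.

```python
import re


def share_meeting_criteria(passwords, min_length=8):
    """Return (pct_long, pct_long_and_mixed) as percentages of all passwords.

    pct_long: passwords with length >= min_length.
    pct_long_and_mixed: those that also contain at least one letter,
    one number, and one special character.
    """
    total = 0
    long_count = 0
    mixed_count = 0
    for pw in passwords:
        total += 1
        if len(pw) < min_length:
            continue
        long_count += 1
        if (
            re.search(r"[a-zA-Z]", pw)
            and re.search(r"[0-9]", pw)
            and re.search(r"[^a-zA-Z0-9]", pw)
        ):
            mixed_count += 1
    if total == 0:
        # Avoid dividing by zero on an empty input.
        return 0.0, 0.0
    return 100.0 * long_count / total, 100.0 * mixed_count / total


# Toy data only, not the real data-set.
sample = ["password", "abc", "Passw0rd!", "12345678"]
pct_long, pct_long_and_mixed = share_meeting_criteria(sample)
```

On the real file you would pass in the lines of the .txt file (with the trailing newline stripped) instead of the toy list.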
That’s about it for now – it was fun!
And for those interested, here are a few behind-the-scenes technical details:
Tools I used:
1. Excel
2. SQL Server
Note: I first tried using Google Refine to augment the data, but it crashed on me, so I switched to SQL Server and TSQL. If Excel 2010 supported 2+ million rows (a worksheet tops out at 1,048,576), I would not have needed SQL Server. Anyhow, the tool used is not important here.
Initial state:
2 million passwords in a .txt file.
Information I appended to the data-set using TSQL:
1. Length of password
2. Has Alphabets?
[a-zA-Z]
3. Has Numbers?
[0-9]
4. Has special Characters?
[^a-zA-Z0-9]
Plus a few others derived from #2, #3 & #4, like “has alphabets + numbers + special characters?”
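The derived columns above can be sketched in a few lines. I did the real work in TSQL, so treat the Python below as a rough equivalent only. The function name `password_features` and the field names are hypothetical; the three character classes are the ones listed above.

```python
import re

# Character classes taken from the list above.
HAS_ALPHA = re.compile(r"[a-zA-Z]")
HAS_NUMBER = re.compile(r"[0-9]")
HAS_SPECIAL = re.compile(r"[^a-zA-Z0-9]")


def password_features(password):
    """Return the derived columns for one password (hypothetical field names)."""
    has_alpha = HAS_ALPHA.search(password) is not None
    has_number = HAS_NUMBER.search(password) is not None
    has_special = HAS_SPECIAL.search(password) is not None
    return {
        "length": len(password),
        "has_alpha": has_alpha,
        "has_number": has_number,
        "has_special": has_special,
        # Derived from the three flags above.
        "has_all_three": has_alpha and has_number and has_special,
    }


features = password_features("abc123!")
```

One thing to keep in mind: `[^a-zA-Z0-9]` counts anything outside those ranges as special, so spaces and accented letters are flagged as special characters too.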
That’s about it for the technical details. Ping me if interested!
Nice findings Paras!!
Thanks!
Were these hacked passwords or just passwords in general? If hacked, then avoiding the common characteristics in these strings is a good idea. But if these are passwords that served well, then we should emulate them.
I believed it was a “hacked passwords” data-set. I downloaded it from here: http://dazzlepod.com/site_media/txt/passwords.txt and I just emailed them to verify whether it’s a general password list or a list of hacked passwords. I’ll update you if I get a response from them.
Update: I have not received a response from them, but I investigated on my own, and it seems this is not a data-set of “hacked” passwords. I updated the blog post.
@All: sorry about the confusion.