Visualizing dataset of 2 million+ passwords:


I found a data-set of password(s) on DataScienceCentral: Password and hijacked email dataset for you to test your data science skills – And for fun, I played with the data-set for an hour or so:

1) Password Length vs Frequency

1 how to choose password password length

2) Percentage of passwords having at least one special character vs passwords having no special character:

2 passwords that have special character vs the one's that dont

3) Percentage of passwords that have: at-least one number, one alphabet & one special character AND length = 8 or more.

Answer: 1.4856%

Let’s see a comparison of Passwords of length 8 or more (69.302%) vs Passwords of length 8 or more having combination of alphabets & numbers & special characters (1.485%)

4 passwords having combination of alphabets plus numbers and special characters

That’s about it for now – it was fun!


And for those interested, here are the few behind the scene technical details:

Tools I used:

1. Excel & 2. SQL Server

Note: I first tried using Google refine to augment data – but it crashed on me. So thought of using SQL Server and TSQL. And if excel 2010 supported 2+ million then I would not have needed SQL server. Anyhow – the tool used is not important here.

Initial state:

2 million passwords in a .txt file.

Information I appended to the data-set using TSQL:

1. Length of password

2. Has Alphabets?


3. Has Numbers?


4. Has special Characters?


Plus few others derived from #2, #3 & #4 like ” has alphabets+ characters + special characters?”

That’s about it for the technical details. Ping me if interested!


5 thoughts on “Visualizing dataset of 2 million+ passwords:

  1. Ben Littenberg

    Were these hacked passwords or just passwords in general? If hacked, then the idea that the common characteristics in these strings are to be avoided is a good idea. But if these are passwords that served well, then we should emulate them.

What do you think? Leave a comment below.