A project to rapidly extract all email addresses from any files in a given path.
This project is intended to be a brand new open source version of a basic codebase I've used for the better part of a decade to extract email addresses from data breaches before loading them into HIBP. Most breaches are in a .sql or .csv format, either in a single file or across multiple files within a folder, and extraction follows a simple process:
- Extract all addresses via regex
- Convert them to lowercase
- Order them alphabetically
- Save them to an output file
- Create a report of how many unique addresses were in each file
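The steps above can be sketched roughly as follows. This is illustrative Python, not the project's actual code (which is a .NET console app); the regex is a deliberately simplified placeholder, and the function name and default filenames are assumptions:

```python
import re
from collections import Counter
from pathlib import Path

# Simplified placeholder pattern; the project's actual rules are stricter (see below).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9_+.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_addresses(paths, output_file="addresses_output.txt", report_file="report.txt"):
    """Extract, lowercase, de-duplicate, and sort email addresses from files."""
    all_addresses = set()
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(errors="ignore")
        # Extract all addresses via regex and convert them to lowercase
        found = {m.group(0).lower() for m in EMAIL_PATTERN.finditer(text)}
        counts[str(path)] = len(found)  # unique addresses per file, for the report
        all_addresses |= found
    # Order alphabetically and save to the output file
    Path(output_file).write_text("\n".join(sorted(all_addresses)))
    Path(report_file).write_text(
        "\n".join(f"{name}: {n} unique addresses" for name, n in counts.items())
    )
    return sorted(all_addresses)
```

Reading each file fully into memory keeps the sketch simple; the real tool streams large files.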
Email address validation via regex is hard, but it also doesn't need to be perfect for this use case. Inevitably, discussion turns to RFC compliance versus the practical use of certain characters when considering parsing rules. Let me explain:
- There is an argument that where the RFC allows a character such as a double quote, it should be considered a valid character and permitted in addresses
- However, a character such as a double quote is more likely to be present as the result of a parsing error rather than legitimate use
Anecdotally, point 1 is rarely true in comparison to point 2. The impact of falsely rejecting a legitimate spec-compliant address is that it doesn't end up in HIBP (i.e. low impact). The impact of allowing addresses that don't actually exist is that junk records are introduced into HIBP (also low impact). Especially when considering the likelihood of an address with obscure characters being practically used (for example, accepted into a registration form and not rejected), on balance it is preferable to reject characters that are likely the result of parsing errors. Let me rephrase and reiterate for impact: the likelihood of an email address containing obscure character patterns while being valid and actively used is so rare that it should be discarded.
After almost two years of iterations on this project, the following rules have been defined. I'll break them into alias and domain as the latter's logic also applies to various locations in the project that only do domain validation (i.e. the domain search feature):
## Alias Rules

1. Must be between 1 and 64 characters in length
2. The only allowable characters are:
- Numbers
- Letters (case insensitive)
- 4 special characters: `_` `-` `+` `.`
3. Cannot start or end with a period
4. Cannot have consecutive periods
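The alias rules above can be sketched as a rule-per-check function. This is illustrative Python, assuming nothing about the project's actual implementation:

```python
import re

# Rule 2: numbers, letters (case insensitive), and the 4 special characters _ - + .
ALIAS_ALLOWED = re.compile(r"^[A-Za-z0-9_+.-]+$")

def is_valid_alias(alias: str) -> bool:
    """Validate the local part (alias) of an email address per the rules above."""
    if not 1 <= len(alias) <= 64:
        return False                                  # rule 1: length
    if not ALIAS_ALLOWED.match(alias):
        return False                                  # rule 2: allowed characters
    if alias.startswith(".") or alias.endswith("."):
        return False                                  # rule 3: no leading/trailing period
    if ".." in alias:
        return False                                  # rule 4: no consecutive periods
    return True
```

A double-quoted alias like `"troy"` is RFC-legal but fails rule 2 here, which is exactly the trade-off described above.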
## Domain Rules
1. Must be between 4 and 255 characters in length (the shortest possible TLD is 2 characters)
2. The only allowable characters are:
- Numbers
- Letters (case insensitive)
- 2 special characters: `-` `.`
3. Cannot start or end with a period
4. Cannot start or end with a hyphen
5. Must have at least 1 period
6. Cannot have consecutive periods
7. Cannot consecutively have a hyphen then a period
8. Must have a TLD from the following IANA list: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
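The domain rules can be sketched the same way. In this illustrative Python, the IANA TLD set is stubbed with a placeholder subset; a real check would load the full list from the URL above:

```python
import re

# Placeholder subset; the real check uses the full IANA tlds-alpha-by-domain.txt list.
VALID_TLDS = {"com", "org", "net", "io"}

# Rule 2: numbers, letters (case insensitive), and the 2 special characters - .
DOMAIN_ALLOWED = re.compile(r"^[A-Za-z0-9.-]+$")

def is_valid_domain(domain: str) -> bool:
    """Validate the domain part of an email address per the rules above."""
    if not 4 <= len(domain) <= 255:
        return False                                  # rule 1: length
    if not DOMAIN_ALLOWED.match(domain):
        return False                                  # rule 2: allowed characters
    if domain[0] in ".-" or domain[-1] in ".-":
        return False                                  # rules 3-4: boundary period/hyphen
    if "." not in domain:
        return False                                  # rule 5: at least one period
    if ".." in domain:
        return False                                  # rule 6: no consecutive periods
    if "-." in domain:
        return False                                  # rule 7: no hyphen before a period
    tld = domain.rsplit(".", 1)[1]
    return tld.lower() in VALID_TLDS                  # rule 8: TLD on the known list
```

Rule 8 is the strongest junk filter of the set: a parsing artefact like `example.com'),(` can pass character checks after truncation, but a made-up TLD never matches the IANA list.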
This is an easily definable set of rules that are implemented across the codebase. They are not perfect - they can never be perfect - but they strike the best balance between spec-compliant utopia and weeding out the junk.
I've reached out and asked for support and will get things kicked off via one or two key people then seek broader input. I'm particularly interested in optimising the service across larger data sets and non text-based files, especially with the uptick of documents being dumped by ransomware crews. I'll start creating issues for the bits that need building.
Using Red Gate's SQL Data Generator, a sample (archived) file containing 10M records of typical breach data is available. This file results in exactly 10M email addresses being extracted with the current version of this app.
Syntax: AddressExtractor.exe -?
Syntax: AddressExtractor.exe -v
Syntax: AddressExtractor.exe <input [[... input]]> [-o output] [-r report]
Option | Description |
---|---|
`-?`, `-h`, `--help` | Prints the command line syntax and options |
`-v`, `--version` | Prints the application version number |
input | One or more input filenames or directories |
`-o`, `--output` output | Path and filename of the output file. Defaults to 'addresses_output.txt' |
`-r`, `--report` report | Path and filename of the report file. Defaults to 'report.txt' |
`--recursive` | Enable recursive mode for directories, which will search child directories |
`-y`, `--yes` | Automatically confirm prompts to CONTINUE without asking |
`-q`, `--quiet` | Run with less verbosity; progress messages aren't shown |
Option | Description |
---|---|
`--debug` | Enable debug mode for fine-tuned performance checking |
`--threads` num | Uses multiple threads with channels for reading from files. Defaults to 4 |
`--skip-exceptions` | Automatically confirms the CONTINUE prompt when an exception occurs |