We estimate of the extent of phishing activity on the Internet via capture-recapture analysis of two major phishing site reports. Capture-recapture analysis is a population estimation technique originally developed for wildlife conservation, but is applicable in any environment wherein multiple independent parties collect reports of an activity.
Generating a meaningful population estimate for phishing activity requires addressing complex relationships between phishers and phishing reports. Phishers clandestinely occupy machines and adding evasive measures into phishing URLs to evade firewalls and other fraud-detection measures. Phishing reports, in the meantime, may be demonstrate a preference towards certain classes of phish.
We address these problems by estimating population in terms of netblocks and by clustering phishing attempts together into scams, which are phishes that demonstrate similar behavior on multiple axes. We generate population estimates using data from two different phishing reports over an 80-day period, and show that these reports capture approximately 40% of scams and 80% of CIDR /24 (256 contiguous address) netblocks involved in phishing.