search menu icon-carat-right cmu-wordmark

The Long & Winding Road to Production-Worthy

August 2020 Presentation
Emily Heath (Mitre)

In this presentation, attendees learned valuable skills for how to test their analytics from different perspectives. From an operational perspective, the presenter discussed how to evaluate analytics for coverage of the problem and false positives.


The MITRE Corporation



Fraudulent domains are malicious domains posing as well-known services or websites. They are used by criminal and APT groups to target victims. As a result, identifying them is of particular interest to government agencies seeking to defend their networks against such attacks. This talk will detail several lessons learned from building and iterating on a production-deployed network defense analytic designed to identify these domains. Our initial analytic was a heuristic-based approach that focused on a relatively simple hypothesis. it performed well in operational testing with respect to false positives. However, this initial version had a substantial false negative problem that subsequently drove our development efforts for the next iteration. To develop our next version, we extended our heuristic approach and incorporated a machine learning model. Experimental testing led us to incorporate the machine learning model in a different way than initially planned, highlighting the classic balance between false negative coverage and false positives. The use of a machine learning model proved to be very valuable to strengthen the analytic and validate the hypothesis used for our heuristic approach. Although we were happy with the experimental results of the second version, we now had a false positive problem. Further complicating the matter, our analytic also had relatively serious computational shortcomings that did not allow it to keep up with the throughput of data. While we were able to develop a strategy for false positives, extensive profiling of our analytic code pointed to computational problems in our machine learning model that would be non-trivial to solve. We attempted several changes with our model but were ultimately forced to return to the drawing board and implement an entirely new model.