A group of researchers at Arizona State University thinks it could work!


In the paper “Darknet and Deepnet Mining for Proactive Cybersecurity Threat Intelligence”, ten researchers (Eric Nunes, Ahmad Diab, Andrew Gunn, Ericsson Marin, Vineet Mishra, Vivin Paliath, John Robertson, Jana Shakarian, Amanda Thart, and Paulo Shakarian) outline how zero-day vulnerabilities might be identified programmatically before they are used in an attack. Their approach scrapes and parses darkweb sites and forums, then applies data mining and machine learning techniques to analyze discussions where malicious code is sold for bitcoin. The abstract reads:

In this paper, we present an operational system for cyber threat intelligence gathering from various social platforms on the Internet particularly sites on the darknet and deepnet. We focus our attention to collecting information from hacker forum discussions and marketplaces offering products and services focusing on malicious hacking. We have developed an operational system for obtaining information from these sites for the purposes of identifying emerging cyber threats. Currently, this system collects on average 305 high-quality cyber threat warnings each week. These threat warnings include information on newly developed malware and exploits that have not yet been deployed in a cyber-attack. This provides a significant service to cyber-defenders. The system is significantly augmented through the use of various data mining and machine learning techniques. With the use of machine learning models, we are able to recall 92% of products in marketplaces and 80% of discussions on forums relating to malicious hacking with high precision. We perform preliminary analysis on the data collected, demonstrating its application to aid a security expert for better threat analysis.
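
The recall figures quoted in the abstract (92% for marketplace products, 80% for forum discussions) measure the share of truly relevant items that the classifier finds, while precision measures the share of flagged items that are truly relevant. A minimal sketch of both metrics follows; the function name and the counts in the example are hypothetical and are not taken from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw classification counts."""
    # Precision: of everything flagged as relevant, how much really was.
    precision = true_positives / (true_positives + false_positives)
    # Recall: of everything that really was relevant, how much got flagged.
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts, chosen only so that recall comes out at 92%.
precision, recall = precision_recall(
    true_positives=92, false_positives=10, false_negatives=8
)
```

With these made-up counts, the classifier misses 8 of 100 relevant products, which is what a 92% recall means in practice.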


The system


The system consists of three main modules built independently before integration.

  • Crawler: The crawler is a program designed to traverse the website and retrieve HTML documents.
  • Parser: A module designed to extract specific information from marketplaces and hacker forums.
    This well-structured information is stored in a relational database.
  • Classifier: A program that applies machine learning techniques, trained on an expert-labeled dataset, to detect relevant products and topics
    from marketplaces and forums. These classifiers are integrated into the parser to filter out products and topics (drugs, weapons, etc.) that are not relevant to malicious hacking.
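
The way the three modules fit together can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation. The "site" is an in-memory dictionary rather than a real network target, and the `<h2 class="listing">` markup, the training examples, and all function and class names are hypothetical. The summary above does not say which model the classifier uses, so a small multinomial Naive Bayes is swapped in here for illustration. The relational database step is omitted.

```python
from collections import Counter, deque
from html.parser import HTMLParser
import math
import re


def crawl(site, start):
    """Crawler: breadth-first traversal of an in-memory site (URL -> HTML)."""
    seen, queue, pages = {start}, deque([start]), []
    while queue:
        html = site.get(queue.popleft(), "")
        pages.append(html)
        for link in re.findall(r'href="([^"]+)"', html):
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


class ListingParser(HTMLParser):
    """Parser: collect the text of each <h2 class="listing"> (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_listing = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "listing") in attrs:
            self._in_listing = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_listing = False

    def handle_data(self, data):
        if self._in_listing:
            self.titles[-1] += data


def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Classifier: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def pipeline(site, start, classifier):
    """Crawl, parse, and keep only listings classified as relevant."""
    return [
        title
        for html in crawl(site, start)
        for title in parse_listings(html)
        if classifier.predict(title) == "relevant"
    ]


# Hypothetical expert-labeled training data and a toy two-page marketplace.
TRAIN = [
    ("zero day exploit for browser", "relevant"),
    ("banking trojan malware source", "relevant"),
    ("remote access exploit kit", "relevant"),
    ("organic cannabis seeds", "irrelevant"),
    ("handgun with ammunition", "irrelevant"),
    ("premium hash and seeds", "irrelevant"),
]
SITE = {
    "/": '<a href="/p1">1</a><a href="/p2">2</a>',
    "/p1": '<h2 class="listing">Windows remote exploit kit</h2>',
    "/p2": '<h2 class="listing">Premium cannabis seeds</h2>',
}

texts, labels = zip(*TRAIN)
relevant_listings = pipeline(SITE, "/", NaiveBayes().fit(texts, labels))
```

Filtering at the parsing stage, as the description above suggests, means that only hacking-related products and topics would ever reach storage and later analysis.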

The paper

https://arxiv.org/pdf/1607.08583v1.pdf