One Article Review

The article:
Source AlienVault Blog
Identifier 1262325
Publication date 2019-08-14 13:00:00 (viewed: 2019-08-14 16:00:42)
Title Entity extraction for threat intelligence collection
Text

Introduction

This research project is part of my Master’s program at the University of San Francisco, where I collaborated with the AT&T Alien Labs team. I would like to share a new approach to automating the extraction of key details from cybersecurity documents. The goal is to extract entities such as country of origin, industry targeted, and malware name. The data is obtained from the AlienVault Open Threat Exchange (OTX) platform:

Figure 1: The website otx.alienvault.com

The Open Threat Exchange is a crowd-sourced platform where users upload “pulses” which contain information about a recent cybersecurity threat. A pulse consists of indicators of compromise and links to blog posts, whitepapers, reports, etc. with details of the attack. The pulse normally contains a link to the full content (a blog post), together with key meta-data manually extracted from the full content (the malware family, target of the attack, etc.). Figure 2 is a screenshot of an example of a blog post that could be contained in a pulse:

Figure 2: Snippet of a blog post from “Internet of Termites” by AT&T Alien Labs

Figure 3 is a theoretical visualization of our end goal, the automated extraction of meta-data from the blog post, which can then be added to a pulse:

Figure 3: The same paragraph with entities extracted

This kind of threat intelligence collection is still manual, with a human having to read and tag the text. However, machine learning techniques can be used to extract the information of interest. We created custom named entities trained on domain-specific data to tag pulses. This helps speed up the overall process of threat intelligence collection.

Approach and Modeling

We collected the data by scraping text from all the pulse reference links on the OTX platform. We focused on HTML and PDF sources and used appropriate document parsers. However, since the sources are not consistent, we put many rule-based checks in place to clean the text. For example, we replace IP addresses and hashes with tags such as ‘IP_ADDRESS’ and ‘SHA_256’ rather than removing them, in order to preserve the word sequence and any dependencies.

Next came the large task of annotating the documents, but SpaCy’s annotation tool, Prodigy, makes the process much less painful than it used to be. Figure 4 below shows an example annotation where “Windows” is labeled as a country rather than “China” in the sentence. The confidence score for this annotation is very low, so we can reject it.

Figure 4: Example annotation from Prodigy

SpaCy's built-in Named Entity Recognition (NER) model was our first approach. The current model architecture is not published, but this video explains it in more detail. We have also built a custom bidirectional LSTM, an architecture which has gained popularity in recent years.
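As a rough illustration of the collection step described above, the sketch below lists the reference links of subscribed pulses with the AlienVault OTXv2 Python SDK. The API key placeholder and the assumption that each pulse dictionary carries a 'references' list are mine, not details given in the post.

    # Sketch (assumption): gather reference URLs from subscribed OTX pulses
    # using the OTXv2 Python SDK (pip install OTXv2).
    from OTXv2 import OTXv2

    otx = OTXv2("YOUR_OTX_API_KEY")   # placeholder API key
    pulses = otx.getall()             # all pulses the account is subscribed to

    reference_urls = []
    for pulse in pulses:
        # Each pulse is a dict; 'references' is assumed to hold the source links.
        reference_urls.extend(pulse.get("references", []))

    print(len(reference_urls), "reference links collected")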
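For the HTML sources, extracting readable text could look like the following minimal sketch using requests and BeautifulSoup; the URL is a placeholder, and the post does not say which parsers were actually used.

    # Sketch: download one HTML reference link and keep only its visible text.
    import requests
    from bs4 import BeautifulSoup

    def fetch_reference_text(url):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Remove script and style elements so only readable prose remains.
        for tag in soup(["script", "style"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)

    text = fetch_reference_text("https://example.com/some-threat-report")  # placeholder URL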
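The rule-based cleaning step, replacing indicators with fixed tags instead of deleting them so that the word sequence survives, might look roughly like this; the exact rules used by the project are not published, so these patterns are assumptions.

    # Sketch: substitute IP addresses and SHA-256 hashes with placeholder tags.
    import re

    IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    SHA256_PATTERN = re.compile(r"\b[A-Fa-f0-9]{64}\b")

    def clean_text(text):
        text = IP_PATTERN.sub("IP_ADDRESS", text)
        text = SHA256_PATTERN.sub("SHA_256", text)
        return text

    sample = "C2 at 192.168.10.5 dropped a payload with hash " + "a" * 64
    print(clean_text(sample))
    # -> C2 at IP_ADDRESS dropped a payload with hash SHA_256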
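Training spaCy's built-in NER on custom, domain-specific labels follows the standard spaCy 2.x training loop that was current when this post was written. The labels and the single annotated sentence below are illustrative placeholders, not the project's data.

    # Sketch: train a blank spaCy 2.x pipeline with custom entity labels.
    import random
    import spacy
    from spacy.util import minibatch

    TRAIN_DATA = [
        # Illustrative example only; character offsets mark the entity spans.
        ("Emotet campaigns targeted banks in Germany.",
         {"entities": [(0, 6, "MALWARE"), (35, 42, "COUNTRY")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for label in ("MALWARE", "COUNTRY", "INDUSTRY"):
        ner.add_label(label)

    optimizer = nlp.begin_training()
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, drop=0.3, sgd=optimizer, losses=losses)

    doc = nlp("New Emotet samples were observed in Germany.")
    print([(ent.text, ent.label_) for ent in doc.ents])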
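The post mentions a custom bidirectional LSTM but does not describe its architecture, so the Keras model below is only an assumed sketch of a BiLSTM token tagger for this kind of task; vocab_size and num_labels are hypothetical parameters.

    # Assumed sketch of a BiLSTM sequence tagger; the real architecture is not published.
    from tensorflow.keras import Input, Model, layers

    vocab_size = 20000   # hypothetical vocabulary size
    num_labels = 7       # hypothetical tag count, e.g. B/I tags plus O

    inputs = Input(shape=(None,), dtype="int32")  # variable-length sequences of token ids
    x = layers.Embedding(input_dim=vocab_size, output_dim=128, mask_zero=True)(inputs)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(num_labels, activation="softmax"))(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()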
Sent Yes
Tags Malware Threat
Stories
Notes


The article does not appear to have been picked up after its publication.


The article does not appear to have been taken from a previous publication.