Resources for ICWE2019

Introduction

The following links are provided for anyone who wants to follow the ideas in our paper "Domain Classifier: Compromised Machines versus Malicious Registrations".

Train and Test Datasets

The following are the train and test data sets we used in our experiments. The files below are in Python dictionary format and can be read using Python's ``eval'' function. The data format is described below.

{
	'info': domain name and domain id (MD5 hash), 
	'feat_labels': list of feature names, 
	'feat': feature value corresponding to feature label, 
	'target': whether the domain is compromised (1) or maliciously registered (0)
}
Train
Test
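
Because ``eval'' executes whatever code a file contains, a more cautious way to read these files is Python's ``ast.literal_eval'', which accepts only literals. The sketch below is a minimal illustration, not the code used in the paper. It assumes that the text holds a single dictionary literal and that 'feat_labels' and 'feat' are parallel lists. The sample record, its feature names and its placeholder id are invented for the example.

```python
import ast


def parse_record(text):
    """Parse one dictionary literal without executing arbitrary code."""
    record = ast.literal_eval(text)
    if not isinstance(record, dict):
        raise ValueError("expected a dictionary literal")
    return record


def features_by_name(record):
    """Pair each feature name with its value.

    Assumption: 'feat_labels' and 'feat' are sequences of equal length.
    """
    return dict(zip(record['feat_labels'], record['feat']))


# Hypothetical record; the real feature names and layout of 'info' may differ.
sample = (
    "{'info': ('example.com', 'placeholder-md5'), "
    "'feat_labels': ['alexa_rank', 'freenom_tld'], "
    "'feat': [1200, 0], "
    "'target': 1}"
)

record = parse_record(sample)
print(features_by_name(record))
print('compromised' if record['target'] == 1 else 'malicious')
```

To use it on a downloaded file, read the file contents into a string and pass that string to ``parse_record''.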

Features

The following are the resources we used for processing the domains and building some of the features. Specifically, we used the list of ``Free web-hosting services'' below to detect domains hosted on those services, as the first layer in our testing phase. We used the other lists below to define our ``Freenom TLD'' features, ``Partial match of brand name'' feature, and ``Alexa rank'' feature, respectively. All the files are in JSON format, with the exception of ``Alexa rank'', which is plain text with one entry per line and `,' as the delimiter.
Free web-hosting services
Freenom TLDs
Brand names
Alexa top 1 million
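
As a rough sketch of how the ``Alexa rank'' file could be turned into a lookup table, the code below parses comma-delimited lines into a dictionary. It assumes each line has the form rank,domain, which is the usual layout of the Alexa top 1 million list but is not confirmed for this particular file. The function name and the sample lines are illustrative only.

```python
import csv
import io


def load_alexa_ranks(lines):
    """Build a domain -> rank mapping from comma-delimited lines.

    Assumption: each line is 'rank,domain' with no header row.
    """
    ranks = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank, domain = row
        ranks[domain.strip().lower()] = int(rank)
    return ranks


# Illustrative input; a real file object opened in text mode works the same way.
sample = io.StringIO("1,google.com\n2,youtube.com\n")
ranks = load_alexa_ranks(sample)
print(ranks.get('youtube.com'))
print(ranks.get('not-in-list.example'))
```

A domain absent from the mapping has no rank in the list, so ``ranks.get'' returns None for it.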

Phishing Database

The following are the phishing URLs we used in our analysis, which we collected from January 1st, 2016 to January 20th, 2019. The file below is a Python list and can be read using Python's ``eval'' function. The data format is described below.

[
	first URL ID (MD5 hash), 
	first URL, 
	final URL (due to redirection), 
	report date
]
Phishing URLs
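
One possible way to work with this file is to give the four positions names, which makes it easier to pick out, for example, the URLs that were redirected. The sketch below assumes that the file is a list whose items are four-element lists in the order shown above. The ``PhishingEntry'' type, the field names, the sample ids and the date format are all invented for the illustration.

```python
import ast
from collections import namedtuple

# Hypothetical field names for the four positions described above.
PhishingEntry = namedtuple(
    'PhishingEntry', ['url_id', 'first_url', 'final_url', 'report_date']
)


def parse_entries(text):
    """Parse a list literal of four-element entries into named tuples."""
    raw = ast.literal_eval(text)
    return [PhishingEntry(*item) for item in raw]


def redirected(entries):
    """Keep the entries whose final URL differs from the first URL."""
    return [e for e in entries if e.first_url != e.final_url]


# Illustrative data only; ids and dates are placeholders.
sample = (
    "[['id-1', 'http://a.example/login', 'http://a.example/login', '2016-01-01'],"
    " ['id-2', 'http://b.example/x', 'http://c.example/y', '2019-01-20']]"
)

entries = parse_entries(sample)
for entry in redirected(entries):
    print(entry.url_id, entry.first_url, '->', entry.final_url)
```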