Resources for ESORICS 2018

Introduction

The following links are provided for anyone who wants to follow the ideas in our paper "Phishing Attacks Modifications and Evolutions". All files are in JSON format except the SCL graph files and the DOM folder. The DOM folder contains three types of files: 1) screenshot files (*.png), 2) URL files (*.url), and 3) DOM files (*.html). Each URL file (*.url) has two lines: the first line is the first URL of the page, and the second line is the final URL of the page. The prefix of each file name in the DOM folder is the MD5 hash of the page's first URL. To make the data easier to work with, all links on this page other than those in the "URLs" section are lists of files in the corresponding DOM folder.
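
The naming scheme and the two-line URL file described above can be handled with a few helpers. This is a minimal sketch, not the authors' tooling: it assumes the MD5 is the lowercase hex digest of the UTF-8 encoded first URL and that the three files for a page are named exactly prefix plus extension. Neither detail is stated on this page, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def md5_prefix(first_url: str) -> str:
    # Assumption: lowercase hex MD5 of the UTF-8 encoded first URL.
    return hashlib.md5(first_url.encode("utf-8")).hexdigest()


def parse_url_file_text(text: str) -> Tuple[str, str]:
    # Line 1 is the first URL of the page, line 2 is the final URL.
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected two lines: first URL and final URL")
    return lines[0].strip(), lines[1].strip()


def read_url_file(path: str) -> Tuple[str, str]:
    # Read a *.url file from disk and return (first_url, final_url).
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return parse_url_file_text(text)


def sibling_files(dom_dir: str, prefix: str) -> Dict[str, Path]:
    # Assumption: screenshot, URL, and DOM files share the prefix and
    # differ only by extension.
    base = Path(dom_dir)
    return {ext: base / f"{prefix}.{ext}" for ext in ("png", "url", "html")}
```

If the computed prefix does not match any file in the DOM folder, the hashing assumption (encoding, or whether the URL is normalized first) is the first thing to check.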

The "Phishing URLs" link in the "URLs" section points to a nested array in the following format.
array(
    array(
        "file name",
        "URL",
        "submission date",
    ),
)
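
As an illustration, a file in this format could be loaded as follows. This is a minimal sketch: the sample values are made up, the class and function names are hypothetical, and the submission date is kept as a plain string because its format is not specified here.

```python
import json
from typing import List, NamedTuple


class PhishingRecord(NamedTuple):
    file_name: str
    url: str
    submission_date: str  # kept as text; the date format is not documented


def parse_phishing_urls(text: str) -> List[PhishingRecord]:
    # Each entry is expected to be [file name, URL, submission date].
    records = []
    for row in json.loads(text):
        if len(row) < 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        records.append(PhishingRecord(row[0], row[1], row[2]))
    return records


# Made-up sample data, for illustration only.
sample = '[["0123abcd", "http://phish.example/login", "2016-01-01"]]'
records = parse_phishing_urls(sample)
```

The same loader works for the "Legitimate URLs" file if the length check and the third field are dropped, since those entries carry only a file name and a URL.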
                                
The "Legitimate URLs" link in the "URLs" section uses the following format.
array(
    array(
        "file name",
        "URL",
    ),
    ...
)
                                
Each link in the "Clustering Results" section uses the following format.
dictionary(
    "flagged":
        array(
            array(
                "tag vector"
            ),
            ...
        ),
    "unknown":
        array(
            array(
                "tag vector"
            ),
            ...
        )
)
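
A quick way to inspect a clustering result file is to count the entries under each key. This is a minimal sketch with made-up sample data. The function name is hypothetical, and because the JSON representation of a "tag vector" is not specified here, the code only counts entries and does not interpret them.

```python
import json
from typing import Dict


def summarize_clusters(text: str) -> Dict[str, Dict[str, int]]:
    data = json.loads(text)
    summary = {}
    for key in ("flagged", "unknown"):
        groups = data.get(key, [])
        summary[key] = {
            "groups": len(groups),                 # number of inner arrays
            "items": sum(len(g) for g in groups),  # total tag vectors
        }
    return summary


# Made-up sample data; strings stand in for tag vectors.
sample = '{"flagged": [["a", "b"], ["c"]], "unknown": [["d"]]}'
summary = summarize_clusters(sample)
```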
                                

DOMs

The DOM folder is larger than 5GB, so we have divided it into several smaller files of at most 1GB each.
Phishing DOM from January 1st, 2016 to October 31st, 2017 Link1 Link2 Link3 Link4 Link5 Link6 Link7 Link8 Link9
Legitimate DOM of Sites in Alexa (original source) Link1 Link2 Link3 Link4 Link5 Link6 Link7 Link8 Link9 Link10 Link11 Link12 Link13 Link14 Link15 Link16 Link17 Link18 Link19 Link20 Link21 Link22 Link23 Link24 Link25 Link26 Link27 Link28 Link29 Link30 Link31 Link32 Link33

URLs

Phishing URLs
Legitimate URLs

Tag Corpus

Original tag corpus
Tag corpus after removing tag meta

Clustering Results

Clustering results based on our previous research (WWW17 "Tracking Phishing Attacks Over Time")
Clustering results in this paper (after removing tag meta)

Samples of the SCL Graph (Gephi Format)

Gephi software official page
The SCL graphs of cluster 0 and cluster 1