Resources for WWW2017

Introduction

The following links are provided for anyone who wants to follow the ideas in our paper "Tracking Phishing Attacks Over Time". All files are in JSON format except those in the DOM folder, which contains three types of files: 1) screenshot files (*.png), 2) URL files (*.url), and 3) DOM files (*.html). Each URL file (*.url) contains two lines: the first line is the first URL of the page, and the second line is the final URL of the page. In the paper, all URLs mentioned in the "Experiment" section are final URLs. Each file name in the DOM folder is prefixed with the MD5 hash of the page's first URL. To make the data easier to work with, every link on this page other than those in the "URLs" section points to a list of files in the corresponding DOM folder.
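
As an illustration of the *.url layout and the MD5 naming rule described above, here is a minimal Python sketch. The function names are hypothetical, and two details are assumptions not stated on this page: that the URL is encoded as UTF-8 before hashing, and that the prefix is the 32-character hexadecimal digest.

```python
import hashlib
import os


def read_url_file(path):
    """Return (first_url, final_url) from a two-line *.url file."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected two URLs in {}".format(path))
    return lines[0], lines[1]


def md5_prefix(first_url):
    """Hex MD5 of the first URL (UTF-8 encoding is an assumption)."""
    return hashlib.md5(first_url.encode("utf-8")).hexdigest()


def prefix_matches(path):
    """Check whether the file name starts with the MD5 of the first URL."""
    first_url, _final_url = read_url_file(path)
    return os.path.basename(path).startswith(md5_prefix(first_url))
```

If `prefix_matches` returns False for real files, the encoding or digest format assumed here probably differs from the one used to build the dataset.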

The "Phishing URLs" link in the "URLs" section points to a nested array in the following format.
array(
    array(
        "submission date",
        array(
            array(
                "file name",
                "URL",
            ),
            ...
        )
    ),
    ...
)
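
A minimal sketch of walking this structure in Python, assuming the JSON file deserializes to nested lists in exactly the shape shown above. The function name is hypothetical, and the sample date, file name, and URL are made up for illustration; the real date format is not specified on this page.

```python
import json


def iter_phishing_entries(data):
    """Yield (submission_date, file_name, url) from the nested structure."""
    for submission_date, pages in data:
        for file_name, url in pages:
            yield submission_date, file_name, url


# Tiny made-up sample in the documented shape (not real data).
sample = json.loads(
    '[["2016-01-01", [["abc", "http://example.test/login"]]]]'
)
for row in iter_phishing_entries(sample):
    print(row)
```

Flattening the nesting this way gives one record per page, which is convenient when grouping or counting pages by submission date.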
                                
The "Legitimate URLs" link in the "URLs" section points to an array in the following format.
array(
    array(
        "file name",
        "URL",
    ),
    ...
)
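
Because each entry here is a flat pair, the list converts directly into a lookup table. The sketch below assumes the JSON text is an array of two-element arrays as shown; `load_url_index` is a hypothetical helper name, and the sample entries are invented.

```python
import json


def load_url_index(json_text):
    """Map file name -> URL from a JSON array of [file name, URL] pairs."""
    return {file_name: url for file_name, url in json.loads(json_text)}


# Invented sample in the documented shape (not real data).
sample_text = '[["f1", "http://example.test/"], ["f2", "http://example.test/about"]]'
index = load_url_index(sample_text)
print(index["f2"])
```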
                                

DOMs

The DOM folder is larger than 5GB, so we have divided it into several smaller files, each at most 1GB.
Phishing DOM From January 1st to October 5th Link1 Link2 Link3 Link4
Legitimate DOM of Sites in Alexa Link1 Link2 Link3 Link4 Link5 Link6 Link7 Link8 Link9 Link10 Link11 Link12 Link13 Link14 Link15 Link16 Link17 Link18 Link19 Link20 Link21 Link22 Link23 Link24 Link25 Link26 Link27 Link28 Link29 Link30 Link31 Link32 Link33
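
Once the parts have been downloaded and their contents placed in a single directory, the three files belonging to one page can be grouped by their shared name. This is only a sketch: it assumes each file is named as a common prefix followed directly by `.png`, `.url`, or `.html`, which this page does not state explicitly, and the function names are hypothetical.

```python
import os
from collections import defaultdict

EXTENSIONS = (".png", ".url", ".html")


def group_dom_files(folder):
    """Group files in a DOM folder by name stem: {stem: {ext: path}}."""
    groups = defaultdict(dict)
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() in EXTENSIONS:
            groups[stem][ext.lower()] = os.path.join(folder, name)
    return dict(groups)


def incomplete_pages(groups):
    """Return stems that are missing at least one of the three file types."""
    return sorted(
        stem for stem, files in groups.items() if len(files) < len(EXTENSIONS)
    )
```

Running `incomplete_pages(group_dom_files(folder))` is a quick way to spot a part that was not downloaded or unpacked completely.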

URLs

Phishing URLs
Legitimate URLs

File List of Noise Pages

The File List of Pages with 400-Level Errors
The File List of Pages Taken Down
The File List of "Empty" Pages

File List of Duplicate Pages

The File List of Hash Duplicates
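
This page does not say which hash or which input defines a "hash duplicate". As a rough illustration only, the sketch below assumes duplicates are DOM files whose raw bytes have the same MD5 digest; the paper's actual definition may differ, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash(path):
    """MD5 of a file's raw bytes, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_hash_duplicates(paths):
    """Return groups (lists) of paths whose contents hash identically."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[content_hash(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```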

File List of Preprocessed Pages

The File List of Pages After Removing Hash Duplicates and Noise
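
Conceptually, this list is what remains after the noise and duplicate lists are subtracted from the full set of pages. The sketch below shows that subtraction under the assumption that every file list is a flat JSON array of file names and that every name in a removal list should be dropped; neither detail is confirmed on this page, and `remaining_files` is a hypothetical helper.

```python
def remaining_files(all_files, *removed_lists):
    """Return all_files minus every name in removed_lists, keeping order."""
    removed = set()
    for names in removed_lists:
        removed.update(names)
    return [name for name in all_files if name not in removed]


# Invented file names for illustration.
everything = ["a", "b", "c", "d"]
noise = ["b"]
duplicates = ["d"]
print(remaining_files(everything, noise, duplicates))
```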

Code

Python Version: 3.4
Required Libraries: BeautifulSoup, lxml
Noise Filter (Empty, 400 Level Error, Taken Down)
Common Module for Filter
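
The linked code is the reference for how noise pages are filtered. For readers who only want a feel for what an "empty" page check might involve, here is an independent sketch that uses the standard library's `html.parser` rather than BeautifulSoup or lxml. The criterion (no visible text outside `script`, `style`, and `title`) and all names are assumptions, not the authors' filter.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside script, style, or title elements."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_empty(html, min_chars=1):
    """True if the page has fewer than min_chars of visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

Note that a page consisting only of images would also count as empty under this rule, so a real filter would likely need additional checks.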

Misc

HTML Tag Corpus
IP Record for Each Host Name
Proof of Proportional Distance