Resources for WWW2017
Introduction
The following links are provided for anyone who wants to follow the ideas in our paper "Tracking Phishing Attacks Over Time". All files are in JSON format except those in the DOM folder, which contains three types of files: 1) screenshot files (*.png), 2) URL files (*.url), and 3) DOM files (*.html). Each URL file (*.url) has two lines: the first line is the first URL of the page, and the second line is the final URL of the page. In the paper, all URLs mentioned in the "Experiment" section are final URLs. The prefix of each file name in the DOM folder is the MD5 hash generated from the first URL of the page. To make the data easier to work with, every link on this page other than those in the "URLs" section points to a list of files in the corresponding DOM folder.
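The file naming and the two-line URL file layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the URL is assumed to be UTF-8 encoded before hashing, the prefix is assumed to be the lowercase hexadecimal digest, and a file name is assumed to be exactly the prefix plus its extension. The function names md5_prefix, read_url_file, and page_files are hypothetical and are not part of the released code.

```python
import hashlib
import os


def md5_prefix(first_url):
    """Return the MD5 hex digest of the first URL of a page.

    Assumption: the URL is encoded as UTF-8 and the hex digest is used as-is.
    """
    return hashlib.md5(first_url.encode("utf-8")).hexdigest()


def read_url_file(path):
    """Read a *.url file and return (first_url, final_url).

    Line 1 holds the first URL and line 2 the final URL. If the second
    line is missing, the first URL is returned for both (an assumption).
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    first_url = lines[0].strip()
    final_url = lines[1].strip() if len(lines) > 1 else first_url
    return first_url, final_url


def page_files(dom_dir, prefix):
    """Map each extension to the path it would have in the DOM folder.

    Assumption: a file name is the prefix directly followed by the extension.
    """
    return {
        ext: os.path.join(dom_dir, prefix + "." + ext)
        for ext in ("png", "url", "html")
    }
```

For example, page_files("DOM", md5_prefix(first_url)) would give candidate paths for the screenshot, URL file, and DOM file of one page, which can then be checked for existence on disk.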
The "Phishing URLs" link in the "URLs" section points to a nested array in the following format.
array(
    array(
        "submission date",
        array(
            array( "file name", "URL" ),
            ...
        )
    ),
    ...
)
The "Legitimate URLs" link in the "URLs" section points to an array in the following format.
array(
    array( "file name", "URL" ),
    ...
)
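Since the files are in JSON format, the two layouts above can be flattened with a few lines of Python. This is a minimal sketch, assuming each array( ... ) corresponds to a JSON array; the helper names and the sample values (date string, file name, URLs) are made up for illustration and do not come from the data set.

```python
import json


def iter_phishing(entries):
    """Yield (submission_date, file_name, url) from the nested phishing list."""
    for submission_date, pages in entries:
        for file_name, url in pages:
            yield submission_date, file_name, url


def iter_legitimate(entries):
    """Yield (file_name, url) from the flat legitimate list."""
    for file_name, url in entries:
        yield file_name, url


# Hypothetical sample data in the two layouts; real files would be read
# with json.load(open(path)) instead.
sample_phishing = json.loads(
    '[["2016-01-01", [["abc123", "http://phish.example/login"],'
    ' ["def456", "http://phish.example/verify"]]]]'
)
sample_legitimate = json.loads('[["0a1b2c", "http://legit.example/"]]')

phishing_rows = list(iter_phishing(sample_phishing))
legitimate_rows = list(iter_legitimate(sample_legitimate))
```

Flattening to one row per page makes it straightforward to join the URL lists against the file lists of the DOM folder by file name.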
DOMs
The DOM folder is larger than 5GB, so we have split it into several smaller files. The maximum size of each file is 1GB.
Phishing DOM From January 1st to October 5th
Link1
Link2
Link3
Link4
Legitimate DOM of Sites in Alexa
Link1
Link2
Link3
Link4
Link5
Link6
Link7
Link8
Link9
Link10
Link11
Link12
Link13
Link14
Link15
Link16
Link17
Link18
Link19
Link20
Link21
Link22
Link23
Link24
Link25
Link26
Link27
Link28
Link29
Link30
Link31
Link32
Link33
URLs
File List of Noise Pages
The File List of Pages with 400 Level Error
The File List of Pages Taken Down
The File List of "Empty" Pages
File List of Duplicate Pages
The File List of Hash Duplicates
File List of Preprocessed Pages
The File List of Pages After Removing Hash Duplicates and Noise
Code
Python Version: 3.4
Required Library: BeautifulSoup, lxml
Noise Filter (Empty, 400 Level Error, Taken Down)
Common Module for Filter
Misc
HTML Tag Corpus
IP Record for Each Host Name
Proof of Proportional Distance