S&P 2020 Resources

Resources for S&P

Introduction

The following links are provided for anyone who wants to follow the ideas in our paper "Proactive Detection of Phishing Kit Traffic". All dataset files are in JSON format, and all source code files are written in Python.

The "Legitimate Email Dataset" link in the "Dataset" section points to a file organized in the following format.

dictionary{
    'training': array("training_email_content1", ...),
    'validation': array("validation_email_content1", ...),
    'test': array("test_email_content1", ...)
}
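As a minimal sketch of reading this layout, the snippet below counts the emails in each split. It runs on a tiny made-up stand-in with the same three keys; the helper name `summarize_legitimate` and the file name mentioned in the comment are hypothetical and not part of the released resources.

```python
import json


def summarize_legitimate(dataset):
    """Return the number of emails in each split of the dictionary."""
    return {split: len(dataset[split])
            for split in ("training", "validation", "test")}


# Made-up stand-in with the same shape as the described file.
# For the real download, something like the following should work
# (the file name here is an assumption):
#     with open("legitimate_email_dataset.json") as f:
#         dataset = json.load(f)
sample_json = (
    '{"training": ["email a", "email b"], '
    '"validation": ["email c"], '
    '"test": ["email d"]}'
)
dataset = json.loads(sample_json)

counts = summarize_legitimate(dataset)
print(counts)
```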

The "Phishing Exfiltrating Email Dataset" link in the "Dataset" section points to a file organized in the following format. Under "test", the data with the key "original" is the original test set without any injection or replacement. The data with keys of the form "xxx_inject" is used in the injection attack experiments, and the data with keys of the form "xxx_replace" is used in the replacement attack experiments.

dictionary{
    'training': array(array("training_structure_token1", ... ), 
                array("training_semantic_feature1", ... )),
    'validation': array(array("validation_structure_token1", ... ), 
                  array("validation_semantic_feature1", ... )),
    'test': dictionary{
        'original': array(array("test_structure_token1", ... ),
                    array("test_semantic_feature1", ... )),
        'line_inject': array(array("injection_ratio",
                       array("line_inject_structure_token1", ... ),
                       array("line_inject_semantic_feature1", ... )), ...),
        'head_inject': array(array("injection_ratio",
                       array("head_inject_structure_token1", ... ),
                       array("head_inject_semantic_feature1", ... )), ...),
        'middle_inject': array(array("injection_ratio",
                         array("middle_inject_structure_token1", ... ),
                         array("middle_inject_semantic_feature1", ... )), ...),
        'end_inject': array(array("injection_ratio",
                      array("end_inject_structure_token1", ... ),
                      array("end_inject_semantic_feature1", ... )), ...),
        'word_replace': array(array("injection_ratio",
                        array("word_replace_structure_token1", ... ),
                        array("word_replace_semantic_feature1", ... )), ...),
        'non_word_replace': array(array("injection_ratio",
                            array("non_word_replace_structure_token1", ... ),
                            array("non_word_replace_semantic_feature1", ... )), ...),
        'all_replace': array(array("injection_ratio",
                       array("all_replace_structure_token1", ... ),
                       array("all_replace_semantic_feature1", ... )), ...)
    }
}
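The nested "test" entry can be walked with a small generator, sketched below under stated assumptions. It assumes the "original" entry is a pair (structure tokens, semantic features) and every other entry is a list of triples (ratio, structure tokens, semantic features), as the format above suggests. The helper name `iter_test_variants`, the sample values, and the numeric type of the ratio are illustrative assumptions, not taken from the released files.

```python
import json


def iter_test_variants(dataset):
    """Yield (name, ratio, structure_tokens, semantic_features) per test variant.

    The ratio is None for the unmodified "original" test set.
    """
    for name, value in dataset["test"].items():
        if name == "original":
            structure, semantic = value
            yield name, None, structure, semantic
        else:
            # Each attack variant holds one triple per ratio.
            for ratio, structure, semantic in value:
                yield name, ratio, structure, semantic


# Made-up stand-in with the same nesting as the described file.
dataset = json.loads("""
{
  "training": [["tok1"], ["feat1"]],
  "validation": [["tok2"], ["feat2"]],
  "test": {
    "original": [["tok3"], ["feat3"]],
    "line_inject": [[0.1, ["tok4"], ["feat4"]], [0.2, ["tok5"], ["feat5"]]],
    "word_replace": [[0.1, ["tok6"], ["feat6"]]]
  }
}
""")

rows = list(iter_test_variants(dataset))
for name, ratio, structure, semantic in rows:
    print(name, ratio, len(structure), len(semantic))
```

Flattening the variants this way lets an evaluation loop treat the original set and every attack setting uniformly, with the ratio available for grouping results.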

Each link in the "Source Code" section points to either a Python file or a Jupyter notebook file.

Dataset

Due to privacy policy, we provide only the encoded data of the phishing exfiltrating email content.
Phishing Exfiltrating Email Dataset
Legitimate Email Dataset

Source Code

Source Code of DeepPK
Source Code of Other Models