Perform semantically-equivalent transformations on Java code

SecurityLab-UCD/SPAT

 
 

Modifications from original SPAT

  • Now uses Java 18.
  • Adds run.sh for running SPAT. Usage is documented in the script itself.
  • Adds postprocessing.py for compiling the transformation results.

Java 18 and lib_path

The steps to run the jar file are the same as described below, except that the PathofJre argument is replaced by the path of the JDK's lib directory, for example "/usr/lib/jvm/java-18-openjdk-amd64/lib". To find this path, run whereis java and follow any symlinks to the directory that contains the actual binary. The lib directory is usually a sibling of the directory that contains the binary.

Note that using run.sh is recommended over running the jar file directly, since run.sh integrates with postprocessing.py.
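As an alternative to tracing symlinks by hand, the lib directory can be derived from the running JVM. The following is a minimal sketch, assuming the JVM that executes it is the same Java 18 installation you intend to pass to SPAT; the class name LibPathFinder is hypothetical and not part of this repository.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: prints the lib directory of the JVM that runs it.
class LibPathFinder {
    // The java.home system property points at the installation root
    // (the JDK root on Java 9 and later), so lib is a direct child of it.
    static Path libDir() {
        return Paths.get(System.getProperty("java.home"), "lib");
    }

    public static void main(String[] args) {
        Path lib = libDir();
        System.out.println(lib + (Files.isDirectory(lib) ? "" : " (not found)"));
    }
}
```

The printed path is a candidate value for the lib path argument; check it against your installation before using it.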

postprocessing.py

Running python3 postprocessing.py -h prints brief usage information.

This script organizes a benchmark's results into a jsonl file. It takes three arguments.

postprocessing.py assumes the default directory structure that SPAT produces when run with ./run.sh:

Benchmark
└── <benchmark_name>
    ├── Original
    └── transformed
        ├── _<aug_id>
        │   ├── n<original_entry_id>.java
        │   ├── n<original_entry_id>.java
        │   ...
        ├── _<aug_id>
        ...

In this case, benchmark_path would be Benchmark/<benchmark_name>. The script iterates through the transformed subdirectory of the benchmark path. For each .java file, it records an augmented entry and notes its augmentation type via the <aug_id> of the enclosing directory. The Supported Transformations section specifies an aug_id for each augmentation type.

Additionally, the script will append extra data from the metadata_jsonl argument. This file will be queried by <test_id>, and the resulting data will be added to the augmented entry. In the case of CodeSearchNet, metadata_jsonl is provided by preprocess.py.
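The directory convention above can be illustrated with a small path parser. This is a sketch only and is not the code of postprocessing.py; the class TransformedEntry and its fields are hypothetical, and it assumes the leading "_" and "n" in the directory and file names are literal prefixes as the tree suggests.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical record of one transformed file, derived only from its path.
class TransformedEntry {
    final String augId;
    final String originalEntryId;

    TransformedEntry(String augId, String originalEntryId) {
        this.augId = augId;
        this.originalEntryId = originalEntryId;
    }

    // Parses a path of the form .../transformed/_<aug_id>/n<original_entry_id>.java
    static TransformedEntry fromPath(Path javaFile) {
        Path parent = javaFile.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("missing _<aug_id> directory: " + javaFile);
        }
        String dir = parent.getFileName().toString();     // e.g. "_5"
        String file = javaFile.getFileName().toString();  // e.g. "n42.java"
        if (!dir.startsWith("_") || !file.startsWith("n") || !file.endsWith(".java")) {
            throw new IllegalArgumentException("unexpected layout: " + javaFile);
        }
        String augId = dir.substring(1);
        String entryId = file.substring(1, file.length() - ".java".length());
        return new TransformedEntry(augId, entryId);
    }

    public static void main(String[] args) {
        TransformedEntry e = fromPath(
            Paths.get("Benchmark", "demo", "transformed", "_5", "n42.java"));
        System.out.println(e.augId + " " + e.originalEntryId);
    }
}
```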

Building jar

Eclipse is used to develop and build the project. Click "File > Export" and select the option "Runnable JAR file". Use the "Noargs - RuleWriter" launch configuration and keep everything else as default. Click "Finish". The resulting .jar file should be saved in the "artifacts" folder.

SPAT Original README

Semantic-and-Naturalness Preserving Auto Transformation. This is a source-to-source transformation tool that can handle partial code snippets (programs without dependency information). The transformed code is semantically equivalent to the original and preserves its syntactic naturalness.

We have currently verified it on Windows 10.

This project is developed in "Eclipse IDE for RCP and RAP Developers". If you want to play with the code, please use the same IDE. Starting with "src/spat/RuleSelector.java" will give you a good overview of the whole project.

A runnable jar file is already provided in "artifacts".

To use this tool, run the following command:

java -jar SPAT.jar [RuleId] [RootDir] [OutputDir] [PathofJre] \& [PathofotherDependentJar]

[RuleId] is the transformation rule you want to adopt.

[RootDir] is the root directory that contains all the code snippets to be transformed. Each ".java" file is regarded as one code snippet and should contain one Java class. For method-level code snippets, users need to wrap each method in a "foo" class.

[OutputDir] is the directory path where you want to store the transformed code snippets.

[PathofJre] is the path of rt.jar (usually located in ".../jre1.x.x_xxx/lib/").

[PathofotherDependentJar] is optional; use it to specify additional dependent libraries.

For example,

java -jar .\artifacts\SPAT.jar 5 .\Benchmarks\9133\Original .\Benchmarks\9133\transformed\_5 "C:\Program Files\Java\jre1.8.0_221\lib\rt.jar"

This command transforms all Java files under ".\Benchmarks\9133\Original" using transformation rule 5, "ConditionalExp2SingleIF", and writes the results to ".\Benchmarks\9133\transformed\_5". The only dependency is rt.jar (the Java runtime).
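The [RootDir] requirement that method-level snippets be wrapped in a class can be illustrated as follows. This is a minimal sketch of one input file; the method max is a made-up example, and only the wrapper class name "foo" comes from the description above.

```java
// One input file under [RootDir]: a single method wrapped in a placeholder
// class so that the file contains exactly one Java class.
class foo {
    int max(int a, int b) {
        return a > b ? a : b;
    }
}
```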

Supported Transformations

0. LocalVarRenaming

Replace the local variables' identifiers with new non-repeated identifiers.

1. For2While

Replace the for statement with a semantically equivalent while statement.

2. While2For

Replace the while statement with a semantically equivalent for statement.
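Rules 1 and 2 move between the two loop forms. The sketch below shows a hand-written pair of equivalent loops; it illustrates the idea and is not output produced by SPAT, and the class and method names are hypothetical.

```java
// Hand-written illustration of the for/while correspondence.
class LoopFormsDemo {
    // for form: initializer, condition, and update live in the loop header
    static int sumFor(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // while form: initializer moves before the loop, update to the end of the body
    static int sumWhile(int n) {
        int total = 0;
        int i = 0;
        while (i < n) {
            total += i;
            i++;
        }
        return total;
    }
}
```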

3. ReverseIfElse

Switch the two code blocks in the if statement and the corresponding else statement.
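For the swapped blocks to keep the same behavior, the condition has to be negated as well. The following hand-written sketch assumes that is what the rule does; it is not SPAT output, and the names are hypothetical.

```java
// Hand-written illustration: swapping the branches requires negating the condition.
class ReverseIfElseDemo {
    static String before(int x) {
        if (x > 0) {
            return "positive";
        } else {
            return "non-positive";
        }
    }

    static String after(int x) {
        if (!(x > 0)) {
            return "non-positive";
        } else {
            return "positive";
        }
    }
}
```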

4. SingleIF2ConditionalExp

Change a single if statement into a conditional expression statement.

5. ConditionalExp2SingleIF

Change a conditional expression statement into a single if statement.
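Rules 4 and 5 are inverses of each other. The sketch below shows one plausible pair, assuming the if statement assigns the same variable in both branches; it is a hand-written illustration rather than SPAT output, and the names are hypothetical.

```java
// Hand-written illustration of the if statement / conditional expression pair.
class ConditionalFormsDemo {
    // if-statement form
    static int withIf(int a, int b) {
        int max;
        if (a > b) {
            max = a;
        } else {
            max = b;
        }
        return max;
    }

    // conditional-expression form
    static int withConditional(int a, int b) {
        int max = a > b ? a : b;
        return max;
    }
}
```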

6. PP2AddAssignment

Change the assignment $x++$ into $x\text{+=}1$.

7. AddAssignemnt2EqualAssignment

Change the assignment $x\text{+=}1$ into $x=x+1$.
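Rules 6 and 7 chain together: an increment becomes a compound assignment, which in turn becomes a plain assignment. A hand-written sketch of the three forms for an int variable (names are hypothetical):

```java
// Hand-written illustration of the three equivalent increment forms for int.
class IncrementFormsDemo {
    static int postIncrement(int x) {
        x++;
        return x;
    }

    static int addAssignment(int x) {
        x += 1;
        return x;
    }

    static int equalAssignment(int x) {
        x = x + 1;
        return x;
    }
}
```

Note that for byte, short, and char variables, x += 1 includes an implicit narrowing cast, whereas x = x + 1 needs an explicit one to compile.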

8. InfixExpressionDividing

Divide an infix expression into two expressions whose values are stored in temporary variables.
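A hand-written sketch of this idea, assuming the two operands are side-effect free; the temporary variable names left and right are made up, and the names SPAT generates may differ.

```java
// Hand-written illustration: operands of an infix expression moved into temporaries.
class InfixDividingDemo {
    static int before(int a, int b, int c) {
        return (a + b) * (c - a);
    }

    static int after(int a, int b, int c) {
        int left = a + b;   // first operand
        int right = c - a;  // second operand
        return left * right;
    }
}
```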

9. IfDividing

Divide an if statement with a compound condition ($\land$, $\lor$, or $\lnot$) into two nested if statements.
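For the conjunction case, the split can be sketched as below. This is a hand-written illustration that covers only &&; how SPAT handles the other operators is not shown here, and the names are hypothetical.

```java
// Hand-written illustration: a && b split into two nested if statements.
class IfDividingDemo {
    static int before(boolean a, boolean b) {
        int hits = 0;
        if (a && b) {
            hits++;
        }
        return hits;
    }

    static int after(boolean a, boolean b) {
        int hits = 0;
        if (a) {
            if (b) {
                hits++;
            }
        }
        return hits;
    }
}
```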

10. StatementsOrderRearrangement

Switch the places of two adjacent statements in a code block, where the former statement has no shared variable with the latter statement.

11. LoopIfContinue2Else

Replace the if-continue statement in a loop block with if-else statement.
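A hand-written sketch of this rewrite: the statements that follow the if-continue move into an else branch. The exact shape of SPAT's output may differ, and the names are hypothetical.

```java
// Hand-written illustration: if-continue replaced by if-else inside a loop.
class ContinueToElseDemo {
    static int sumEvensBefore(int[] values) {
        int sum = 0;
        for (int v : values) {
            if (v % 2 != 0) {
                continue;
            }
            sum += v;
        }
        return sum;
    }

    static int sumEvensAfter(int[] values) {
        int sum = 0;
        for (int v : values) {
            if (v % 2 != 0) {
                // odd value: nothing to do
            } else {
                sum += v;
            }
        }
        return sum;
    }
}
```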

12. VarDeclarationMerging

Merge the declaration statements into a single composite declaration statement.

13. VarDeclarationDividing

Divide the composite declaration statement into separated declaration statements.
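Rules 12 and 13 are inverses of each other. A hand-written sketch of the two declaration forms (names are hypothetical):

```java
// Hand-written illustration of separate versus composite declarations.
class DeclarationFormsDemo {
    static int separate() {
        int a = 1;
        int b = 2;
        return a + b;
    }

    static int merged() {
        int a = 1, b = 2;
        return a + b;
    }
}
```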

14. SwitchEqualSides

Switch the two expressions on both sides of an infix expression whose operator is $==$.

15. SwitchStringEqual

Switch the two expressions of the String.equals method, such as "123".equals(x) -> x.equals("123").
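The two call orders agree whenever the variable is non-null. The hand-written sketch below shows both forms, assuming x is never null in the transformed code; the names are hypothetical.

```java
// Hand-written illustration of the two equals call orders.
class StringEqualsDemo {
    // literal as receiver: returns false for a null argument
    static boolean literalFirst(String x) {
        return "123".equals(x);
    }

    // variable as receiver: throws NullPointerException if x is null
    static boolean variableFirst(String x) {
        return x.equals("123");
    }
}
```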

16. PrePostFixExpressionDividing

Divide the prefix or postfix expression into two separate expressions.

17. Case2IfElse

Change the Switch-Case statements into If-Else statements.
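A hand-written sketch of this rewrite for a switch without fall-through; how SPAT treats fall-through cases is not covered here, and the names are hypothetical.

```java
// Hand-written illustration: switch-case rewritten as an if-else chain.
class SwitchToIfElseDemo {
    static String before(int day) {
        switch (day) {
            case 6:
                return "Saturday";
            case 7:
                return "Sunday";
            default:
                return "weekday";
        }
    }

    static String after(int day) {
        if (day == 6) {
            return "Saturday";
        } else if (day == 7) {
            return "Sunday";
        } else {
            return "weekday";
        }
    }
}
```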

Datasets

Educoder

The Educoder code clone dataset. In the "records.txt" file, each record is a triple (file1,file2,label). A label of -1, as in (file1,file2,-1), means the pair is not a clone; any other label means it is a clone.

9133 benchmark

The 9133 benchmark is selected from the BCB benchmark. We use its 9133 instances to evaluate the syntax naturalness, applicability, and speed of each transformation rule.

Java Corpus

This dataset is used to train the Neural Probabilistic Language Model (see below).

Links to relevant repositories

  1. The Neural Probabilistic Language Model https://github.com/chiaminchuang/A-Neural-Probabilistic-Language-Model
  2. Code2vec https://github.com/tech-srl/code2vec
  3. DeepCom and Hybrid-DeepCom https://github.com/xing-hu/EMSE-DeepCom
  4. The dataset of DeepCom https://github.com/xing-hu/DeepCom
  5. ASTNN https://github.com/zhangj111/astnn
  6. TBCCD https://github.com/yh1105/datasetforTBCCD
  7. Jobfuscate https://www.duckware.com/jobfuscate/index.html

Papers

Shiwen Yu, Ting Wang, Ji Wang, "Data Augmentation by Program Transformation." Journal of Systems and Software (JSS 2022). (under JSS open science, the preprint pdf can be checked in ".\paper")

Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, Xiangke Liao, “Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding.” 44th International Conference on Software Engineering (ICSE 2022)
