Summary of On Training a Neural Network to Explain Binaries, by Alexander Interrante-Grant et al.


On Training a Neural Network to Explain Binaries

by Alexander Interrante-Grant, Andy Davis, Heather Preslier, Tim Leek

First submitted to arXiv on: 30 Apr 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Cryptography and Security (cs.CR); Software Engineering (cs.SE)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper explores the idea of training a deep neural network to understand binary code, taking features derived from binaries as input and outputting English descriptions of functionality. The goal is to aid reverse engineers in investigating closed-source software, whether malicious or benign. Building on recent successes applying large language models to source code summarization, the authors create their own dataset from Stack Overflow containing 1.1M entries. They introduce a novel dataset evaluation method, Embedding Distance Correlation (EDC), which measures the correlation between pairwise distances in the input embedding space and the corresponding distances in the output embedding space. The EDC test proves to be diagnostic, indicating that the collected dataset and several existing datasets are of low quality. The authors apply EDC to known good and known bad datasets, finding it to be a reliable indicator of dataset value.
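The core idea of EDC, as described above, can be sketched in a few lines: embed the inputs and outputs, compute pairwise distances in each space, and correlate the two distance lists. The distance metric (cosine) and correlation statistic (Spearman) below are illustrative assumptions, not necessarily the paper's exact choices:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def edc_score(input_embeddings: np.ndarray, output_embeddings: np.ndarray) -> float:
    """Rough EDC sketch: correlate pairwise distances across two embedding spaces.

    input_embeddings:  (n, d_in)  embeddings of the n dataset inputs
    output_embeddings: (n, d_out) embeddings of the n paired outputs
    Returns a correlation in [-1, 1]; higher suggests inputs that are
    similar tend to have similar outputs, i.e. a more learnable dataset.
    """
    # Condensed vectors of all n*(n-1)/2 pairwise cosine distances
    d_in = pdist(input_embeddings, metric="cosine")
    d_out = pdist(output_embeddings, metric="cosine")
    # Rank correlation between the two distance structures (assumed choice)
    rho, _ = spearmanr(d_in, d_out)
    return float(rho)

# Toy check: a space correlated with itself scores 1.0
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
print(edc_score(x, x))
```

Intuitively, a low score means nearby inputs map to unrelated outputs, so a model has little consistent signal to learn from, which matches the paper's use of EDC to flag low-quality datasets.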
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper tries to teach computers to understand binary code by looking at features from the code and writing English descriptions of what it does. The aim is to help people who reverse-engineer software figure out what is going on inside, whether that software is good or bad. Building on some recent successes with big language models, the authors made their own dataset from Stack Overflow with over a million entries. They also came up with a new way to test the quality of datasets, called Embedding Distance Correlation (EDC). It checks whether inputs that are similar to each other have outputs that are also similar to each other. They found that EDC works well at telling good datasets from bad ones.

Keywords

» Artificial intelligence  » Embedding  » Neural network  » Summarization