This repository has been archived by the owner on Jun 16, 2019. It is now read-only.

Interactivity or scripting #20

Open
pgoodman opened this issue Mar 30, 2018 · 9 comments
Labels
question Further information is requested

Comments

@pgoodman
Collaborator

This is more of a braindump for a longer-term way of using fcd. When I look at decompiled code, one thing I notice is that there are lots of libc bringup routines (e.g. __libc_start_main) that could probably be elided, but coming up with a good policy for what to elide and when is not straightforward.

Another thing that comes up is how to specify things like headers to fcd in order for it to do a better job with decompilation. Interactivity has come up as an option, and I think that might be how fcd originally did things.

I think that a nice alternative might be a more scripting-oriented approach. It would be similar-ish to interactive, but permit more re-use down the line. For things like libc stuff, you can have a file like linux.py or ELF.py that just "does the right thing" for eliding stuff. Scripting may also enable things like specifying headers.

I'm not sure whether scripting should be done by embedding a Python interpreter (this is what PointsTo did, and it worked reasonably well; it would mean adding a new command-line argument such as --script), or by making fcd into a Python module (a bit harder, but it might make it easier to integrate with other tools).

Any thoughts?

@pgoodman pgoodman added the question Further information is requested label Mar 30, 2018
@surovic

surovic commented Mar 31, 2018

I don't have much experience with integrating Python interpreters into C++ projects or the other way around, but the linux.py and ELF.py approach sounds good. It's actually how the Mach-O and, I think, PE binary formats are supported. ELF is hardcoded in fcd.

I think one could actually get something like "function filtering" to work by delegating some of the entry point discovery from fcd to the scripts: discover some entry points in the .py scripts, omit stuff like __libc_start_main, and pass the remaining entry point addresses to fcd for lifting and decompilation.

The issue I see with this approach is that some entry points are discovered by Remill during recursive descent disassembly, and I'm not sure whether that would reintroduce entry points that the scripts filtered out. Then again, the script could also pass a list of filtered entry points to fcd and have fcd omit them as well.
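
A minimal sketch of that filtering idea, assuming the script already has a name-to-address symbol map. The names `LIBC_BRINGUP` and `split_entry_points` and the sample addresses are hypothetical and not part of fcd. The function returns both lists, so the elided addresses could also be handed to fcd to keep recursive descent from reintroducing them.

```python
# Hypothetical helper, not part of fcd: partition discovered symbols into
# entry points to decompile and entry points to elide.
LIBC_BRINGUP = frozenset({
    "_start",
    "__libc_start_main",
    "__libc_csu_init",
    "__libc_csu_fini",
})


def split_entry_points(symbols, elide=LIBC_BRINGUP):
    """Split a {name: address} map into (kept, elided) address lists.

    Both lists are ordered by address.
    """
    kept, elided = [], []
    for name, addr in sorted(symbols.items(), key=lambda kv: kv[1]):
        if name in elide:
            elided.append(addr)
        else:
            kept.append(addr)
    return kept, elided


# Made-up symbol table for illustration only.
symbols = {
    "_start": 0x400400,
    "__libc_start_main": 0x400410,
    "main": 0x400526,
    "helper": 0x400560,
}
kept, elided = split_entry_points(symbols)
```

Keeping the elide set as data rather than logic is what would let a file like linux.py carry the policy for a whole platform.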

@surovic

surovic commented Mar 31, 2018

To add, I think the scripting approach is strictly better than the interactive one. But that's just my opinion.

@pgoodman
Collaborator Author

pgoodman commented Mar 31, 2018

This raises one question for me, which is: should the main binary loading / parsing be done by C++ code? If we made fcd's C++ side cooperate with a Python side, then we could bring in third-party packages like Angr's cle to load in binary images, and have the C++ side actually invoke CLE to do the reading. I envision something like microx, where a class is provided that can be extended, and the extension implements methods for reading virtual memory, etc. This would then generalize to handling actual process memory dumps.

@surovic

surovic commented Mar 31, 2018

Yeah, this sounds pretty good. And I think fcd actually has some support for this already, from glancing over fcd/scripts and fcd/fcd/executables. I think the idea there is that a Python script needs to provide certain functions, like a function to translate virtual addresses and maybe others, to a C++ class. But I bet this could be modified to better suit things like Angr's cle.

@surovic

surovic commented Mar 31, 2018

I can also imagine *.py scripts being very useful in scenarios with packed and / or encrypted executables.

@pgoodman
Collaborator Author

So maybe something like...

import sys

import cle
import fcd

# Memory abstraction that will let the decompiler read memory. You could
# implement Memory here by invoking APIs from cle, Binary Ninja, IDA Pro, etc.
# You could also provide info to fcd from a McSema-lifted CFG file, which contains
# rich info.
class ExecutableMemory(fcd.ExecutableMemory):
  def __init__(self, ld):
    self.ld = ld

  def read(self, addr, num_bytes):
    # do something with self.ld, returning a list, tuple, or bytearray
    raise NotImplementedError

# sys.argv[0] is the script itself; the binary to load is the first argument.
ld = cle.Loader(sys.argv[1])
memory = ExecutableMemory(ld)
decomp = fcd.Decompiler(memory)

decomp.add_entrypoint(0xf00, name="main")
# Fill in other named entrypoints from ld

# Maybe bring in Angr's CFGFast to invoke other APIs,
# e.g. decomp.mark_as_function() or something. Down
# the line, having the ability to mark indirect xrefs would
# be nifty.

# Now lift to bitcode
bc = decomp.lift()

# Show me the bitcode!
bc.dump(address=0xf00)
bc.dump(name="main")

# Eventually we could implement the emulator test suite
# via whatever bc is, e.g. bc.execute(cpu), where cpu is
# an object of a class implementing methods like
# read_register and read_memory.

bc.set_calling_convention(...)

bc.decompile(address=0xf00)
bc.decompile(name="main")
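
As a rough illustration of what the `read` method above might do, here is a self-contained sketch backed by plain in-memory segments instead of cle. `SegmentMemory` is a hypothetical stand-in for the proposed `fcd.ExecutableMemory` subclass, and the segment layout and error behaviour are assumptions.

```python
class SegmentMemory:
    """Hypothetical stand-in for an fcd.ExecutableMemory subclass."""

    def __init__(self, segments):
        # segments: iterable of (base_address, bytes-like) pairs.
        self._segments = [(base, bytes(data)) for base, data in segments]

    def read(self, addr, num_bytes):
        """Return num_bytes bytes starting at virtual address addr."""
        for base, data in self._segments:
            offset = addr - base
            # The whole requested range must fit inside one segment.
            if 0 <= offset and offset + num_bytes <= len(data):
                return data[offset:offset + num_bytes]
        raise ValueError(f"unmapped read of {num_bytes} bytes at {addr:#x}")


# Made-up segments for illustration only.
mem = SegmentMemory([
    (0x1000, b"\x55\x48\x89\xe5"),
    (0x2000, b"hello"),
])
chunk = mem.read(0x1001, 2)
```

Because the decompiler would only ever call `read`, the same interface could sit on top of a loader, a disassembler database, or a process memory dump.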

@surovic

surovic commented Apr 1, 2018

I think your example looks good, but it's also the reverse of what fcd currently does. Currently, fcd uses Python to parse executables. For example:

import pefile
import bisect

# helper globals
stubs = {}
sectionStart = []
sectionInfo = {}

# fcd interface below (I assume this is what fcd's C++ Executable class requires)

executableType = "Portable Executable"
targetTriple = "unknown-unknown-win32"
entryPoints = []

def init(data):
  # fill stubs, sectionStart, sectionInfo, ...
  pass

def getStubTarget(target):
  # returns the target of a stub function (library functions, etc)
  pass

def mapAddress(address):
  # maps virtual addresses to actual addresses in the binary
  pass

The above script is then passed to fcd via a command-line flag, for example $ fcd -f scripts/pe.py pefile.exe, and during lifting, fcd calls the functions from the script to resolve stub targets, virtual addresses and what have you. fcd then does the actual reading of binary data on its own.
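
To make the `mapAddress` hook more concrete, here is one way it could be written with `bisect` over sorted section starts. This is a sketch under assumptions, not fcd's actual pe.py: the `SECTIONS` table is made up, the layout of `sectionInfo` is a guess, and returning `None` for unmapped addresses is an assumed convention.

```python
import bisect

# Hypothetical section table: (virtual_address, size, file_offset).
# A real script would fill this in from the parsed headers inside init().
SECTIONS = [
    (0x401000, 0x200, 0x400),
    (0x402000, 0x100, 0x600),
]

sectionStart = sorted(va for va, _, _ in SECTIONS)
sectionInfo = {va: (size, offset) for va, size, offset in SECTIONS}


def mapAddress(address):
    """Map a virtual address to a file offset, or None if it is unmapped."""
    # Find the last section that starts at or before the address.
    index = bisect.bisect_right(sectionStart, address) - 1
    if index < 0:
        return None
    start = sectionStart[index]
    size, file_offset = sectionInfo[start]
    delta = address - start
    if delta >= size:
        return None  # falls in the gap after this section
    return file_offset + delta
```

The binary search keeps each lookup logarithmic in the number of sections, which matters if the hook is called for every address touched during lifting.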

In your example, it seems to me that fcd would be more of a library with Python bindings than a standalone executable. I'm not opposed to that, but I assume it would be a bit more work. That being said, a C++ library with Python bindings seems to be the way a lot of projects go nowadays, so why do something different?

@pgoodman
Collaborator Author

pgoodman commented Apr 1, 2018

I think library-ifying it is something I could pull together in a reasonable amount of time. It'd be pretty cool to expose fcd to Binary Ninja, for example.

@surovic

surovic commented Apr 1, 2018

> It'd be pretty cool to expose fcd to Binary Ninja, for example.

That I completely agree with.
