[Request] lazy backend memory reader #345
Comments
It sounds very useful and easy to implement, but we do not have the manpower right now to implement it as a priority feature. Do you know enough about cle to implement it yourself? We can offer help as needed!
Sadly I don't know much about cle. I am willing to try though.
Maybe you can already do what you want! It seems to me that …
My process address space is not large enough. The dump is really big. I need to be able to "hook" the read functionality so only needed memory is read.
What file format are you doing this for?
Currently this is a minidump. However, I want that to be abstracted away so I can use it with live process memory.
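A sketch of the abstraction being described: the backend would only ever see a `read(addr, size)` callable, so whether the bytes come from a minidump or a live process stays invisible to it. Everything below is illustrative (the function names and the dump-offset lookup are assumptions, not an existing cle or minidump-library API):

```python
# Hypothetical: two interchangeable byte sources behind one read(addr, size) callable.

def make_live_process_reader(pid):
    # Linux-only illustration: /proc/<pid>/mem supports positioned reads,
    # given ptrace permission on the target process.
    mem = open(f"/proc/{pid}/mem", "rb", buffering=0)

    def read(addr, size):
        mem.seek(addr)
        return mem.read(size)

    return read

def make_minidump_reader(path):
    # Same shape for a dump: translate addr to a file offset within the
    # dump's recorded memory ranges, then seek and read. The range lookup
    # is deliberately elided here.
    dump = open(path, "rb")

    def read(addr, size):
        raise NotImplementedError("map addr to a dump offset, then read from `dump`")

    return read
```

Either callable could then be handed to the lazy backend this issue asks for.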
So it looks, from my incredibly brief interrogation, that the library we use to parse minidump files does in fact support the interface you're looking for. What you will need to do is create a … Be warned that because you are storing a file descriptor, this will leak file descriptor references (cle is entirely designed to have zero file descriptors open after loading is done; there used to be a …
In terms of live process memory, you probably want to look into symbion. I'm not a huge fan of its design, but there are a lot of people using it.
I don't want to specify segments beforehand. I want one large segment that starts at address 0 with a size of 0xffffffffffffffff, where every read will let me run custom code. It needs to be a new …
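For concreteness, a minimal sketch of that shape, assuming cle's `Clemory` can be subclassed with `load` as the read entry point (the constructor and hook point here are assumptions; the branch linked below takes its own approach):

```python
from cle.memory import Clemory

class HookedClemory(Clemory):
    """Hypothetical: a flat address space whose reads run user code."""

    def __init__(self, arch, read_cb):
        super().__init__(arch)
        self._read_cb = read_cb  # read_cb(addr, size) -> bytes

    def load(self, addr, n):
        # No backer lookup at all: every read goes straight to the callback.
        return self._read_cb(addr, n)
```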
oh. uh, I guess that's technically something we can do, though it will mess up a huge amount of the static analysis, which assumes that it can enumerate a list of mapped addresses. Let me put something together for you.
@rhelmot that will be incredible, thank you so much!
Take a look at this! https://github.com/angr/cle/compare/feat/lazy
I have not tested it, even to make sure it imports, but it should be the right framework for what you want to do.
I'm not really sure how to use this.
Also, it looks like on Line 354 in a77bcdc it should be `isinstance(backer, Clemory)` and not `type(backer) is Clemory` (see the short illustration after this comment).
When I use this code …
It does not work (I would expect the load to return a concrete value). However, it looks like even the …
I would be glad to have more assistance :)
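The distinction matters because `type(x) is C` only matches the exact class, while `isinstance` also accepts subclasses, and a lazy memory would presumably be a `Clemory` subclass. A self-contained illustration with stub classes (not cle's real ones):

```python
class Clemory:                  # stand-in for cle.memory.Clemory
    pass

class LazyClemory(Clemory):     # stand-in for a lazy subclass
    pass

backer = LazyClemory()
print(type(backer) is Clemory)      # False: exact-type check rejects subclasses
print(isinstance(backer, Clemory))  # True: isinstance accepts subclasses
```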
I was really hoping you would be able to take it from here... nonetheless, I have pushed more changes to the branch such that your example now works.
Seems like the …
It appears to enter an infinite loop in line 139 with …
Also, is there a way to make the whole 64-bit address space available (and not just the lower half)? I have tried replacing the …
There is probably a bug somewhere in the stuff I wrote. Feel free to make whatever changes you need to make it work; I am out of cycles to help with this.
You will have a very bad time getting angr to accept you mapping the entire 64-bit address space. angr needs to map some additional object files into free slots of the memory map in order to support things like call_state and simprocedures, and it will complain very loudly if it can't do that.
Hey, I worked on it a bit in my branch here.
Some thoughts: …
Are you looking to contribute this back upstream at some point?
I do want to contribute this upstream, yes.
Hey,
I want a backend class which will allow me to let the cle engine (and angr) get the needed bytes lazily. This way I will be able to have a sparse address space (similar to the minidump) without loading all of it beforehand.
I want this backend to have one parameter: a `read` function with address and size parameters. When angr or cle wants to read memory, it will read from the cached copy if one exists; otherwise the `read` function will be called for that address and a cached copy will be created. When angr (cle) wants to write somewhere, the original bytes will be read into a cached copy and then the write operation will happen. If the cache already exists, angr will just write there. (A minimal sketch of this behavior follows below.)
I have a huge minidump file, and loading all the segments at initialization causes an out-of-memory error in my Python process.
In the end I want to use the angr engine with this backend and use the common, normal functionality (like `explore` and such). Thanks in advance, and looking forward to hearing your opinion on this :)
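A minimal sketch of the caching behavior described above, in plain Python (the class name, `load`/`store` entry points, and page granularity are all assumptions for illustration, not cle's API):

```python
class LazyCachedMemory:
    """Read-through cache over a read(addr, size) callback, page by page."""

    PAGE_SIZE = 0x1000

    def __init__(self, read_cb):
        self._read_cb = read_cb  # read_cb(addr, size) -> bytes
        self._pages = {}         # page base address -> bytearray

    def _page(self, base):
        # Fetch and cache a page on first touch.
        if base not in self._pages:
            self._pages[base] = bytearray(self._read_cb(base, self.PAGE_SIZE))
        return self._pages[base]

    def load(self, addr, n):
        out = bytearray()
        while n > 0:
            base = addr & ~(self.PAGE_SIZE - 1)
            off = addr - base
            chunk = min(n, self.PAGE_SIZE - off)
            out += self._page(base)[off:off + chunk]
            addr += chunk
            n -= chunk
        return bytes(out)

    def store(self, addr, data):
        # Writes pull the original bytes into the cache first (via _page),
        # then land in the cached copy, as described above.
        while data:
            base = addr & ~(self.PAGE_SIZE - 1)
            off = addr - base
            chunk = min(len(data), self.PAGE_SIZE - off)
            self._page(base)[off:off + chunk] = data[:chunk]
            addr += chunk
            data = data[chunk:]
```

With the live-process reader sketched earlier, `LazyCachedMemory(make_live_process_reader(pid))` would pull each 4 KiB page from the target only on first access.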