This repository has been archived by the owner on Sep 1, 2021. It is now read-only.

JPEG parser #5

Open
akvadrako opened this issue May 22, 2011 · 11 comments

Comments

@akvadrako

This parses EXIF and JFIF files. This is my first construct, and while making it I noticed I was missing a couple of things:

  1. a PascalString that includes the size bytes in the length - this is fairly common in protocols. That is what the "- 2" below is for.
  2. Embed() on a Switch() doesn't work very well - because it discards the non-struct substructs.
  3. FastReader() below improves speed by about 10X.

class FastReader(Construct):
def _parse(self, stream, context):
return stream.read()

def _build(self, obj, stream, context):
    stream.write(obj)

SegBody = Struct(None,
UBInt16('size'),
Field('data', lambda ctx: ctx['size'] - 2),
)

Seg = Struct('seg',
Literal('\xff'),
Byte('kind'),
Switch('body', lambda c: c['kind'],
{
SOS: FastReader('data'),
},
default = Embed(SegBody),
)
)

JPEG = Struct('jpeg',
Literal('\xff\xd8'),
GreedyRange(Seg),
)

@akvadrako
Author

and formatted more nicely:

class FastReader(Construct):
    def _parse(self, stream, context):
        return stream.read()

    def _build(self, obj, stream, context):
        stream.write(obj)

SegBody = Struct(None,
        UBInt16('size'),
        Field('data', lambda ctx: ctx['size'] - 2),
    )   

Seg = Struct('seg',
        Literal('\xff'),
        Byte('kind'),
        Switch('body', lambda c: c['kind'],
            {
                SOS: FastReader('data'),
            },  
            default = Embed(SegBody),
            )
        )   

JPEG = Struct('jpeg',
        Literal('\xff\xd8'),
        GreedyRange(Seg),
        )
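
For readers without the library at hand, the layout that the construct above describes can be sketched with the standard library alone. This is only an illustration, not part of the submitted code: `parse_jpeg_segments` is a hypothetical helper, and the value `SOS = 0xDA` (start of scan) is assumed from the JPEG specification, since the thread never defines `SOS`.

```python
import io
import struct

SOS = 0xDA  # start-of-scan marker code (assumed from the JPEG spec, not from the thread)


def parse_jpeg_segments(data):
    """Split a JPEG byte string into (kind, payload) pairs.

    Mirrors the construct above: each segment is 0xFF, a kind byte, and a
    big-endian 16-bit size that counts its own two bytes. The SOS segment
    is read greedily to the end of the input.
    """
    stream = io.BytesIO(data)
    if stream.read(2) != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = []
    while True:
        head = stream.read(2)
        if len(head) < 2:
            break  # clean end of input
        if head[0] != 0xFF:
            raise ValueError("expected 0xFF at start of segment")
        kind = head[1]
        if kind == SOS:
            # Same behaviour as FastReader: take everything that is left.
            segments.append((kind, stream.read()))
            break
        (size,) = struct.unpack(">H", stream.read(2))
        # The size includes the two size bytes, hence the "- 2".
        segments.append((kind, stream.read(size - 2)))
    return segments


sample = b"\xff\xd8" + b"\xff\xe0\x00\x04ab" + b"\xff\xda" + b"scan data\xff\xd9"
segments = parse_jpeg_segments(sample)
```

Here `segments` holds one ordinary segment with a two-byte payload, followed by the SOS segment carrying everything up to the end of the input.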

@MostAwesomeDude
Owner

Hi,

I'm not sure about the FastReader, as I still don't grok that section of Construct yet.

There is a PascalString, in construct.macros, which takes a length_field as a kwarg. An example usage:

>>> from construct import PascalString, UBInt16
>>> s = PascalString("hurp", length_field=UBInt16("length"))
>>> s.parse("\x00\x05Hello")
'Hello'

Thanks for your comments. Let me know if you have any patches you wish to contribute.

@akvadrako
Author

Hi - the issue with the PascalString is that its length field doesn't count the bytes that make up the length field itself. In several protocols we get fields like 0x0004babe, where the length (4) includes the first 2 bytes.
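
To make the 0x0004babe case concrete, here is a minimal sketch using only the `struct` module. The names `parse_inclusive` and `build_inclusive` are hypothetical, and a big-endian 16-bit length is assumed to match the `UBInt16('size')` used earlier in the thread.

```python
import struct


def parse_inclusive(buf):
    """Parse a field whose 16-bit big-endian length counts its own 2 bytes."""
    (length,) = struct.unpack_from(">H", buf, 0)
    if length < 2:
        raise ValueError("inclusive length cannot be smaller than the length field")
    return buf[2:length]


def build_inclusive(payload):
    """Inverse of parse_inclusive: prepend a length that includes itself."""
    return struct.pack(">H", len(payload) + 2) + payload


payload = parse_inclusive(b"\x00\x04\xba\xbe")  # length 4 = 2 length bytes + 2 data bytes
rebuilt = build_inclusive(payload)
```

The payload is the two bytes after the length field, and building it again reproduces the original four bytes.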

@tomerfiliba

@akvadrako: this could be done like so

>>> s=PascalString("data", ExprAdapter(ULInt16("length"), 
...    lambda val, ctx: val + 2, lambda val, ctx: val - 2))
>>> s.parse("\x05\x00helloxxxx")
'hel'
>>> s.build("foo")
'\x05\x00foo'

on the other hand, your straightforward solution is better.

as for your FastReader class -- i would consider it bad design. i understand you simply wanted to read everything in, but it's not predictable (you can't tell how much it will read or write) and thus not symmetric. for instance, the following construct would work only in one direction:

Struct("a", 
    FastReader("blob"),
    UBInt32("x"),
)

you would be able to build anything you want, but you'll never be able to parse it back.
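
The one-way behaviour can be reproduced without the library. The sketch below is an assumption-laden stand-in for the struct above, not construct itself: `build_a` and `parse_a` are hypothetical names, and the greedy field is modelled as a plain `stream.read()`.

```python
import io
import struct


def build_a(blob, x):
    """Write a greedy blob followed by a big-endian 32-bit integer."""
    return blob + struct.pack(">I", x)


def parse_a(data):
    """Read the blob greedily, then try to read the 32-bit integer."""
    stream = io.BytesIO(data)
    blob = stream.read()   # consumes everything, including the bytes of x
    tail = stream.read(4)  # nothing is left for x
    if len(tail) < 4:
        raise ValueError("no bytes left for 'x' after the greedy read")
    return blob, struct.unpack(">I", tail)[0]


encoded = build_a(b"blob", 7)
try:
    parse_a(encoded)
    round_trips = True
except ValueError:
    round_trips = False
```

Building succeeds, but parsing the result fails because the greedy read has already swallowed the four bytes that belong to `x`.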

@akvadrako
Author

I suggested a variant to PascalString because length+data is common in network protocols and apparently JPEG too.

FastReader is the best we can do with construct's internals. Your example wouldn't work with RepeatUntil and Range either. I'm not sure it should, since constructs would need to know about the constructs that follow them, and you'd get ambiguity:

Struct("a", 
    GreedyRange("b"),
    GreedyRange("c"),
)

Probably better to make a FastReadUntil('BOUNDARY').
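
A boundary-based reader along those lines might look roughly like the sketch below. `read_until` is a hypothetical name, not an existing construct class, and the sketch assumes a seekable stream that is small enough to read into memory.

```python
import io


def read_until(stream, boundary):
    """Read from stream up to, but not including, the first boundary.

    Leaves the stream positioned at the start of the boundary so that a
    following field can still be parsed.
    """
    start = stream.tell()
    rest = stream.read()
    index = rest.find(boundary)
    if index < 0:
        stream.seek(start)  # restore the position before reporting failure
        raise ValueError("boundary not found")
    stream.seek(start + index)
    return rest[:index]


stream = io.BytesIO(b"scan data\xff\xd9trailer")
blob = read_until(stream, b"\xff\xd9")
after = stream.read(2)
```

Because the reader stops at a known marker instead of at EOF, whatever comes after the blob remains parseable, which avoids the ambiguity of two greedy fields in a row.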

@MostAwesomeDude
Owner

Length + data is perfectly serviced by PascalString; the case where the length of the length is included in the length is actually rather uncommon though. Maybe a new String subclass is needed for it.

As far as "fast" reading, why not examine other optimizations first? There are optimization opportunities in Construct core, I think.

@tomerfiliba

@MostAwesomeDude: no need to subclass, it would be much simpler to just define an InclusivePascalString "macro" that takes care of subtracting/adding the size of the length field from/to the length.

@akvadrako: your "fast" reader isn't any faster than the plain old Field, except that it doesn't check the length. since this greedy construct can only appear once, at the end of a data structure, i don't suppose it would make much difference in terms of speed. also, my tests back in the day showed that psyco can speed up parsing tenfold.

on the other hand, as you said, it poses a problem of breaking the symmetry between parsing and building... but i think it's inherent to the pattern and there isn't any real solution.

@akvadrako
Author

it's much faster - without it, construct is unusable for parsing JPEG images, where 99% of the data is an unbounded blob at the end of the file.

@tomerfiliba

if you're using GreedyRange, then yes, it would be much faster. i was talking about Field. on the other hand, Field must have a predetermined length, so it's not suitable for your purpose.

what do you mean, though, that 99% of the file is a blob? doesn't it have an internal structure? if so, i assume you have no real interest in it, so you may want to use OnDemand, so it will actually be read only when asked for.

@akvadrako
Author

Yes, you are correct. OnDemand doesn't help though, because it requires a known length.

@tomerfiliba

well, i just had an idea: assuming you're working on a file/stringIO, you can write a construct that simply returns the remaining length till EOF. e.g.

p=stream.tell()
stream.seek(0, 2)
p2=stream.tell()
stream.seek(p)
return p2-p

and then you could combine it with Field and OnDemand.
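
Wrapped in a function, the snippet above might look like this. It is a standard-library sketch only: `remaining_length` is a hypothetical name, and how it would be wired into Field and OnDemand is left open, as in the comment.

```python
import io
import os


def remaining_length(stream):
    """Return the number of bytes between the current position and EOF.

    The stream position is restored before returning.
    """
    position = stream.tell()
    stream.seek(0, os.SEEK_END)  # same as seek(0, 2)
    end = stream.tell()
    stream.seek(position)
    return end - position


stream = io.BytesIO(b"\xff\xd8" + b"x" * 10)
stream.read(2)  # pretend the header has already been parsed
left = remaining_length(stream)
```

The helper only works on seekable streams such as files or `BytesIO`, which matches the file/stringIO assumption stated above.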


3 participants