We use tap-marketo, and have noticed several performance issues with it.
The ETL Sub Process Timeline
To visualize the issue, let me first break down the various stages of the ETL workflow for tap-marketo's leads stream. The extract-load process is essentially the following:
do while `export_end < job_start`:

- request Marketo to create the export file and wait for it to be ready (1: dark blue)
- download the generated CSV file (2: orange)
- parse the CSV file row by row and call `singer.write_record` (3: red)
- finish the tap process and generate the output file (5: light blue)
- copy the file to S3 (6: green)
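The loop above can be sketched roughly as follows. This is an illustrative outline only: the function and parameter names, and the 30-day export window, are assumptions for the sketch, not tap-marketo's actual identifiers.

```python
import datetime

def run_export_loop(export_start, job_start, request_export, download_csv, parse_and_write):
    """Sketch of the leads-stream extract-load loop (names are illustrative)."""
    export_end = export_start
    while export_end < job_start:
        export_end = min(export_start + datetime.timedelta(days=30), job_start)
        # Step 1: ask Marketo to create the export file and wait for it
        export_id = request_export(export_start, export_end)
        # Step 2: download the generated CSV file
        csv_path = download_csv(export_id)
        # Step 3: parse the CSV row by row, calling singer.write_record per row
        parse_and_write(csv_path)
        # Advance the window for the next iteration
        export_start = export_end
```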
Description of Visualization
- The time required for step 1 is dictated by Marketo, and we don't have any options to optimize it from the perspective of the tap.
- Step 2 is governed by network bandwidth.
- Notice how step 3 (parsing the CSV) takes an inordinate amount of time; that is something we should be able to optimize.
- Notice how steps 1, 2, and 3 are repeated a second time. It seems to me that the while loop is not terminating when it should.
Optimization Opportunity
Investigating the codebase, I found that parsing the CSV has the following optimization opportunities:
- Reading the CSV involves an extra generator expression with a call to `str.replace`. This cost increases with the number of downloaded rows.
- For each row, the tap calls `format_values`, which in turn builds `available_fields` and calls `format_value` on each field. It stands to reason that we should optimize `format_value`, as it is executed in the innermost loop; its cost increases as the product of downloaded rows and fields per row. `format_value` uses several inefficient constructs, described below.
- Determining `available_fields` in `format_values` seems inefficient as well, as the fields are not going to change for a given stream. This cost increases with the number of downloaded rows.
- The extra loop iteration seems to be due to `export_end` getting millisecond-truncated: `export_start` is set to `export_end` at the end of the while loop, and the comparison `export_end < job_start` returns true one extra time because `job_start` is not millisecond-truncated, and thus is almost always a few milliseconds greater than `export_end`.
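The off-by-one comparison can be demonstrated in isolation. The `truncate_ms` helper below is a hypothetical stand-in for however the tap drops sub-second precision; the point is only the asymmetry between a truncated `export_end` and a full-precision `job_start`:

```python
import datetime

def truncate_ms(dt):
    # Hypothetical stand-in for the tap's truncation: drop sub-second precision.
    return dt.replace(microsecond=0)

job_start = datetime.datetime(2024, 1, 10, 12, 0, 0, 123456)  # full precision
export_end = truncate_ms(job_start)  # the window has caught up to job_start

# The loop condition still holds, so one extra export cycle runs:
assert export_end < job_start

# Truncating both sides makes the comparison terminate the loop on time:
assert not truncate_ms(export_end) < truncate_ms(job_start)
```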
Opportunity to add more user-configuration
`ATTRIBUTION_WINDOW_DAYS = 1` is used to set the start of the extract back by one day. While this is a good default for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day: we re-extract the same records multiple times, leading to additional data processing within the database.
It would be useful to be able to override this hard-coded value through a parameter.
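A minimal sketch of such an override; the `attribution_window_days` config key is hypothetical, not an existing tap-marketo setting:

```python
# Current hard-coded default, kept as the fallback.
ATTRIBUTION_WINDOW_DAYS = 1

def get_attribution_window_days(config):
    # Hypothetical config key: lets frequent extractions shrink the window.
    return int(config.get("attribution_window_days", ATTRIBUTION_WINDOW_DAYS))
```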
Internally at my workplace, we have fixed the above-mentioned issues and have seen our runtime drop by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.
Opportunity to optimize format_value
`isinstance` is not required in this method. `csv.reader` (or `DictReader`) will only return strings in our usage, and it's possible that it might return floats if `QUOTE_NONNUMERIC` is specified. Quoting the csv module's documentation:

> Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
The warning message feels unnecessary. Evidently the conversion to integer is correct, and it is Marketo's standard practice to format percent fields as decimals. The warning emitted for each percent field of each row imposes an undue, linearly increasing cost.
Conversion to int is twice as fast with `int(float(value))`. This sequence of operations would still work if `value` were of a non-string type (recall that `csv.reader` returns strings by default but could also return floats if `QUOTE_NONNUMERIC` is set), which the current method is unable to handle because the `find` method only exists for strings. `str.find` is slow (or at least slow for our critical loop), and the added if/else adds cost that we should try to avoid.
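To illustrate the correctness side (the speed claim is worth benchmarking in your own environment), `int(float(...))` handles every input shape in one expression, with no `str.find` branch:

```python
# int() alone rejects decimal strings:
try:
    int("72.0")
    raised = False
except ValueError:
    raised = True
assert raised

# int(float(...)) handles integer strings, decimal strings, and actual
# floats (the QUOTE_NONNUMERIC case) uniformly:
for value in ("72", "72.0", 72.0):
    assert int(float(value)) == 72
```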
The null-value early return could leverage a set for better lookup performance. We should define a global constant (like below) and use it for the lookup. Even compared with a list of size 3, a set lookup will be more efficient, and making the constant global means the reference data is allocated only once per process.
```python
NULL_VALUES = {None, '', 'null'}
```
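A sketch of how the early return might use it; the function body is illustrative, not the tap's actual `format_value`:

```python
NULL_VALUES = {None, '', 'null'}  # module-level: allocated once per process

def format_value_sketch(value):
    # Early return on any null-ish marker via a single O(1) set lookup.
    if value in NULL_VALUES:
        return None
    return value
```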