We use tap-marketo, and have noticed several performance issues with it.
The ETL Sub Process Timeline
To visualize the issue, let me first break down the various stages of the ETL workflow for tap-marketo's leads stream. The extract-load process is essentially the following:
do while `export_end < job_start`:

- request Marketo to create the export file and wait for it to be ready (1: dark blue)
- download the generated CSV file (2: orange)
- parse the CSV file row by row and call `singer.write_record` (3: red)
- finish the tap process and generate the output file (5: light blue)
- copy the file to S3 (6: green)
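The loop above can be sketched roughly as follows. This is an illustrative outline only: the function and parameter names, and the 30-day export window, are assumptions for the sketch, not tap-marketo's actual identifiers.

```python
import datetime

def run_export_loop(export_start, job_start, request_export, download_csv, parse_and_write):
    """Sketch of the leads-stream extract-load loop (names are illustrative)."""
    export_end = export_start
    while export_end < job_start:
        export_end = min(export_start + datetime.timedelta(days=30), job_start)
        # Step 1: ask Marketo to create the export file and wait for it
        export_id = request_export(export_start, export_end)
        # Step 2: download the generated CSV file
        csv_path = download_csv(export_id)
        # Step 3: parse the CSV row by row, calling singer.write_record per row
        parse_and_write(csv_path)
        # Advance the window for the next iteration
        export_start = export_end
```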
Description of Visualization
- The time required for step 1 is dictated by Marketo, and we don't have any options to optimize it from the perspective of the tap.
- Step 2 is governed by network bandwidth.
- Notice how step 3 (parsing the CSV) takes an inordinate amount of time; that is something we should be able to optimize.
- Notice how steps 1, 2, and 3 are repeated a second time. It seems to me that the while loop is not terminating when it should.
Optimization Opportunity
Investigating the codebase, I found that parsing the CSV has the following optimization opportunities:
- Reading the CSV involves an extra generator expression with a call to `str.replace`. This cost increases with the number of downloaded rows.
- For each row, the tap calls `format_values`, which in turn builds `available_fields` and calls `format_value` on each field. It stands to reason that we should optimize `format_value`, as it is executed in the innermost loop; its cost increases as the product of downloaded rows and fields per row. `format_value` uses several inefficient constructs, described below.
- Determining `available_fields` in `format_values` seems inefficient as well, as the fields are not going to change for a given stream. This cost increases with the number of downloaded rows.
- The extra loop iteration seems to be due to `export_end` getting millisecond-truncated: `export_start` is set to `export_end` at the end of the while loop, and the comparison `export_end < job_start` returns true one extra time because `job_start` is not millisecond-truncated, and thus is almost always a few milliseconds greater than `export_end`.
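The off-by-one comparison can be demonstrated in isolation. The `truncate_ms` helper below is a hypothetical stand-in for however the tap drops sub-second precision; the point is only the asymmetry between a truncated `export_end` and a full-precision `job_start`:

```python
import datetime

def truncate_ms(dt):
    # Hypothetical stand-in for the tap's truncation: drop sub-second precision.
    return dt.replace(microsecond=0)

job_start = datetime.datetime(2024, 1, 10, 12, 0, 0, 123456)  # full precision
export_end = truncate_ms(job_start)  # the window has caught up to job_start

# The loop condition still holds, so one extra export cycle runs:
assert export_end < job_start

# Truncating both sides makes the comparison terminate the loop on time:
assert not truncate_ms(export_end) < truncate_ms(job_start)
```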
Opportunity to add more user-configuration
`ATTRIBUTION_WINDOW_DAYS = 1` is used to set the start of the extract back by one day. While this is a good default for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day: we re-extract the same records multiple times, leading to additional data processing within the database.
It would be useful to be able to override this hard-coded value through a parameter.
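A minimal sketch of such an override; the `attribution_window_days` config key is hypothetical, not an existing tap-marketo setting:

```python
# Current hard-coded default, kept as the fallback.
ATTRIBUTION_WINDOW_DAYS = 1

def get_attribution_window_days(config):
    # Hypothetical config key: lets frequent extractions shrink the window.
    return int(config.get("attribution_window_days", ATTRIBUTION_WINDOW_DAYS))
```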
Internally at my workplace, we have fixed the above-mentioned issues and have seen our runtime drop by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.
Opportunity to optimize format_value
`isinstance` is not required in this method. `csv.reader` (or `DictReader`) will only return strings in our usage, and it's possible that it might return floats if `QUOTE_NONNUMERIC` is specified. Quoting the csv module's documentation:

> Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
The warning message feels unnecessary. Evidently the conversion to integer is correct, and it is Marketo's standard practice to format percent fields as decimals. The warning emitted for each percent field of each row imposes an undue, linearly increasing cost.
Conversion to int is twice as fast with `int(float(value))`. This sequence of operations would still work if `value` were of a non-string type (recall that `csv.reader` returns strings by default but could also return floats if `QUOTE_NONNUMERIC` is set), which the current method is unable to handle because the `find` method only exists for strings. `str.find` is slow (or at least slow for our critical loop), and the added if/else adds cost that we should try to avoid.
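To illustrate the correctness side (the speed claim is worth benchmarking in your own environment), `int(float(...))` handles every input shape in one expression, with no `str.find` branch:

```python
# int() alone rejects decimal strings:
try:
    int("72.0")
    raised = False
except ValueError:
    raised = True
assert raised

# int(float(...)) handles integer strings, decimal strings, and actual
# floats (the QUOTE_NONNUMERIC case) uniformly:
for value in ("72", "72.0", 72.0):
    assert int(float(value)) == 72
```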
The null-value early return could leverage a set for better lookup performance. We should define a global constant (like below) and use it for the lookup. Even compared with a list of size 3, a set lookup will be more efficient, and making the constant global means the reference data is allocated only once per process.
```python
NULL_VALUES = {None, '', 'null'}
```
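A sketch of how the early return might use it; the function body is illustrative, not the tap's actual `format_value`:

```python
NULL_VALUES = {None, '', 'null'}  # module-level: allocated once per process

def format_value_sketch(value):
    # Early return on any null-ish marker via a single O(1) set lookup.
    if value in NULL_VALUES:
        return None
    return value
```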