Commit

Update documentation
jamescooke committed Feb 18, 2024
1 parent f4b657e commit a7d298e
Showing 9 changed files with 294 additions and 4 deletions.
2 changes: 1 addition & 1 deletion CNAME
Original file line number Diff line number Diff line change
@@ -1 +1 @@
jamescooke.info
jamescooke.info
2 changes: 2 additions & 0 deletions archives.html
@@ -44,6 +44,8 @@ <h1 class="blogtitle">
<h1>Archives for James Cooke</h1>

<dl>
<dt>Feb 18, 2024</dt>
<dd><a href="https://jamescooke.info/missing-tiny-data-breaks-pipeline.html">Missing tiny data breaks&nbsp;pipeline</a></dd>
<dt>Aug 29, 2023</dt>
<dd><a href="https://jamescooke.info/hledger-failure-messages-are-better-than-ledgers.html">hledger failure messages are better than&nbsp;Ledger&#8217;s</a></dd>
<dt>Jul 26, 2023</dt>
12 changes: 12 additions & 0 deletions author/james.html
@@ -47,6 +47,18 @@ <h1>
Latest posts
</h1>

<div class="row">
<div class="span1">
<p class="postdate">Feb 18, 2024</p>
</div>
<div class="span4">
<h2 class="small">
<a href="https://jamescooke.info/missing-tiny-data-breaks-pipeline.html" rel='bookmark'>Missing tiny data breaks&nbsp;pipeline</a>
</h2>
<div class="article-excerpt"> <p>At work, when our usage and revenue reporting pipelines fail, they
usually fail because of <em>tiny</em>&nbsp;data.</p> </div>
</div>
</div>
<div class="row">
<div class="span1">
<p class="postdate">Aug 29, 2023</p>
2 changes: 1 addition & 1 deletion authors.html
@@ -43,7 +43,7 @@ <h1 class="blogtitle">

<h1>Authors on James Cooke</h1>
<ul>
<li><a href="https://jamescooke.info/author/james.html">James</a> (40)</li>
<li><a href="https://jamescooke.info/author/james.html">James</a> (41)</li>
</ul>

</div>
2 changes: 1 addition & 1 deletion categories.html
@@ -45,7 +45,7 @@ <h1 class="blogtitle">
<h1>Categories on James Cooke</h1>
<ul>
<li><a href="https://jamescooke.info/category/accounting.html">Accounting</a> (1)</li>
<li><a href="https://jamescooke.info/category/code.html">Code</a> (26)</li>
<li><a href="https://jamescooke.info/category/code.html">Code</a> (27)</li>
<li><a href="https://jamescooke.info/category/github-contributions.html">GitHub Contributions</a> (5)</li>
<li><a href="https://jamescooke.info/category/python.html">Python</a> (1)</li>
<li><a href="https://jamescooke.info/category/talk.html">Talk</a> (5)</li>
12 changes: 12 additions & 0 deletions category/code.html
@@ -46,6 +46,18 @@ <h1 class="blogtitle">
<h1>
Posts in 'Code' </h1>

<div class="row">
<div class="span1">
<p class="postdate">Feb 18, 2024</p>
</div>
<div class="span4">
<h2 class="small">
<a href="https://jamescooke.info/missing-tiny-data-breaks-pipeline.html" rel='bookmark'>Missing tiny data breaks&nbsp;pipeline</a>
</h2>
<div class="article-excerpt"> <p>At work, when our usage and revenue reporting pipelines fail, they
usually fail because of <em>tiny</em>&nbsp;data.</p> </div>
</div>
</div>
<div class="row">
<div class="span1">
<p class="postdate">Dec 19, 2022</p>
88 changes: 87 additions & 1 deletion feeds/all.atom.xml
@@ -1,5 +1,91 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>James Cooke</title><link href="https://jamescooke.info/" rel="alternate"></link><link href="https://jamescooke.info/feeds/all.atom.xml" rel="self"></link><id>https://jamescooke.info/</id><updated>2023-08-29T00:00:00+01:00</updated><entry><title>hledger failure messages are better than Ledger’s</title><link href="https://jamescooke.info/hledger-failure-messages-are-better-than-ledgers.html" rel="alternate"></link><published>2023-08-29T00:00:00+01:00</published><updated>2023-08-29T00:00:00+01:00</updated><author><name>James</name></author><id>tag:jamescooke.info,2023-08-29:/hledger-failure-messages-are-better-than-ledgers.html</id><summary type="html">&lt;p&gt;About six months ago, I upgraded our family accounts from Ledger to
<feed xmlns="http://www.w3.org/2005/Atom"><title>James Cooke</title><link href="https://jamescooke.info/" rel="alternate"></link><link href="https://jamescooke.info/feeds/all.atom.xml" rel="self"></link><id>https://jamescooke.info/</id><updated>2024-02-18T00:00:00+00:00</updated><entry><title>Missing tiny data breaks pipeline</title><link href="https://jamescooke.info/missing-tiny-data-breaks-pipeline.html" rel="alternate"></link><published>2024-02-18T00:00:00+00:00</published><updated>2024-02-18T00:00:00+00:00</updated><author><name>James</name></author><id>tag:jamescooke.info,2024-02-18:/missing-tiny-data-breaks-pipeline.html</id><summary type="html">&lt;p&gt;At work, when our usage and revenue reporting pipelines fail, they
usually fail because of &lt;em&gt;tiny&lt;/em&gt;&amp;nbsp;data.&lt;/p&gt;</summary><content type="html">&lt;p&gt;This week, during our monthly reporting run, two major label licensing reports
failed validation. This was unexpected because usually all reports are generated
and validate just&amp;nbsp;fine.&lt;/p&gt;
&lt;p&gt;It turned out a row of advertising revenue was missed for the United States
Minor Outlying Islands (&lt;span class="caps"&gt;UMI&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;That missed row was worth just £ 0.0003.&amp;nbsp;🙀&lt;/p&gt;
&lt;h2&gt;👌 This is tiny tiny&amp;nbsp;data&lt;/h2&gt;
&lt;p&gt;At work (&lt;a href="https://www.mixcloud.com/"&gt;Mixcloud&lt;/a&gt;) we generate usage reports for
major labels on a monthly basis. The&amp;nbsp;pipeline:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;identifies, reports and pays royalties out on tens of millions of tracks,
played by millions of Mixcloud creators, and owned by hundreds of thousands
of different artists and songwriters.
&lt;a href="https://blog.mixcloud.com/2021/06/30/why-mixcloud-doesnt-offer-on-demand-video-vods/"&gt;Via Mixcloud&amp;nbsp;blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This missing row was &amp;#8220;tiny&amp;#8221; by many&amp;nbsp;definitions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It was a tiny territory that I had to &lt;a href="https://en.wikipedia.org/wiki/United_States_Minor_Outlying_Islands"&gt;look up on
Wikipedia&lt;/a&gt;.
Turns out the population is about 300&amp;nbsp;people.&lt;/li&gt;
&lt;li&gt;It was a tiny amount of revenue that would get rounded out of existence at
payout time. It would literally make zero change to the total payout for the
month to any&amp;nbsp;label.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We often use a 0.1 % sense check to define edge cases when working out which
bugs and issues to put effort into, and by every definition, this missing
row was less than 0.1 % of all sorts of monthly&amp;nbsp;factors.&lt;/p&gt;
&lt;h2&gt;🔥 But the pipeline&amp;nbsp;failed&lt;/h2&gt;
&lt;p&gt;A long time ago, I realised that we needed to validate the reports generated
&lt;em&gt;before&lt;/em&gt; they were sent to partners. So we built a post-process validation
system. This checks the generated reports from the client perspective,
providing row-wise, file-wise and batch-wise&amp;nbsp;validation.&lt;/p&gt;
&lt;p&gt;One of these checks ensures that advertising revenue is reported in &lt;span class="caps"&gt;GBP&lt;/span&gt; £.
However, because we had a missing row for the United States Minor Outlying
Islands (&lt;span class="caps"&gt;UMI&lt;/span&gt;), the reported advertising-based usage row became &lt;span class="caps"&gt;USD&lt;/span&gt; $ and failed&amp;nbsp;validation.&lt;/p&gt;
&lt;p&gt;Under the hood, this happened because we have a &lt;code&gt;LEFT JOIN&lt;/code&gt; between revenue and
usage which wasn&amp;#8217;t populated on the revenue side because the &lt;span class="caps"&gt;UMI&lt;/span&gt; row was&amp;nbsp;missing.&lt;/p&gt;
&lt;h2&gt;🛑 When there&amp;#8217;s a validation failure, everything&amp;nbsp;stops&lt;/h2&gt;
&lt;p&gt;When the generated reports with $ 0 amounts of advertising revenue hit our
validators, they fail for the partners whose reports contain enough detail to
see that revenue and currency information. Even though this affected just two
partners, when we receive those validation errors in the pipeline, the monthly
production&amp;nbsp;stops.&lt;/p&gt;
&lt;p&gt;We keep the generated reports, but work to find out the cause of the error and
assess how many generated reports are&amp;nbsp;tainted.&lt;/p&gt;
&lt;h2&gt;🔧 Fix and&amp;nbsp;regenerate&lt;/h2&gt;
&lt;p&gt;This time the error was, as discussed, tiny. And the fix was pretty tiny too.
We generated an extra row of revenue for &lt;span class="caps"&gt;UMI&lt;/span&gt; worth £ 0.0001 and spliced it back
into our monthly source data&amp;nbsp;snapshots.&lt;/p&gt;
&lt;p&gt;Then we reran all partners that receive reports on Mixcloud&amp;#8217;s ad-funded usage
and our ops colleagues got our monthly production process back up to&amp;nbsp;speed.&lt;/p&gt;
&lt;h2&gt;🤔 Is this kind of behaviour a &amp;#8220;good&amp;#8221;&amp;nbsp;thing?&lt;/h2&gt;
&lt;p&gt;After this incident, I&amp;#8217;m left wondering if it&amp;#8217;s &lt;span class="caps"&gt;OK&lt;/span&gt; that our pipeline is halted
by a missing row worth less than a penny that wouldn&amp;#8217;t affect monthly&amp;nbsp;payouts.&lt;/p&gt;
&lt;h3&gt;This is&amp;nbsp;good&lt;/h3&gt;
&lt;p&gt;On the &amp;#8220;good&amp;#8221; side, we could&amp;nbsp;say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All the main sources of error are stable, it&amp;#8217;s just the tiny edge cases that
are&amp;nbsp;failing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In addition, these failures are so rare that we are often surprised when things
fail. Plus, it&amp;#8217;s good that we have the validation in place that finds this
kind of error and reports&amp;nbsp;it.&lt;/p&gt;
&lt;h3&gt;This is&amp;nbsp;bad&lt;/h3&gt;
&lt;p&gt;On the other hand, we could&amp;nbsp;say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The pipelines are so fragile that a tiny missing piece of revenue allocated
to a user in a territory can bring down a monthly reporting&amp;nbsp;run.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There also seems to be some truth in&amp;nbsp;this.&lt;/p&gt;
&lt;p&gt;Probably the &lt;code&gt;LEFT JOIN&lt;/code&gt; in our revenue pipeline that caused the &lt;span class="caps"&gt;USD&lt;/span&gt; row to
appear is not robust enough. And as we&amp;#8217;ve dug more into the error later in the
week, my colleague Tim might have found a scenario that we would never be able
to prevent without strengthening this revenue query&amp;#8217;s &lt;span class="caps"&gt;SQL&lt;/span&gt;.&lt;/p&gt;
&lt;h2&gt;⭐ Turn the bad into&amp;nbsp;good&lt;/h2&gt;
&lt;p&gt;What I realised is that the failure is a gift in disguise - it&amp;#8217;s helped us to
see a flaw in the pipeline that&amp;#8217;s so often hidden by aggregation. Instead of
resting on our laurels, we have an opportunity to improve the robustness and
accuracy of our revenue pipeline, plus a new test case to add to our test&amp;nbsp;suite.&lt;/p&gt;
&lt;p&gt;As a result of this error, we&amp;#8217;re also planning to adjust the source of the
missing row. This is currently a manual monthly process, but we&amp;#8217;ve seen that it
might be better incorporated into our pipeline directly, which we think will
give more&amp;nbsp;stability.&lt;/p&gt;
&lt;p&gt;So, if you happen to be that Mixcloud user in the United States Minor Outlying
Islands who listened in January - thanks so much. Your unusual pattern of
listening really helped us out.&amp;nbsp;😊&lt;/p&gt;</content><category term="Code"></category></entry><entry><title>hledger failure messages are better than Ledger’s</title><link href="https://jamescooke.info/hledger-failure-messages-are-better-than-ledgers.html" rel="alternate"></link><published>2023-08-29T00:00:00+01:00</published><updated>2023-08-29T00:00:00+01:00</updated><author><name>James</name></author><id>tag:jamescooke.info,2023-08-29:/hledger-failure-messages-are-better-than-ledgers.html</id><summary type="html">&lt;p&gt;About six months ago, I upgraded our family accounts from Ledger to
hledger. The &lt;span class="caps"&gt;CLI&lt;/span&gt; &lt;span class="caps"&gt;API&lt;/span&gt; of hledger is better than that of Ledger and the
feedback received when a balance assertion fails is just one&amp;nbsp;example.&lt;/p&gt;</summary><content type="html">&lt;p&gt;For any new plain text accounting project I always recommend using
&lt;a href="https://hledger.org/"&gt;hledger&lt;/a&gt; over
12 changes: 12 additions & 0 deletions index.html
@@ -47,6 +47,18 @@ <h1>
Latest posts
</h1>

<div class="row">
<div class="span1">
<p class="postdate">Feb 18, 2024</p>
</div>
<div class="span4">
<h2 class="small">
<a href="https://jamescooke.info/missing-tiny-data-breaks-pipeline.html" rel='bookmark'>Missing tiny data breaks&nbsp;pipeline</a>
</h2>
<div class="article-excerpt"> <p>At work, when our usage and revenue reporting pipelines fail, they
usually fail because of <em>tiny</em>&nbsp;data.</p> </div>
</div>
</div>
<div class="row">
<div class="span1">
<p class="postdate">Aug 29, 2023</p>
0 comments on commit a7d298e