Add github:artifact resource that unzips the doubly zipped files #1799

Draft: wants to merge 1 commit into trunk

@adamziel (Collaborator) commented Sep 24, 2024

Motivation for the change, related issues

GitHub artifacts are difficult to use as Blueprint resources. Because they are doubly zipped archives, a Blueprint has to download the outer zip, extract it, and only then install the plugin from the inner zip. This PR introduces a new resource type called github:artifact that handles the peeling for the developer.

With this PR, a Blueprint may look like this:

{
    "steps": [
        {
            "step": "installPlugin",
            "pluginZipFile": {
                "resource": "github:artifact",
                "owner": "WordPress",
                "repo": "gutenberg",
                "workflow": "Build Gutenberg Plugin Zip",
                "artifact": "gutenberg-plugin",
                "pr": 65590
            }
        }
    ]
}

Instead of this:

{
	"steps": [
		{
			step: 'mkdir',
			path: '/wordpress/pr',
		},
		/*
		 * This is the most important step.
		 * It downloads the built plugin zip file exposed by GitHub CI.
		 *
		 * Because the zip file is not publicly accessible, we use the
		 * plugin-proxy API endpoint to download it. The source code of
		 * that endpoint is available at:
		 * https://github.com/WordPress/wordpress-playground/blob/trunk/packages/playground/website/public/plugin-proxy.php
		 */
		{
			step: 'writeFile',
			path: '/wordpress/pr/pr.zip',
			data: {
				resource: 'url',
				url: zipArtifactUrl,
				caption: `Downloading Gutenberg PR ${prNumber}`,
			},
			progress: {
				weight: 2,
				caption: `Applying Gutenberg PR ${prNumber}`,
			},
		},
		/**
		 * GitHub CI artifacts are doubly zipped:
		 *
		 * pr.zip
		 *    gutenberg.zip
		 *       gutenberg.php
		 *       ... other files ...
		 *
		 * This step extracts the inner zip file so that we get
		 * access directly to gutenberg.zip and can use it to
		 * install the plugin.
		 */
		{
			step: 'unzip',
			zipPath: '/wordpress/pr/pr.zip',
			extractToPath: '/wordpress/pr',
		},
		{
			step: 'installPlugin',
			pluginData: {
				resource: 'vfs',
				path: '/wordpress/pr/gutenberg.zip',
			},
		}
	]
}
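
For reference, here is a rough TypeScript sketch of the shape of the `github:artifact` reference used in the first Blueprint above. The field names come from that example. The `GitHubArtifactReference` type and the `isGitHubArtifactReference` guard are hypothetical names used for illustration and are not necessarily what this PR implements.

```typescript
// Hypothetical type: field names mirror the Blueprint example above,
// everything else (names, strictness of the checks) is an assumption.
interface GitHubArtifactReference {
    resource: 'github:artifact';
    owner: string;
    repo: string;
    workflow: string;
    artifact: string;
    pr: number;
}

// Runtime check that an unknown value looks like a github:artifact reference.
function isGitHubArtifactReference(
    value: unknown
): value is GitHubArtifactReference {
    if (typeof value !== 'object' || value === null) {
        return false;
    }
    const ref = value as Record<string, unknown>;
    return (
        ref.resource === 'github:artifact' &&
        typeof ref.owner === 'string' &&
        typeof ref.repo === 'string' &&
        typeof ref.workflow === 'string' &&
        typeof ref.artifact === 'string' &&
        typeof ref.pr === 'number' &&
        Number.isInteger(ref.pr)
    );
}
```

A guard like this would let a Blueprint compiler reject a malformed reference before any network request is made.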

Closes #1796

Remaining work

  • Fix this issue in Safari: TypeError: ReadableStreamBYOBReader needs a ReadableByteStreamController
  • Add unit tests and run them in Node.js v18, v20, and v22
  • Test in Firefox

Follow-up work

Add first-class Zip64 support to the stream compression package. Right now we're wiring it together manually in the Resource class.

GitHub artifacts are compressed as Zip64, and their local file headers list 0 as the compressed size, so we cannot simply iterate through the file entries. Instead, we must first read the end of central directory record, then the central directory itself, and then use that information to find and unzip the right file entry.

Technically, this requires buffering the entire response stream and repeatedly teeing it to seek to the end of central directory record, then to the central directory, and then to the right file.
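
To make the lookup order concrete, here is a minimal sketch of the first step, assuming the whole archive is already buffered in a `Uint8Array`. It is not the code from this PR: `locateCentralDirectory` and `readUint64` are hypothetical helpers, and the byte offsets follow the ZIP application note. It scans backwards for the end of central directory record and, when a Zip64 locator precedes it, reads the 64-bit values from the Zip64 record instead.

```typescript
const EOCD_SIGNATURE = 0x06054b50;
const ZIP64_LOCATOR_SIGNATURE = 0x07064b50;
const ZIP64_EOCD_SIGNATURE = 0x06064b50;

interface CentralDirectoryLocation {
    offset: number; // where the central directory starts
    size: number; // central directory size in bytes
    entries: number; // total number of entries
}

// Reads a little-endian uint64 as a number (exact up to 2^53 - 1).
function readUint64(view: DataView, at: number): number {
    return view.getUint32(at, true) + view.getUint32(at + 4, true) * 0x100000000;
}

function locateCentralDirectory(bytes: Uint8Array): CentralDirectoryLocation {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    // The record is 22 bytes plus an optional comment of up to 65535 bytes.
    const lowest = Math.max(0, bytes.byteLength - 22 - 0xffff);
    for (let at = bytes.byteLength - 22; at >= lowest; at--) {
        if (view.getUint32(at, true) !== EOCD_SIGNATURE) {
            continue;
        }
        let entries = view.getUint16(at + 10, true);
        let size = view.getUint32(at + 12, true);
        let offset = view.getUint32(at + 16, true);
        // In Zip64 archives a 20-byte locator sits right before the record.
        const locatorAt = at - 20;
        if (
            locatorAt >= 0 &&
            view.getUint32(locatorAt, true) === ZIP64_LOCATOR_SIGNATURE
        ) {
            const zip64At = readUint64(view, locatorAt + 8);
            if (
                zip64At + 56 > bytes.byteLength ||
                view.getUint32(zip64At, true) !== ZIP64_EOCD_SIGNATURE
            ) {
                throw new Error('Invalid Zip64 end of central directory record');
            }
            entries = readUint64(view, zip64At + 32);
            size = readUint64(view, zip64At + 40);
            offset = readUint64(view, zip64At + 48);
        }
        return { offset, size, entries };
    }
    throw new Error('End of central directory record not found');
}
```

The returned offset and size say where to read the central directory, whose entries in turn carry the real compressed sizes and local header offsets. A production version would also need to guard against the signature bytes appearing inside the archive comment.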

Testing Instructions (or ideally a Blueprint)

  1. Go to http://localhost:5400/website-server/#{%20%22steps%22:%20[%20{%20%22step%22:%20%22installPlugin%22,%20%22pluginZipFile%22:%20{%20%22resource%22:%20%22github:artifact%22,%20%22owner%22:%20%22WordPress%22,%20%22repo%22:%20%22gutenberg%22,%20%22workflow%22:%20%22Build%20Gutenberg%20Plugin%20Zip%22,%20%22artifact%22:%20%22gutenberg-plugin%22,%20%22pr%22:%2065590%20}%20}%20]%20}
  2. Confirm the Gutenberg plugin was installed from a GitHub artifact
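
A small sketch of how a testing link like the one in step 1 could be generated, assuming the Blueprint travels as URI-encoded JSON in the URL fragment, as in that link. `blueprintToUrl` and `blueprintFromUrl` are hypothetical helpers, not part of the Playground API.

```typescript
// Encodes a Blueprint into the URL fragment as URI-encoded JSON.
function blueprintToUrl(baseUrl: string, blueprint: unknown): string {
    return baseUrl + '#' + encodeURI(JSON.stringify(blueprint));
}

// Reverses blueprintToUrl: takes everything after the first "#" and parses it.
function blueprintFromUrl(url: string): unknown {
    const hashAt = url.indexOf('#');
    if (hashAt === -1) {
        throw new Error('URL has no fragment');
    }
    return JSON.parse(decodeURI(url.slice(hashAt + 1)));
}

const testingUrl = blueprintToUrl('http://localhost:5400/website-server/', {
    steps: [
        {
            step: 'installPlugin',
            pluginZipFile: {
                resource: 'github:artifact',
                owner: 'WordPress',
                repo: 'gutenberg',
                workflow: 'Build Gutenberg Plugin Zip',
                artifact: 'gutenberg-plugin',
                pr: 65590,
            },
        },
    ],
});
```

`JSON.stringify` emits compact JSON, so the generated link is shorter than the hand-written one above while decoding to the same Blueprint.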

@adamziel (Collaborator, Author) commented:

Safari doesn't support BYOB streams to the extent needed to make this work. We'll need to either avoid using BYOB streams or use the web-streams-polyfill (but only in Safari). Here's an example of using the polyfill in another PR:

5973780/packages/php-wasm/stream-compression/src/polyfills.ts

@adamziel (Collaborator, Author) commented:

An easy solution: Ditch BYOB streams. We'd have to rearchitect the stream-compression package to store the state (bytes read so far, buffer, etc.) outside of the stream. That's what the PHP version of ZipStreamReader does. That will also make it easy to seek forward (via teeing the stream).
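
A rough sketch of what "state outside of the stream" could look like, assuming a plain default reader that hands over `Uint8Array` chunks. `ByteQueue`, `ChunkReader`, and `readExactly` are hypothetical names and this is not the stream-compression package's API. The queue tracks the buffered chunks and the number of bytes consumed, so exact-length reads and forward skips need no BYOB reader.

```typescript
// Minimal structural type that a ReadableStreamDefaultReader<Uint8Array> satisfies.
interface ChunkReader {
    read(): Promise<{ done: boolean; value?: Uint8Array }>;
}

class ByteQueue {
    private chunks: Uint8Array[] = [];
    private buffered = 0;
    /** Total number of bytes consumed so far (read or skipped). */
    offset = 0;

    get length(): number {
        return this.buffered;
    }

    push(chunk: Uint8Array): void {
        if (chunk.byteLength > 0) {
            this.chunks.push(chunk);
            this.buffered += chunk.byteLength;
        }
    }

    /** Returns exactly n bytes, or null when fewer than n are buffered. */
    read(n: number): Uint8Array | null {
        if (n > this.buffered) {
            return null;
        }
        const out = new Uint8Array(n);
        let written = 0;
        while (written < n) {
            const head = this.chunks[0]!;
            const take = Math.min(head.byteLength, n - written);
            out.set(head.subarray(0, take), written);
            written += take;
            if (take === head.byteLength) {
                this.chunks.shift();
            } else {
                this.chunks[0] = head.subarray(take);
            }
        }
        this.buffered -= n;
        this.offset += n;
        return out;
    }

    /** Discards n bytes; returns false when fewer than n are buffered. */
    skip(n: number): boolean {
        return this.read(n) !== null;
    }
}

// Pulls chunks from the reader until n bytes are available, then returns them.
async function readExactly(
    reader: ChunkReader,
    queue: ByteQueue,
    n: number
): Promise<Uint8Array> {
    while (queue.length < n) {
        const { done, value } = await reader.read();
        if (done || !value) {
            throw new Error('Stream ended before ' + n + ' bytes were available');
        }
        queue.push(value);
    }
    return queue.read(n)!;
}
```

Because the queue owns the position, the same reader can be handed to `DecompressionStream`-based code afterwards without losing track of where the next header starts.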

@adamziel (Collaborator, Author) commented Sep 25, 2024

A naive approach worked in this commit. I'm leaving this exploration open for now; I need an uninterrupted day or two to get to the bottom of this. Alternatively, a ~10KB JavaScript streaming library that can skip bytes, slice streams, fork them, etc. without relying on BYOB streams, while offering interop with DecompressionStream, would solve it too.

@adamziel force-pushed the trunk branch 2 times, most recently from 680cd19 to 2e376d2 on October 4, 2024 09:24

@adamziel (Collaborator, Author) commented Oct 8, 2024

I suppose in the first version of the github:artifact resource we could do the entire buffering, unzipping, and moving files dance under the hood, and only migrate to streaming once we have Safari support.

adamziel added a commit that referenced this pull request Oct 14, 2024
…ools (#1888)

Let's officially kick off [the Data
Liberation](https://wordpress.org/data-liberation/) efforts under the
Playground umbrella and unlock powerful new use cases for WordPress.

## Rationale

### Why work on Data Liberation?

WordPress core _really_ needs reliable data migration tools. There's
just no reliable, free, open source solution for:

-   Content import and export
-   Site import and export
- Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or
Tumblr -> WordPress
-   Site-to-site synchronization

Yes, there's the WXR content export. However, it won't help you back up a
photography blog full of media files, plugins, API integrations, and
custom tables. There are paid products out there, but nothing in core.

At the same time, so many Playground use-cases are **all about moving
your data**. Exporting your site as a zip archive, migrating between
hosts with the [Data Liberation browser
extension](https://github.com/WordPress/try-wordpress/), creating
interactive tutorials and showcasing beautiful sites using [the
Playground
block](https://wordpress.org/plugins/interactive-code-block/),
previewing Pull Requests, building new themes, and [editing
documentation](#1524)
are just the tip of the iceberg.

### Why do the existing data migration tools fall short?

Moving data around seems easy, but it's a complex problem – consider
migrating links.

Imagine you're moving a site from
[https://my-old-site.com](https://playground-site-1.com) to
[https://my-new-site.com/blog/](https://my-site-2.com). If you just
moved the posts, all the links would still point to the old domain so
you'll need an importer that can adjust all the URLs in your entire
database. However, the typical tools like `preg_replace` or `wp
search-replace` can only replace some URLs correctly. They won't
reliably adjust deeply encoded data, such as this URL inside JSON inside
an HTML comment inside a WXR export:

The only way to perform a reliable replacement here is to carefully
parse each and every data format and replace the relevant parts of the
URL at the bottom of it. That requires four parsers: an XML parser, an
HTML parser, a JSON parser, a WHATWG URL parser. Most of those tools
don't exist in PHP. PHP provides `json_encode()`, which isn't free of
issues, and that's it. You can't even rely on DOMDocument to parse XML
because of its limited availability and non-streaming nature.
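
As a tiny illustration of the layering problem (a sketch under assumptions, not the project's code): the markup below is a made-up block comment, and a regular expression stands in for the real HTML and block parsers that the text calls for. A plain string replacement misses the URL because JSON escapes the slashes, while decoding each layer first finds and rewrites it.

```typescript
const OLD_ORIGIN = 'https://my-old-site.com';
const NEW_BASE = 'https://my-new-site.com/blog';

// Hypothetical block markup: JSON attributes inside an HTML comment.
// The slashes are JSON-escaped, as PHP's json_encode() does by default.
const markup =
    '<!-- wp:image {"src":"https:\\/\\/my-old-site.com\\/wp-content\\/image.png"} -->';

// Naive approach: plain search and replace on the raw text finds nothing.
const naive = markup.split(OLD_ORIGIN).join(NEW_BASE);

// Layered approach: peel the comment, parse the JSON, parse the URL.
function rewriteBlockComment(comment: string): string {
    const match = /^<!-- (wp:[a-z\/-]+) (\{.*\}) -->$/.exec(comment);
    if (!match) {
        return comment;
    }
    const attrs = JSON.parse(match[2]) as Record<string, unknown>;
    for (const key of Object.keys(attrs)) {
        const value = attrs[key];
        if (typeof value !== 'string') {
            continue;
        }
        let parsed;
        try {
            parsed = new URL(value);
        } catch (e) {
            continue; // not an absolute URL
        }
        if (parsed.origin === OLD_ORIGIN) {
            attrs[key] = NEW_BASE + parsed.pathname + parsed.search + parsed.hash;
        }
    }
    return '<!-- ' + match[1] + ' ' + JSON.stringify(attrs) + ' -->';
}

const rewritten = rewriteBlockComment(markup);
```

Here `naive` is identical to the input, while `rewritten` carries the new base URL. Adding the WXR layer on top would mean one more parse and re-encode step around this one.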

### Why build this in Playground?

Playground gives us a lot for free:

- **Customer-centric environment.** The need to move data around is so
natural in Playground. So many people asked for reliable WXR imports,
site exports, synchronization with git, and the ability to share their
Playground. Playground allows us to get active users and customer
feedback every step of the way.
- **Free QA**. Anyone can share a testing link and easily report any
problems they found. Playground is the perfect environment to get ample,
fast moving feedback.
- **Space to mature the API**. Playground doesn’t provide the same
backward compatibility guarantees as WordPress core. It's easy to
prototype a parser, find a use case where the design breaks down, and
start over.
- **Control over the runtime.** Playground can lean on PHP extensions to
validate our ideas, test them on simulated slow hardware, and ship
them to a tablet to see how they do when the app goes into background
and the internet is flaky.

Playground enables methodically building spec-compliant software to
create the solid foundation WordPress needs.

## The way there

### What needs to be built?

There's been a lot of [gathering information, ideas, and
tools](https://core.trac.wordpress.org/ticket/60375). This writeup is
based on 10 years' worth of site transfer problems, WordPress
synchronization plugins, chats with developers, analyzing existing
codebases, past attempts at data importing, non-WordPress tools,
discussions, and more.

WordPress needs parsers. Not just any parsers: they must be streaming,
re-entrant, fast, standards-compliant, and tested using a large body of
possible inputs. The data synchronization tools must account for data
conflicts, WordPress plugins, invalid inputs, and unexpected power
outages. The errors must be non-fatal, retryable, and allow manual
resolution by the user. No data loss, ever. The transfer target site
should be usable as early as possible and show no broken links or images
during the transfer. That's the gist of it.

A number of parsers have already been prototyped. There's even [a draft
of a reliable URL rewriting
library](https://github.com/adamziel/site-transfer-protocol). Here's a
bunch of early drafts of specific streaming use-cases:

- [A URL
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
- [A block markup
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
- [An XML
parser](WordPress/wordpress-develop#6713), also
explored by @dmsnell and @jonsurrell
- [A Zip archive
parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
- [A multihandle HTTP
client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php)
without curl dependency
- [A MySQL query
parser](WordPress/sqlite-database-integration#157)
started by @zieladam and now explored by @JanJakes
- [A stream chaining
API](adamziel/wxr-normalize#1) to connect all
these pieces

On top of that, WordPress core now has an HTML parser, and @dmsnell has
been exploring a
[UTF-8](WordPress/wordpress-develop#6883)
decoder that would enable fast and regex-less URL detection in long
data streams.

There are still technical challenges to figure out, such as how to pause
and resume the data streaming. As this work progresses, you'll start
seeing incremental improvements in Playground. One possible roadmap is
shipping a reliable content importer, then a reliable site zip importer
and exporter, then site cloning, and then extending towards
full-featured site transfers and synchronization.

### How soon can it be shipped?

Three points:

* No dates.
* Let's keep building on top of prior work and ship meaningful user
flows often.
* Let's not ship any stable public APIs until the design is mature.

For example, the [Try WordPress
extension](https://github.com/WordPress/try-wordpress/) can already give
you a Playground site, even if you cannot migrate it to another
WordPress site just yet.

**Shipping matters. At the same time, taking the time required to build
rigorous, reliable software is also important**. An occasional early
version of this or that parser may be shipped once its architecture
seems alright, but the architecture and the stable API won't be rushed.
That would jeopardize the entire project. This project aims for a solid
design that will serve WordPress for years.

The progress will be communicated in the open, while maintaining
feedback loops and using the work to ship new Playground features.

## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single
WordPress post_. Yes! Just one post. The tricky part is that all the
URLs will have to be preserved.

From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool.
For example: "paste a WordPress post, paste a new site URL, get your post
migrated".

There's an ample body of existing work. Let's keep the existing
codebases (e.g. WXR, site migration plugins) and discussions open in a
browser window during this work. Let's involve the authors of these
tools, ask them questions, ask them for reviews. Let's publish the
progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start,
stop, resume, tolerate errors, accept alternative data from the user,
e.g. media files, posts etc.
- **WordPress-first** – let's build everything in PHP using WordPress
naming conventions.
- **Compatibility** – Every WordPress version, PHP version (7.2+, CLI),
and Playground runtime (web, CLI, browser extension, desktop app, CI
etc.) should be supported.
- **Dependency-free** – No PHP extensions required. If this means we
can't rely on cURL, then let's build an HTTP client from scratch. Only
minimal Composer dependencies allowed, and only when absolutely
necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is
[WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/)
– a **single class** that can parse nearly all HTML. There's no "Node",
"Element", "Attribute" classes etc. Let's aim for the same here.
- **Extensibility** – Playground should be able to benefit from, say,
a WASM Markdown parser even if core WordPress cannot.
- **Reusability** – Each library should be framework-agnostic and usable
outside of WordPress. We should be able to use them in WordPress core,
WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools
like https://github.com/adamziel/playground-content-converters, and even
in Next.js via PHP.wasm.


### Prior art

Here are a few codebases that need to be reviewed at minimum, and brought
into this project at maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector:
WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers:
https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in JS ZIP parser):
#1799
- Local Zip file reader in PHP (seeks to central directory, seeks back
as needed):
https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – Markdown <-> Block markup WordPress plugin:
https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip
file:
https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and export
data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_:
https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of
Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and
non-WordPress sites to Playground:
https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue:
https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration
plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration
and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin
would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer
install` required
- Compatibility with different Playground runtimes (web, CLI) and
versions of WordPress and PHP

**Logical parts**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content
importers
- Third party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use Composer dependency graph to automatically resolve dependencies
between libraries and WordPress plugins
- or use WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies


cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame
@ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera
@swissspidy @eliot-akira @sirreal @obenland @rralian @ockham
@youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski
@palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap
@michalczaplinski @danluu
Projects
Status: Blocked
Development

Successfully merging this pull request may close these issues.

[Blueprints] Add a github:artifact resource type
1 participant