Explore a WP entity export iterator #2107

Closed
wants to merge 11 commits into from

Conversation

brandonpayton
Member

Motivation for the change, related issues

For data liberation, we want an API for streaming WP entities from a site. Then we can export WP entities to multiple targets without having to solve the same entity traversal problems in each exporter.

Related to #2106

Implementation details

TBD

Testing Instructions (or ideally a Blueprint)

TBD

@brandonpayton brandonpayton added [Type] Exploration An exploration that may or may not result in mergable code [Aspect] Data Liberation labels Dec 20, 2024
@brandonpayton brandonpayton self-assigned this Dec 20, 2024
@brandonpayton
Member Author

I've been reviewing WP tables and data structures to see what seems to make sense. This is a very rough outline that I plan to fill in and explore more tomorrow.

So far, the rough approach is to iterate over various entity iterators, most of which will be iterating over table rows with a condition like WHERE ID > previous_entity_ID ORDER BY ID. For the sake of performance we may select in chunks but still relay one entity at a time through the interface.
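
A minimal sketch of the chunked iteration described above, not the PR's actual code. The function name `iterate_rows_by_id` and the `$fetch_chunk` callback are hypothetical; the callback is assumed to return rows with an `ID` key, sorted ascending, where `ID` is greater than the given value. An in-memory array stands in for the table so the example runs without WordPress.

```php
<?php
/**
 * Hypothetical helper: fetches rows in chunks but yields one row at a time.
 *
 * $fetch_chunk( int $after_id, int $limit ): array is an assumed callback.
 */
function iterate_rows_by_id( callable $fetch_chunk, int $chunk_size = 100 ): Generator {
	$last_id = 0;
	do {
		$rows = $fetch_chunk( $last_id, $chunk_size );
		foreach ( $rows as $row ) {
			// Remember the last ID seen so the next chunk resumes after it.
			$last_id = (int) $row['ID'];
			yield $row;
		}
		// A chunk shorter than the limit means there are no more rows.
	} while ( count( $rows ) === $chunk_size );
}

// In-memory stand-in for a table. A real callback might run something like
// SELECT * FROM wp_posts WHERE ID > %d ORDER BY ID LIMIT %d.
$table = array();
foreach ( array( 1, 2, 5, 8, 13 ) as $id ) {
	$table[] = array( 'ID' => $id, 'post_title' => "Post $id" );
}
$fetch_chunk = function ( int $after_id, int $limit ) use ( $table ): array {
	$matches = array_filter( $table, fn( $row ) => $row['ID'] > $after_id );
	return array_slice( array_values( $matches ), 0, $limit );
};

$ids = array();
foreach ( iterate_rows_by_id( $fetch_chunk, 2 ) as $row ) {
	$ids[] = $row['ID'];
}
```

Resuming from the last seen ID rather than an offset keeps each query cheap on large tables and tolerates gaps in the ID sequence.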

One possibly controversial direction so far is that I am thinking it may make sense to directly convey terms, taxonomy, and term relationships separately rather than modeling categories and tags as first-class entities. Intuitively, it seems less fraught to just convey what is there and leave meaning-making to API consumers. Do you have any thoughts on this, @adamziel and @zaerl?

@brandonpayton
Member Author

Note: Some of the inline TODOs share some of my thinking.

@zaerl
Collaborator

zaerl commented Dec 20, 2024

One possibly controversial direction so far is that I am thinking it may make sense to directly convey terms, taxonomy, and term relationships separately rather than modeling categories and tags as first-class entities

This is okay, and it's not controversial at all. But remember that the tables may contain artifacts and are subject to special rules (for example, a post is sticky if is_sticky( $post->ID ) is true). The standard WXR exporter, export_wp, uses get_{categories|etc} functions that clean away data using the *_exists family of functions. The wp_term_relationships table exists only because terms have many-to-many relationships with posts, and it can contain stale relationships or relationships that refer to entities that no longer exist.

We should have two base XML exporters here and start with the simpler one.

  1. Having an exporter that copies the database by looping the rows, creating the XML one after another. Fast and not memory-hungry for obvious reasons.
  2. Another that reproduces 1:1 what core exports. A WXR file created by the core exporter is guaranteed to have a specific structure: site options, terms ordered by hierarchy, and then all the items.

For the sake of performance we may select in chunks but still relay one entity at a time through the interface.

Reading in steps of twenty posts is okay. For every post, you should read the following:

  1. The post meta(s)
  2. The comments
    • The comment meta(s)
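
The per-post reading order above could be sketched roughly as follows. This is an illustration only: `stream_post_entities` and the array shapes are hypothetical, and the in-memory arrays stand in for lookups against the posts, postmeta, comments, and commentmeta tables.

```php
<?php
/**
 * Hypothetical sketch: emits a post, then its meta, then each comment
 * followed by that comment's meta. Input shapes are assumptions:
 * meta and comments are keyed by the ID of the entity they belong to.
 */
function stream_post_entities( array $posts, array $post_meta, array $comments, array $comment_meta ): Generator {
	foreach ( $posts as $post ) {
		yield array( 'post', $post );
		foreach ( $post_meta[ $post['ID'] ] ?? array() as $meta ) {
			yield array( 'post_meta', $meta );
		}
		foreach ( $comments[ $post['ID'] ] ?? array() as $comment ) {
			yield array( 'comment', $comment );
			foreach ( $comment_meta[ $comment['comment_ID'] ] ?? array() as $meta ) {
				yield array( 'comment_meta', $meta );
			}
		}
	}
}

$posts        = array( array( 'ID' => 1 ), array( 'ID' => 2 ) );
$post_meta    = array( 1 => array( array( 'meta_key' => '_edit_last' ) ) );
$comments     = array( 1 => array( array( 'comment_ID' => 10 ) ) );
$comment_meta = array( 10 => array( array( 'meta_key' => 'rating' ) ) );

$types = array();
foreach ( stream_post_entities( $posts, $post_meta, $comments, $comment_meta ) as list( $type, $data ) ) {
	$types[] = $type;
}
```

Emitting dependents immediately after their parent means a consumer never sees a meta row or comment before the entity it refers to.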

@brandonpayton
Member Author

Thank you for your feedback on this, @zaerl! It's helpful.

Having an exporter that copies the database by looping the rows, creating the XML one after another. Fast and not memory-hungry for obvious reasons.

I am working on this first. Currently, there is just a dumb iteration over database rows starting with terms tables, but it's going to have to be a bit smarter than that (I think). And the above provides a good context / test case to see whether this is headed in a reasonable direction.

@zaerl
Collaborator

zaerl commented Dec 23, 2024

And the above provides a good context / test case to see whether this is headed in a reasonable direction.

Having both cases can be a good thing for us. The core export_wp function, for example, uses the can_export arg to check post types. The core function's code is good! It does a lot of low-level queries, and terms are the only entities it fetches with get_*-style functions, which it does to put categories in order so that no child comes before its parent. But this is no longer a problem with the importer.
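
For reference, the parent-before-child ordering mentioned above can be sketched with a breadth-first walk. This is not core's code; `order_terms_parent_first` is a hypothetical name, and the `term_id` and `parent` keys (with 0 meaning no parent) are assumed row shapes.

```php
<?php
/**
 * Hypothetical sketch: orders terms so each parent precedes its children.
 * Terms whose parent is missing from the input are treated as roots.
 * Terms caught in a parent cycle are never reached and are omitted.
 */
function order_terms_parent_first( array $terms ): array {
	$known     = array_column( $terms, 'term_id' );
	$by_parent = array();
	foreach ( $terms as $term ) {
		$parent                 = in_array( $term['parent'], $known, true ) ? $term['parent'] : 0;
		$by_parent[ $parent ][] = $term;
	}

	$ordered = array();
	$queue   = $by_parent[0] ?? array();
	while ( $queue ) {
		$term      = array_shift( $queue );
		$ordered[] = $term;
		// Children become eligible only after their parent has been emitted.
		foreach ( $by_parent[ $term['term_id'] ] ?? array() as $child ) {
			$queue[] = $child;
		}
	}
	return $ordered;
}

$terms = array(
	array( 'term_id' => 3, 'parent' => 2 ),
	array( 'term_id' => 2, 'parent' => 1 ),
	array( 'term_id' => 1, 'parent' => 0 ),
	array( 'term_id' => 4, 'parent' => 99 ), // Parent not in the input.
);
$ordered_ids = array_column( order_terms_parent_first( $terms ), 'term_id' );
```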

Starting with the DB SQL queries is perfectly fine, and you made the right choice. 👍

Comment on lines +26 to +27
// Intuition: Directly representing terms, taxonomies,
// and relationships will be more flexible at this level.
Collaborator

@adamziel adamziel Jan 20, 2025

Any way we slice it, let's make sure there's a 100% overlap between the entities we know how to export and the entities we know how to import.

});

// @TODO: Move to dedicated file
class WP_Export_Entity {
Collaborator

What do you think about using a single WP_Entity class for both export and import purposes?

Member Author

@brandonpayton brandonpayton Jan 22, 2025

@adamziel I want to acknowledge that we talked about this before, and I think that being able to use the same entity types for import and export is a good clue that we've arrived at a useful and/or appropriate level of abstraction.

In this case, I was exploring an intuition about having a raw, almost unmodeled export level. But after more time has passed, I'm not sure how valuable such a raw export is... maybe it would be useful as a fallback for export in case we have tables without better, custom export logic. At this raw DB level, we could have a basic record type if we wished, but this level is not the same as WP Entities.

I think we probably should target the same entity types as listed here:

const TYPE_POST = 'post';
const TYPE_POST_META = 'post_meta';
const TYPE_COMMENT = 'comment';
const TYPE_COMMENT_META = 'comment_meta';
const TYPE_TERM = 'term';
const TYPE_TAG = 'tag';
const TYPE_CATEGORY = 'category';
const TYPE_USER = 'user';
const TYPE_SITE_OPTION = 'site_option';

And I think we should probably have a "custom" entity type containing a subtype to explicitly denote entities with custom import/export logic. There could even be enforced subtype naming conventions based on the WordPress plugin/theme slug namespace to encourage unique subtype names.
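
A rough sketch of what such a "custom" entity might look like. Everything here is an assumption for illustration: the `TYPE_CUSTOM` constant, the `make_custom_entity` helper, the array shape, and the `<slug>/<name>` subtype format are not defined anywhere in this PR.

```php
<?php
// Assumed constant; it is not part of the entity type list above.
const TYPE_CUSTOM = 'custom';

/**
 * Hypothetical helper: builds a custom entity and enforces an assumed
 * subtype convention of "<plugin-or-theme-slug>/<entity-name>".
 */
function make_custom_entity( string $subtype, array $data ): array {
	if ( ! preg_match( '#^[a-z0-9-]+/[a-z0-9_-]+$#', $subtype ) ) {
		throw new InvalidArgumentException( "Invalid custom entity subtype: $subtype" );
	}
	return array(
		'type'    => TYPE_CUSTOM,
		'subtype' => $subtype,
		'data'    => $data,
	);
}

// "my-plugin" is a made-up slug used only for this example.
$entity = make_custom_entity( 'my-plugin/order', array( 'total' => 42 ) );
```

Prefixing the subtype with the plugin or theme slug is one way to keep subtype names unique across extensions.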

I'm planning on working on a better export iterator that speaks the same entity types. That will come under a separate PR.

}

// TODO: Maybe this can work for all primary key types
class WP_Entity_Iterator_For_Table_With_Incrementing_IDs implements Iterator {
Collaborator

Neat idea!

@brandonpayton
Member Author

Based on my comment above, I'm going to close this PR in favor of a better-modeled WP entity iterator. I'm planning to carry some of the DB row iteration forward in that PR if it makes sense.
