Add created_at to person table ORDER BY #29326

Open · timgl opened this issue Feb 27, 2025 · 1 comment
Labels: performance (Has to do with performance. For PRs, runs the clickhouse query performance suite)

Comments

timgl (Collaborator) commented Feb 27, 2025

We should add created_at to the person table ORDER BY, so it becomes

CREATE TABLE posthog.tim_test_created_at_person
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8 DEFAULT 0,
    `version` UInt64,
    `_timestamp` DateTime,
    `_offset` UInt64
)
ENGINE = ReplicatedReplacingMergeTree(...)
ORDER BY (team_id, id, created_at) 
SETTINGS index_granularity = 8192

The queries below show why. The main person page load would drop from ~15 seconds to under 1 second. Because created_at is appended to the end of the key, existing joins on (team_id, id) are unaffected (see the last query). We still do full table scans, but since rows no longer need to be re-sorted in memory, the queries are much faster.

There seems to be no meaningful difference in table size between (team_id, id, toDate(created_at)) and (team_id, id, created_at) (29.66 GB vs 29.94 GB below), and for the extra speed improvement I'd suggest the latter, though I don't understand ClickHouse well enough to be sure.
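
One way to sanity-check the size claim (a sketch; system.parts is the standard ClickHouse metadata table, and tim_test_created_at_person_todate is a hypothetical name for a test table built with the toDate variant of the key):

-- Compare on-disk size of the two candidate keys
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE database = 'posthog'
  AND table IN ('tim_test_created_at_person', 'tim_test_created_at_person_todate')
  AND active
GROUP BY table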

-- Original query
/* user_id:1 celery:posthog.tasks.tasks.process_query_task */ SELECT
    persons.id,
    persons.id AS id,
    persons.created_at AS created_at,
    1
FROM
    (SELECT
        argMax(toTimeZone(person.created_at, 'US/Pacific'), person.version) AS created_at,
        person.id AS id
    FROM
        person
    WHERE
        equals(person.team_id, 1589)
    GROUP BY
        person.id
    HAVING
        and(ifNull(equals(argMax(person.is_deleted, person.version), 0), 0), ifNull(less(argMax(toTimeZone(person.created_at, 'US/Pacific'), person.version), plus(now64(6, 'US/Pacific'), toIntervalDay(1))), 0))
    SETTINGS optimize_aggregation_in_order=1) AS persons
ORDER BY
    persons.created_at DESC
LIMIT 101
OFFSET 0 SETTINGS optimize_aggregation_in_order=1, join_algorithm='auto', readonly=2, max_execution_time=600, allow_experimental_object_type=1, format_csv_allow_double_quotes=0, max_ast_elements=4000000, max_expanded_ast_elements=4000000, max_bytes_before_external_group_by=0

-- 101 rows in set. Elapsed: 15.563 sec. Processed 1.18 billion rows, 48.45 GB (75.86 million rows/s., 3.11 GB/s.)
-- Peak memory usage: 2.06 GiB.

-- Optimized query
/* user_id:1 celery:posthog.tasks.tasks.process_query_task */ SELECT
    persons.id,
    persons.id AS id,
    persons.created_at AS created_at,
    properties
FROM
    (SELECT
        argMax(toTimeZone(person.created_at, 'UTC'), person.version) AS created_at,
        person.id AS id,
        argMax(person.properties, person.version) AS properties
    FROM
        person
    WHERE
        and(equals(person.team_id, 1589), in(id, (SELECT
                    where_optimization.id AS id
                FROM
                    person AS where_optimization
                WHERE
                    and(equals(where_optimization.team_id, 1589), notIn(where_optimization.id, (SELECT
                                limit_delete_optimization.id AS id
                            FROM
                                person AS limit_delete_optimization
                            WHERE
                                and(equals(limit_delete_optimization.team_id, 1589), equals(limit_delete_optimization.is_deleted, 1)))))
                ORDER BY
                    toTimeZone(where_optimization.created_at, 'UTC') DESC
                LIMIT 1 BY where_optimization.id
                LIMIT 101
                OFFSET 0)))
    GROUP BY
        person.id
    HAVING
        and(ifNull(equals(argMax(person.is_deleted, person.version), 0), 0), ifNull(less(argMax(toTimeZone(person.created_at, 'UTC'), person.version), plus(now64(6, 'UTC'), toIntervalDay(1))), 0))) AS persons
ORDER BY
    persons.created_at DESC
LIMIT 101 SETTINGS optimize_aggregation_in_order=1, join_algorithm='auto', readonly=2, max_execution_time=600, allow_experimental_object_type=1, format_csv_allow_double_quotes=0, max_ast_elements=4000000, max_expanded_ast_elements=4000000, max_bytes_before_external_group_by=0
-- 101 rows in set. Elapsed: 7.587 sec. Processed 2.37 billion rows, 50.30 GB (312.29 million rows/s., 6.63 GB/s.)
-- Peak memory usage: 45.15 GiB.

-- create new table with new order by key
CREATE TABLE posthog.tim_test_created_at_person
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8 DEFAULT 0,
    `version` UInt64,
    `_timestamp` DateTime,
    `_offset` UInt64
)
ENGINE = ReplacingMergeTree(version)
-- ORDER BY (team_id, id, toDate(created_at)) -- 29.66GB
ORDER BY (team_id, id, created_at) -- 29.94GB
SETTINGS index_granularity = 8192;

INSERT INTO posthog.tim_test_created_at_person
SELECT id, created_at, team_id, '', is_identified, is_deleted, version, _timestamp, _offset
FROM person
WHERE team_id = 1589
SETTINGS max_execution_time = 600;

-- Optimized query with new order by key (note the much lower memory usage)
SELECT
    where_optimization.id AS id
FROM
    tim_test_created_at_person AS where_optimization
    --person AS where_optimization
WHERE
    and(
        equals(where_optimization.team_id, 1589),
        notIn(
            where_optimization.id,
            (
                SELECT
                    limit_delete_optimization.id AS id
                FROM
                    person AS limit_delete_optimization
                WHERE
                    and(
                        equals(limit_delete_optimization.team_id, 1589),
                        equals(limit_delete_optimization.is_deleted, 1)
                    )
            )
        ),
        equals(toStartOfMonth(created_at), toStartOfMonth(now64(6, 'UTC')))
    )

ORDER BY
    toDate(where_optimization.created_at) DESC, where_optimization.created_at DESC
LIMIT
    1 BY where_optimization.id
LIMIT
    101 OFFSET 0
-- with toDate(created_at)
-- 101 rows in set. Elapsed: 0.901 sec. Processed 2.17 billion rows, 42.63 GB (2.40 billion rows/s., 47.33 GB/s.)
-- Peak memory usage: 3.29 GiB.

-- with created_at
-- 101 rows in set. Elapsed: 0.852 sec. Processed 2.16 billion rows, 42.59 GB (2.54 billion rows/s., 50.00 GB/s.)
-- Peak memory usage: 3.32 GiB.

-- join comparison
SELECT
    count(DISTINCT e__person.id)
FROM
    (SELECT person_id FROM events WHERE event = '$pageview' AND timestamp > '2025-02-27 00:00:00' AND timestamp < '2025-02-27 00:00:10' AND team_id = 1589) AS e
LEFT JOIN (
    SELECT
        person.id AS id
    FROM
        -- person
        tim_test_created_at_person as person
    WHERE
        and(
            equals(person.team_id, 1589),
            ifNull(
                in(
                    tuple(person.id, person.version),
                    (
                        SELECT
                            person.id AS id,
                            max(person.version) AS version
                        FROM
                            -- person
                            tim_test_created_at_person as person
                        WHERE
                            equals(person.team_id, 1589)
                        GROUP BY
                            person.id
                        HAVING
                            and(
                                ifNull(
                                    equals(argMax(person.is_deleted, person.version), 0),
                                    0
                                ),
                                ifNull(
                                    less(
                                        argMax(
                                            toTimeZone(person.created_at, 'Asia/Dubai'),
                                            person.version
                                        ),
                                        plus(now64(6, 'Asia/Dubai'), toIntervalDay(1))
                                    ),
                                    0
                                )
                            )
                        LIMIT 10000000
                    )
                ),
                0
            )
        ) SETTINGS optimize_aggregation_in_order = 1
) AS e__person ON equals(
    e.person_id,
    e__person.id
)
-- 1 row in set. Elapsed: 1.662 sec. Processed 341.93 million rows, 13.65 GB (205.79 million rows/s., 8.21 GB/s.)
-- Peak memory usage: 2.56 GiB. 
timgl added the performance label on Feb 27, 2025
timgl (Collaborator, Author) commented Feb 27, 2025

Operations to get this done

Based on https://kb.altinity.com/altinity-kb-schema-design/change-order-by/

Pre-start checklist

  • Overprovisioned disk space (a quick check is sketched below)
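
A sketch of the space check, using ClickHouse's system.disks table (how much headroom counts as "enough" is a judgment call; roughly the person table's on-disk size plus margin):

-- Confirm free space comfortably exceeds the size of the person table
SELECT
    name,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks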

Step 1
We'll need to do this separately in EU and US, as they have different materialized views.

SHOW CREATE TABLE person FORMAT Vertical

Use the output to generate the new person_rekey table, with the ORDER BY key updated from (team_id, id) to (team_id, id, created_at).
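
Something along these lines (a sketch only; the exact column list, engine arguments, and replication paths should be copied from the SHOW CREATE TABLE output above):

CREATE TABLE posthog.person_rekey
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8 DEFAULT 0,
    `version` UInt64,
    `_timestamp` DateTime,
    `_offset` UInt64
)
ENGINE = ReplicatedReplacingMergeTree(...) -- copy the exact arguments from SHOW CREATE TABLE
ORDER BY (team_id, id, created_at) -- the only intended change
SETTINGS index_granularity = 8192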

Step 2

CREATE TABLE posthog.kafka_person_rekey
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8,
    `version` UInt64
)
ENGINE = Kafka('b-4.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094,b-2.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094,b-1.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094', 'clickhouse_person_rekey', 'group1', 'JSONEachRow');

CREATE MATERIALIZED VIEW posthog.person_rekey_mv TO posthog.person_rekey
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8,
    `version` UInt64,
    `_timestamp` Nullable(DateTime),
    `_offset` UInt64
)
AS SELECT
    id,
    created_at,
    team_id,
    properties,
    is_identified,
    is_deleted,
    version,
    _timestamp,
    _offset
FROM posthog.kafka_person_rekey;

Step 3
In a tmux window, execute the following from a local clickhouse-client:

INSERT INTO person_rekey SELECT * FROM person SETTINGS max_execution_time = 60000000

If this fails, we will need to break the INSERT ... SELECT into smaller chunks, possibly by team or by a modulo of team_id, as sketched below.
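
For example (a sketch; the chunk count of 16 is arbitrary, and each statement would be run to completion before the next):

-- Copy in 16 chunks keyed on team_id modulo
INSERT INTO person_rekey
SELECT * FROM person
WHERE team_id % 16 = 0 -- repeat for remainders 1 through 15
SETTINGS max_execution_time = 60000000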

Step 4

DROP TABLE posthog.person_rekey_mv ON CLUSTER posthog_migrations;
DROP TABLE posthog.person_mv ON CLUSTER posthog_migrations;
DROP TABLE posthog.kafka_person_rekey ON CLUSTER posthog_migrations;
DROP TABLE posthog.kafka_person ON CLUSTER posthog_migrations;

EXCHANGE TABLES person AND person_rekey ON CLUSTER posthog_migrations;

CREATE TABLE posthog.kafka_person
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8,
    `version` UInt64
)
ENGINE = Kafka('b-4.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094,b-2.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094,b-1.posthogprodeuevents.dkk5lp.c3.kafka.eu-central-1.amazonaws.com:9094', 'clickhouse_person_rekey', 'group1', 'JSONEachRow');

CREATE MATERIALIZED VIEW posthog.person_mv TO posthog.person
(
    `id` UUID,
    `created_at` DateTime64(3),
    `team_id` Int64,
    `properties` String,
    `is_identified` Int8,
    `is_deleted` Int8,
    `version` UInt64,
    `_timestamp` Nullable(DateTime),
    `_offset` UInt64
)
AS SELECT
    id,
    created_at,
    team_id,
    properties,
    is_identified,
    is_deleted,
    version,
    _timestamp,
    _offset
FROM posthog.kafka_person;

Step 5
Verify everything looks good ✅
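
A couple of sanity checks worth running first (a sketch; assuming the EXCHANGE in step 4 left the old data under the person_rekey name):

-- Row counts should be in the same ballpark, modulo rows ingested since the swap
SELECT count() FROM person;
SELECT count() FROM person_rekey;

-- Confirm new writes are flowing into person via the recreated materialized view
SELECT max(_timestamp) FROM person;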

DROP TABLE person_rekey ON CLUSTER posthog_migrations; -- after the EXCHANGE in step 4, the old data lives under the person_rekey name

fuziontech self-assigned this Feb 27, 2025