[Bug]: TimescaleDB corrupt catalog tables #3874

markrui3 · 2021-12-01T03:04:02Z

What type of bug is this?

Data corruption

What subsystems and features are affected?

Data node

What happened?

#2680 reported catalog table corruption, and commit 3a8f396bc7d014a2722b8113236b34b0eda73eec fixed it. However, that fix only covers the case where timescaledb is preloaded.
When I create the timescaledb extension and run the script below, the catalog tables are corrupted again.
We believe the corruption happens in parallel workers.
The call chain is:

ParallelWorker

-> InvalidateSystemCaches
    -> RelationCacheInvalidate
        -> cache_invalidate_callback
            -> ts_catalog_get
                -> get_namespace_oid
                    -> GetSysCacheOid1 (this step repopulates the system cache)
                    -> other functions

When VACUUM FULL runs, the parallel workers do the actual work. If the call chain above repopulates the cache during invalidation, the workers act on an incorrect cache and end up deleting catalog data.

TimescaleDB version affected

2.5.0

PostgreSQL version used

13.5

What operating system did you use?

RedHat 7u2 3.10.0-327

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

#!/bin/bash
set +e
exitcode=0
datapath=invalidcache_data
logpath=invalidcache_log

export PGDATABASE=bug2680
export PGPORT=12345
rm -rf $datapath $logpath
initdb -D $datapath
cp test/postgresql.conf $datapath/
pg_ctl -D $datapath -l $logpath start

columnnum=1000
partnum=100
tablenum=10

for j in $(seq 0 1)
do
  psql -c "drop database if exists $PGDATABASE" postgres
  psql -c "create database $PGDATABASE" postgres
  if [[ "$j" = "1" ]]; then
    psql -c "create extension if not exists timescaledb"
  fi
  columndef=""
  for i in $(seq 0 $((columnnum - 1)))
  do
    columndef="$columndef,c$i int"
  done

  echo "create tables"
  for i in $(seq 0 $((tablenum - 1)))
  do
    psql -c "drop table if exists userinfo$i" > /dev/null 2>&1
    psql -c "CREATE TABLE userinfo$i(id int$columndef) PARTITION BY HASH(id);" > /dev/null 2>&1
    for k in $(seq 0 $((partnum - 1)))
    do
      psql -c "CREATE TABLE userinfo${i}_$k PARTITION of userinfo$i FOR VALUES WITH (MODULUS $partnum, REMAINDER $k);" > /dev/null 2>&1
    done
    wait `(jobs -p)`
  done

  for i in $(seq 0 1)
  do
    psql -c "vacuum full analyze pg_attribute;" && psql -c "vacuum full;"
    if [[ "$?" -ne "0" ]];then
      exitcode=1
      break
    fi
  done

  if [[ "$exitcode" = "1" ]]; then
    break
  fi
done

if [[ "$1" = "check" ]];then
  pg_ctl -D $datapath stop
fi

exit $exitcode

markrui3 · 2021-12-01T03:18:15Z

@erimatnor

afiskon · 2021-12-02T13:21:12Z

@markrui3 Many thanks for the bug report.

I executed the script against PG 13.5 and the master branch of TS on macOS, and it terminated with exit code 0, indicating that the bug was not reproduced. Then I noticed that something was wrong with the script: it doesn't add shared_preload_libraries = 'timescaledb' to postgresql.conf. I fixed this and repeated the test.
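The exact change made to the script is not shown in this thread. A minimal sketch of one way to add the setting, assuming the config file is the one the script copies to $datapath/postgresql.conf (the function name ensure_timescaledb_preload is hypothetical, not part of the original script):

```shell
# Hypothetical helper: make sure a postgresql.conf preloads timescaledb.
# The function name and the "already present" check are assumptions for
# illustration; the original script contains nothing like this.
ensure_timescaledb_preload() {
  local conf="$1"
  # Do nothing if an uncommented shared_preload_libraries line already
  # mentions timescaledb.
  if grep -Eq "^[[:space:]]*shared_preload_libraries[[:space:]]*=.*timescaledb" "$conf"; then
    return 0
  fi
  # Otherwise append the setting at the end of the file.
  echo "shared_preload_libraries = 'timescaledb'" >> "$conf"
}
```

It would be called as `ensure_timescaledb_preload "$datapath/postgresql.conf"` after the cp step and before pg_ctl start. Because PostgreSQL honors the last entry when a parameter appears more than once, the appended line would override an earlier shared_preload_libraries line that lists other libraries.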

This time I got:

$ sh bug2680.sh
create tables
Timing is on.
Time: 8.573 ms
VACUUM
Time: 4181.688 ms (00:04.182)
Timing is on.
Time: 7.444 ms
ERROR:  invalid attribute number -1 for userinfo0_25
Time: 79.550 ms
$ echo $?
1

$ psql
51355 (master) =# select * from userinfo0_25;
ERROR:  invalid attribute number -1 for userinfo0_25
LINE 1: select * from userinfo0_25;
                      ^
Time: 0.818 ms

It can fail differently, e.g.:

$ sh bug2680.sh
create tables
Timing is on.
Time: 7.413 ms
VACUUM
Time: 5292.855 ms (00:05.293)
Timing is on.
Time: 7.367 ms
ERROR:  catalog is missing 6 attribute(s) for relid 36818
Time: 0.939 ms

$ ./restart.sh
$ psql
86489 (master) =# select * from pg_attribute;
ERROR:  catalog is missing 6 attribute(s) for relid 36818
Time: 0.389 ms

# ^ restart doesn't help

I repeated the test, this time without TS installed, and couldn't reproduce the bug. I repeated the entire experiment several times to rule out the possibility that the problem reproduces only intermittently.
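When repeating the run many times, the two failure messages shown above can be detected mechanically instead of by eye. The following is only a sketch: the helper name has_catalog_corruption is hypothetical, and the patterns cover just the two errors quoted in this thread, so other failure modes would go unnoticed.

```shell
# Hypothetical helper: succeed if the text on stdin contains one of the two
# catalog-corruption errors quoted in this thread.
has_catalog_corruption() {
  grep -Eq "ERROR: +(invalid attribute number -?[0-9]+ for |catalog is missing [0-9]+ attribute\(s\) for relid )"
}
```

A possible use, sending only psql's error output into the check:

```shell
if psql -c "select * from pg_attribute;" 2>&1 >/dev/null | has_catalog_corruption; then
  echo "catalog corruption detected"
fi
```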

I can confirm there is a bug in TS and that it reproduces reliably. I also discovered that it reproduces against PG 14.1. Now I'm going to check the proposed fix, #3875.

nikkhils · 2021-12-16T08:14:28Z

Doing catalog queries inside a cache invalidation callback is a very problematic approach. We will need to rethink this entire logic in the cache_invalidate_callback function.

nikkhils · 2021-12-20T14:36:58Z

Closing this as a duplicate of #3924

markrui3 added the bug and triage labels Dec 1, 2021

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 1, 2021: 00d593e, "Fix timescale#3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL"

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 1, 2021: f8488de, "Fix timescale#3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL"

markrui3 mentioned this issue Dec 1, 2021 in #3875, "Fix #3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL" (Closed)

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 1, 2021: 0ade836, "Fix timescale#3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL."

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 1, 2021: ccd78b7, "Fix timescale#3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL"

NunoFilipeSantos added the 2.5.0 and pg13 labels and removed the triage label Dec 1, 2021

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 2, 2021: 735d430, "Fix timescale#3874 invalidate cache in parallel worker to avoid data corruption with VACUUM FULL"

afiskon self-assigned this Dec 2, 2021

afiskon removed their assignment Dec 3, 2021

nikkhils assigned nikkhils and unassigned nikkhils Dec 16, 2021

nikkhils closed this as completed Dec 20, 2021
