[Bug]: TimescaleDB corrupt catalog tables #3874

Closed
markrui3 opened this issue Dec 1, 2021 · 4 comments

Comments

markrui3 commented Dec 1, 2021

What type of bug is this?

Data corruption

What subsystems and features are affected?

Data node

What happened?

#2680 reported a catalog table corruption bug, and commit 3a8f396bc7d014a2722b8113236b34b0eda73eec fixed it. However, that fix only covers the case where timescaledb is preloaded.
When I create the timescaledb extension and run the script below, the catalog tables get corrupted again.
We believe the corruption happens in parallel workers.
The call chain looks like this:

ParallelWorker
-> InvalidateSystemCaches
    -> RelationCacheInvalidate
        -> cache_invalidate_callback
            -> ts_catalog_get
                -> get_namespace_oid
                    -> GetSysCacheOid1 (this step rebuilds the system cache)
                    -> other functions

When VACUUM FULL runs, parallel workers do the actual work. If the call chain above rebuilds the cache, the parallel workers end up deleting catalog entries because they rely on the incorrect cache.

TimescaleDB version affected

2.5.0

PostgreSQL version used

13.5

What operating system did you use?

RedHat 7u2 3.10.0-327

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

#!/bin/bash
set +e
exitcode=0
datapath=invalidcache_data
logpath=invalidcache_log

export PGDATABASE=bug2680
export PGPORT=12345
rm -rf $datapath $logpath
initdb -D $datapath
cp test/postgresql.conf $datapath/
pg_ctl -D $datapath -l $logpath start

columnnum=1000
partnum=100
tablenum=10

for j in $(seq 0 1)
do
  psql -c "drop database if exists $PGDATABASE" postgres
  psql -c "create database $PGDATABASE" postgres
  if [[ "$j" = "1" ]]; then
    psql -c "create extension if not exists timescaledb"
  fi
  columndef=""
  for i in $(seq 0 $((columnnum-1)))
  do
    columndef="$columndef,c$i int"
  done

  echo "create tables"
  for i in $(seq 0 $((tablenum-1)))
  do
    psql -c "drop table if exists userinfo$i" > /dev/null 2>&1
    psql -c "CREATE TABLE userinfo$i(id int$columndef) PARTITION BY HASH(id);" > /dev/null 2>&1
    for k in $(seq 0 $((partnum-1)))
    do
      psql -c "CREATE TABLE userinfo${i}_$k PARTITION of userinfo$i FOR VALUES WITH (MODULUS $partnum, REMAINDER $k);" > /dev/null 2>&1
    done
    wait `(jobs -p)`
  done

  for i in $(seq 0 1)
  do
    psql -c "vacuum full analyze pg_attribute;" && psql -c "vacuum full;"
    if [[ "$?" -ne "0" ]];then
      exitcode=1
      break
    fi
  done

  if [[ "$exitcode" = "1" ]]; then
    break
  fi
done

if [[ "$1" = "check" ]];then
  pg_ctl -D $datapath stop
fi

exit $exitcode

markrui3 commented Dec 1, 2021

@erimatnor

markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 1, 2021
markrui3 added a commit to markrui3/timescaledb that referenced this issue Dec 2, 2021
@afiskon afiskon self-assigned this Dec 2, 2021

afiskon commented Dec 2, 2021

@markrui3 Many thanks for the bug report.

I executed the script against PG 13.5 and the master branch of TS on macOS, and it terminated with exit code 0, indicating that the bug was not reproduced. Then I noticed a problem with the script: it doesn't add shared_preload_libraries = 'timescaledb' to postgresql.conf. I fixed this and repeated the test.
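
For readers rerunning the script, here is a minimal sketch of one way to add that setting. The helper name `enable_timescaledb_preload` is hypothetical and not part of the original script; only the `shared_preload_libraries = 'timescaledb'` line itself comes from this thread.

```shell
# Hypothetical helper (not from the original script): make sure the given
# postgresql.conf preloads timescaledb.
enable_timescaledb_preload() {
  local conf="$1"
  # Skip the append when an uncommented shared_preload_libraries line
  # already mentions timescaledb.
  if ! grep -Eq '^[[:space:]]*shared_preload_libraries[[:space:]]*=.*timescaledb' "$conf"; then
    # Assumption: no other library needs preloading. When a setting appears
    # more than once in postgresql.conf, the last occurrence wins, so this
    # line would override an earlier shared_preload_libraries entry.
    echo "shared_preload_libraries = 'timescaledb'" >> "$conf"
  fi
}
```

In the reproduction script this could be called as `enable_timescaledb_preload "$datapath/postgresql.conf"` after the `cp` step and before `pg_ctl ... start`, since the setting only takes effect when the server starts.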

This time I got:

$ sh bug2680.sh
create tables
Timing is on.
Time: 8.573 ms
VACUUM
Time: 4181.688 ms (00:04.182)
Timing is on.
Time: 7.444 ms
ERROR:  invalid attribute number -1 for userinfo0_25
Time: 79.550 ms
$ echo $?
1

$ psql
51355 (master) =# select * from userinfo0_25;
ERROR:  invalid attribute number -1 for userinfo0_25
LINE 1: select * from userinfo0_25;
                      ^
Time: 0.818 ms

It can fail differently, e.g.:

$ sh bug2680.sh
create tables
Timing is on.
Time: 7.413 ms
VACUUM
Time: 5292.855 ms (00:05.293)
Timing is on.
Time: 7.367 ms
ERROR:  catalog is missing 6 attribute(s) for relid 36818
Time: 0.939 ms

$ ./restart.sh
$ psql
86489 (master) =# select * from pg_attribute;
ERROR:  catalog is missing 6 attribute(s) for relid 36818
Time: 0.389 ms

# ^ restart doesn't help

I then repeated the test without TS installed and could not reproduce the bug. I repeated the entire experiment several times to rule out the possibility that the problem reproduces only intermittently.
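
Repeating the run by hand is tedious, so a small retry loop can help. The following is only a sketch: `repeat_until_failure` is a hypothetical helper, and the script name `bug2680.sh` is taken from the transcripts above.

```shell
# Hypothetical helper: run a command up to N times and stop at the first
# failing run, reporting which attempt failed.
repeat_until_failure() {
  local attempts="$1"
  shift
  local n
  for n in $(seq 1 "$attempts"); do
    # "$@" is the command to repeat, e.g. "sh bug2680.sh".
    "$@" || { echo "failed on attempt $n"; return 1; }
  done
  echo "no failure in $attempts attempts"
}
```

For example, `repeat_until_failure 5 sh bug2680.sh` would run the reproduction script up to five times and return non-zero as soon as one run fails.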

I can confirm that there is a bug in TS and that it reproduces reliably. I also found that it reproduces against PG 14.1. Next, I'm going to check the proposed fix, #3875.
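
The two error messages in the transcripts above can serve as a quick corruption check when rerunning the test. This is only a sketch: `has_catalog_corruption_error` is a hypothetical helper, and the pattern covers just the two messages quoted in this thread, not every way the catalog could break.

```shell
# Hypothetical helper: read text on stdin and succeed if it contains one of
# the two error signatures quoted in this thread.
has_catalog_corruption_error() {
  grep -Eq 'ERROR:[[:space:]]+(invalid attribute number -?[0-9]+ for |catalog is missing [0-9]+ attribute\(s\) for relid [0-9]+)'
}
```

A possible use is `psql -c "select * from pg_attribute" 2>&1 | has_catalog_corruption_error && echo "catalog corruption detected"`.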

@afiskon afiskon removed their assignment Dec 3, 2021
@nikkhils

Doing catalog queries inside a cache invalidation callback is a very problematic approach. We will need to rethink this entire logic in the cache_invalidate_callback function.

@nikkhils nikkhils assigned nikkhils and unassigned nikkhils Dec 16, 2021

nikkhils commented Dec 20, 2021

Closing this as a duplicate of #3924
