
[FLINK-34467] bump flink version to 1.20.0 #111

Merged
merged 2 commits into apache:main on Oct 14, 2024

Conversation

HuangZhenQiu
Contributor

@HuangZhenQiu HuangZhenQiu commented Aug 11, 2024

Update the Flink version to 1.20.0 to prepare for supporting lineage integration.


boring-cyborg bot commented Aug 11, 2024

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

@HuangZhenQiu HuangZhenQiu force-pushed the FLINK-34466-bump-flink-version branch from b9d644b to d07c47a on August 11, 2024 02:06
@HuangZhenQiu HuangZhenQiu changed the title from [FLINK-34467] bump flink version to 1.19.0 to [FLINK-34467] bump flink version to 1.20.0 on Aug 11, 2024
@uce uce requested review from AHeise and fapaul August 12, 2024 05:50
@AHeise
Contributor

AHeise commented Aug 22, 2024

What's the purpose of bumping the version?

If we don't bump, the connector can be easily used in Flink 1.17 .. 1.20 (Maven will choose the highest version automatically depending on the other flink dependencies).

If we bump, the connector cannot be easily used with 1.17 .. 1.19 (you'd need to explicitly exclude the dependencies or use dependency management).

Usually, we bump if the connector depends on specific APIs of the respective Flink version. You mentioned lineage, but I'm missing context.

@HuangZhenQiu
Contributor Author

@AHeise
Thanks for the context. We need to use the LineageVertexProvider interface in connectors to support the native lineage integration within Flink. https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/lineage/LineageVertexProvider.java

The interface was released in Flink 1.20. I saw that other connectors, for example Cassandra and JDBC (I also need to bump the Flink version for them), have been using Flink 1.19. Could we use a similar approach for the dependency management?

@AHeise
Contributor

AHeise commented Aug 23, 2024

The interface was released in Flink 1.20. I saw that other connectors, for example Cassandra and JDBC (I also need to bump the Flink version for them), have been using Flink 1.19. Could we use a similar approach for the dependency management?

The way this interface is designed will always result in a breaking change on the connector side. Any connector that uses the interface will need to release a specific version just for Flink 1.20.

From the interface description and usage, I'm inferring that we need to do:

class KafkaSource ... implements LineageVertexProvider

Because it's designed to be a base interface, trying to load this jar in 1.19 will cause ClassLoader exceptions. It's similar to adding new abstract methods to the Source class directly and implementing them in KafkaSource.
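To make the class-loading concern concrete, here is a minimal sketch (an assumption about the code shape, not the actual KafkaSource or the final lineage API) of what implementing the interface directly would look like. Linking such a class forces the JVM to resolve LineageVertexProvider, even if the lineage method is never called:

// Hypothetical sketch only; the real KafkaSource also implements Source and more.
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.api.lineage.LineageVertexProvider;

public class KafkaSource<OUT> implements LineageVertexProvider {

    @Override
    public LineageVertex getLineageVertex() {
        // Placeholder: a real implementation would describe the consumed Kafka topics.
        return null;
    }
}

On Flink 1.19, where the interface class does not exist, loading KafkaSource would typically fail with a NoClassDefFoundError, regardless of whether lineage is ever requested.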

We have two options:

  • Bump to 1.20 as suggested. That means that new features and bugfixes in the Kafka connector would not be directly available for users of 1.19. We could add a feature branch for maintaining older releases. However, that would cause at least double the number of connector releases until we phase out 1.19. @dannycranmer's experience is that even getting a single release out regularly is quite an effort because we have too few PMC members involved in the connector ecosystem. So very likely we would effectively stop developing the connector for 1.19 unless something critical pops up.
  • Extend the interfaces to avoid locking onto 1.20. We could achieve full backward compatibility by using something like
class KafkaSourceLineageVertexProvider implements ExternalLineageVertexProvider<KafkaSource> {
  public LineageVertex getLineageVertex(KafkaSource source) { /* ... */ }
}

and find the respective implementations through class loader scans or using SPI. Loading this class in 1.19 would fail but there is also no need to load it. We would also need to adjust TableLineageUtils to look for these external providers.

In general, since this is a Table API feature, it also feels more natural to extend SourceProvider instead of Source similarly to ParallelismProvider. This would solve the issue for DataStream users as well but it's too complicated to explain that Table API users wouldn't be able to use the new jar with 1.19 🙃 .
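As a rough illustration of the second option, the hypothetical ExternalLineageVertexProvider sketched above (not an existing Flink API) could be discovered through the standard Java ServiceLoader, so provider classes are only loaded by a 1.20+ runtime that actually asks for lineage. The sourceClass() method and the utility class below are assumptions added for the sketch:

// --- ExternalLineageVertexProvider.java (hypothetical) ---
import org.apache.flink.streaming.api.lineage.LineageVertex;

public interface ExternalLineageVertexProvider<T> {
    // The source type this provider knows how to describe (assumed addition for matching).
    Class<T> sourceClass();

    LineageVertex getLineageVertex(T source);
}

// --- ExternalLineageUtil.java (hypothetical lookup, e.g. called from TableLineageUtils) ---
import java.util.Optional;
import java.util.ServiceLoader;
import org.apache.flink.streaming.api.lineage.LineageVertex;

public final class ExternalLineageUtil {

    private ExternalLineageUtil() {}

    @SuppressWarnings({"rawtypes", "unchecked"})
    public static Optional<LineageVertex> lookupLineageVertex(Object source) {
        // Providers register themselves via a META-INF/services entry for the interface.
        for (ExternalLineageVertexProvider provider :
                ServiceLoader.load(ExternalLineageVertexProvider.class)) {
            if (provider.sourceClass().isInstance(source)) {
                return Optional.of(provider.getLineageVertex(source));
            }
        }
        return Optional.empty();
    }
}

This is just one way the "external provider" idea could be wired up; the actual design would need a proper discussion on the Flink side.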

@taiziwang

I hope the community can support Flink version 1.16, due to the issue FLINK-28303.

@AHeise
Contributor

AHeise commented Aug 23, 2024

I hope the community can support Flink version 1.16, due to the issue FLINK-28303.

I don't think this is related to the discussion. According to the ticket, the bug has been solved in kafka-3.0.1. I'm guessing you are running into the issue that I sketched above: we only released https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka/3.0.1-1.17 and you'd like to bundle that with Flink 1.16. As mentioned above, you can try to override the Flink version to 1.16 (forced downgrade) and check it out. If there are no Flink 1.17 specific APIs being used, it should just work.

Note that I'd like to move away from releases targeting specific Flink versions. It's currently buggy anyway, as https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka/3.0.1-1.18 is also linking to 1.17 by default. We just need a matrix saying

              Flink 1.17   Flink 1.18   Flink 1.19     Flink 1.20
Kafka 3.0.1   x            x            x (untested)   x (untested)
Kafka 3.1.1   x            x            x              x (untested)
Kafka 3.2.0   -            x            x              x

But I'd like to defer that discussion to somewhere else and keep focusing on making compilation possible on Flink 1.20 with the 1.20 API while still being able to run it on Flink 1.18.

@HuangZhenQiu
Contributor Author

@AHeise
Thanks for the reply. I totally understand the pain points of maintaining compatibility with multiple Flink versions for a connector. In each Flink release, there are always some new experimental interfaces introduced in the API or runtime. Shall we consider the solution from Apache Hudi or Apache Iceberg? Both of them use separate modules for different Flink versions, and some classes are replicated into different modules as needed.
https://github.com/apache/hudi/tree/master/hudi-flink-datasource

@HuangZhenQiu
Contributor Author

I think we can drop support for Flink 1.17 and Flink 1.18 first, in PR #102.

@AHeise
Contributor

AHeise commented Aug 26, 2024

@AHeise Thanks for the reply. I totally understand the pain points of maintaining compatibility with multiple Flink versions for a connector. In each Flink release, there are always some new experimental interfaces introduced in the API or runtime. Shall we consider the solution from Apache Hudi or Apache Iceberg? Both of them use separate modules for different Flink versions, and some classes are replicated into different modules as needed. https://github.com/apache/hudi/tree/master/hudi-flink-datasource

This sounds like a maintenance nightmare. Having separate branches sounds much better to me. Hudi needs to do it because they have their own release cycle independent of Flink. But the connector can correlate the release cycle to the Flink version, so we don't need to resort to such hackery.

@AHeise
Contributor

AHeise commented Aug 26, 2024

Thanks for the reply. I totally understand the pain points of maintaining compatibility with multiple Flink versions for a connector. In each Flink release, there are always some new experimental interfaces introduced in the API or runtime.

We broke connectors quite a bit in the past with these new APIs or changes to existing APIs (that's also why Hudi has different modules). I think we just need to craft our API extensions more carefully, as outlined above. An API that can be used easily internally may not be suited for external repos.

In theory, we could post-process compiled classes to remove usages of 1.20 classes for the 1.19 release, but that's quite involved and probably hard to understand (too much magic).

For now, I propose the following. We try to get as many other bugfixes in as possible. Then we do a release for 1.19 (I'm going to pick up the PR that you linked to get rid of 1.17 and 1.18). Then we merge this PR, add lineage support, and create a new minor (or even major) release. If we need to patch 1.19, we fork from the 1.19 minor and do a patch release.

@HuangZhenQiu
Contributor Author

HuangZhenQiu commented Aug 26, 2024

@AHeise
Thanks. It is a very reasonable plan! Let me know if there is anything I can help with before merging the PR.

@pawel-big-lebowski
Contributor

pawel-big-lebowski commented Sep 4, 2024

@AHeise How much time do you think is needed to accomplish this? Days, weeks, or perhaps months? In case of a longer period, would it make sense to think of some temporary workarounds?

A hacky way would be to copy the interfaces from https://github.com/apache/flink/pull/23626/files into the Kafka connector codebase and exclude them from the built jar. This would work because an interface definition in Java is needed only when interface methods are called, and that would happen only on Flink 1.20 or later, which contains those definitions. The copied classes could be abandoned once the package gets updated to 1.20.

Just thinking out loud, as I see the interest and huge potential in finishing the lineage mechanism and would love to avoid blocking progress on that.
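For reference, a minimal sketch of what such a compile-time copy could look like, assuming the interface shape from the linked Flink PR (the authoritative signatures live there). The copy would sit in the connector's source tree for compilation only and be excluded from the packaged jar, so that on Flink 1.20+ the runtime's own definition is the one that gets loaded:

// Hypothetical compile-time copy, same package and name as upstream;
// excluded from the built connector jar per the workaround described above.
package org.apache.flink.streaming.api.lineage;

public interface LineageVertexProvider {

    // Assumed method shape; see apache/flink PR #23626 for the actual definition.
    LineageVertex getLineageVertex();
}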

@AHeise
Contributor

AHeise commented Sep 4, 2024

@AHeise How much time do you think is needed to accomplish this? Days, weeks, or perhaps months? In case of a longer period, would it make sense to think of some temporary workarounds?

It's going to be a few weeks. We don't need to wait for the release to go through though; we just need to get all fixes in.

A hacky way would be to copy the interfaces from https://github.com/apache/flink/pull/23626/files into the Kafka connector codebase and exclude them from the built jar. This would work because an interface definition in Java is needed only when interface methods are called, and that would happen only on Flink 1.20 or later, which contains those definitions. The copied classes could be abandoned once the package gets updated to 1.20.

This would be news to me, but an awesome solution (I don't think we need to copy the interfaces though). Can you provide a POC that shows that this works? Afaik the classloader needs to load all base classes/interfaces on class resolution (for things like instanceof checks).

@pawel-big-lebowski
Contributor

@AHeise I am trying to work on that within this PR -> https://github.com/pawel-big-lebowski/flink-connector-kafka/pull/1/files. What would be the best way to show whether it works or not? I believe it does because of Java's lazy class loading, but I'm not sure how to prove this via the Flink Kafka connector end-to-end tests.

@AHeise
Contributor

AHeise commented Sep 11, 2024

@AHeise I am trying to work on that within this PR -> https://github.com/pawel-big-lebowski/flink-connector-kafka/pull/1/files. What would be the best way to show whether it works or not? I believe it does because of Java's lazy class loading, but I'm not sure how to prove this via the Flink Kafka connector end-to-end tests.

I'm guessing the easiest way to check it is to simply include some dependency as provided, use an interface of that dependency in some class, and then execute some test that uses the class. Your assumption is that any test that refers to methods of the interface fails, while all tests that don't refer to them succeed.

I don't think you need to use this repo to make your point - you can also just set up some small POC repo. Or simplify a fork as much as needed.

@AHeise
Contributor

AHeise commented Sep 11, 2024

@AHeise I am trying to work on that within this PR -> https://github.com/pawel-big-lebowski/flink-connector-kafka/pull/1/files. What would be the best way to show whether it works or not? I believe it does because of Java's lazy class loading, but I'm not sure how to prove this via the Flink Kafka connector end-to-end tests.

Another way to do it could be to use Flink 1.20 for compiling (e.g. as provided) and Flink 1.19 as the test dependency. You should be able to use FlinkVersion#current() to assert that the setup actually works as expected.
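A minimal sketch of such a sanity check, assuming the module is compiled against Flink 1.20 (provided scope) while the test classpath resolves flink-core 1.19, that FlinkVersion#current() reports the classpath version as suggested above, and that the 1.19 enum includes a v1_19 constant (test class name is made up; JUnit 5 and AssertJ are assumed to be available, as in the connector's other tests):

// Hypothetical sanity test for the mixed-version setup described above.
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.flink.FlinkVersion;
import org.junit.jupiter.api.Test;

class MixedFlinkVersionSetupTest {

    @Test
    void runtimeClasspathProvidesFlink119() {
        // If the test dependency were silently resolved to 1.20, this assertion would fail,
        // and any "runs on 1.19" conclusion drawn from the other tests would be meaningless.
        assertThat(FlinkVersion.current()).isEqualTo(FlinkVersion.v1_19);
    }
}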

@pawel-big-lebowski
Contributor

Another way to do it could be to use Flink 1.20 for compiling (e.g. as provided) and Flink 1.19 as the test dependency.

@AHeise That could work as well. But the bigger problem is that flink-core 1.20 does not contain interfaces like LineageVertexProvider. The interface was merged to main, then removed from the 1.20 release, and merged again.

@@ -1,89 +1,5 @@
org.apache.flink.connector.kafka.sink.FlinkKafkaInternalProducerITCase does not satisfy: only one of the following predicates match:\
* reside in a package 'org.apache.flink.runtime.*' and contain any fields that are static, final, and of type InternalMiniClusterExtension and annotated with @RegisterExtension\
* reside outside of package 'org.apache.flink.runtime.*' and contain any fields that are static, final, and of type MiniClusterExtension and annotated with @RegisterExtension\

@davidradl davidradl Oct 2, 2024


I am curious what the thinking is behind removing all of these checks. Is this something that is required to move to 1.20?

Contributor Author


It was auto-removed.

Contributor Author

@HuangZhenQiu HuangZhenQiu Oct 11, 2024


I think you are right. Given that we need to support 1.18 and 1.19, these need to be kept.

@HuangZhenQiu HuangZhenQiu force-pushed the FLINK-34466-bump-flink-version branch 5 times, most recently from 581c5b1 to 1534c2d on October 11, 2024 22:54
@HuangZhenQiu HuangZhenQiu force-pushed the FLINK-34466-bump-flink-version branch from 1534c2d to 2b4ec36 on October 12, 2024 12:09
@HuangZhenQiu
Contributor Author

@AHeise
No more violation errors. But there is a CI timeout.

HuangZhenQiu#1

@AHeise AHeise merged commit 2dfdae6 into apache:main Oct 14, 2024
11 checks passed

boring-cyborg bot commented Oct 14, 2024

Awesome work, congrats on your first merged pull request!
