Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging implementations of Strict and Logical DNS (draft for discussion) #38483

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

grnmeira
Copy link
Member

@grnmeira grnmeira commented Feb 18, 2025

Commit Message:
Additional Description:

This is a draft for discussing the approach taken in this PR. Note that there's a few things that still need to be worked on before this becomes an actual PR.

The main idea here is tackling #37145 (which is a follow up from #36353).

The approach simplifies the code merging the two current implementation of DNS cluster (logical and strict) by removing unnecessary logical DNS, as it can be seen as a special case of strict DNS.

According to the documentation:

When using strict DNS service discovery, Envoy will continuously and asynchronously resolve the specified DNS targets. Each returned IP address in the DNS result will be considered an explicit host in the upstream cluster. This means that if the query returns three IP addresses, Envoy will assume the cluster has three hosts, and all three should be load balanced to. If a host is removed from the result Envoy assumes it no longer exists and will drain traffic from any existing connection pools

Logical DNS uses a similar asynchronous resolution mechanism to strict DNS. However, instead of strictly taking the results of the DNS query and assuming that they comprise the entire upstream cluster, a logical DNS cluster only uses the first IP address returned when a new connection needs to be initiated. Thus, a single logical connection pool may contain physical connections to a variety of different upstream hosts. Connections are never drained, including on a successful DNS resolution that returns 0 hosts.

There are a few caveats with this approach, as in error conditions and membership updates, and the merge should address those.

There's a few things still missing in this draft:

  • Check more unit tests.
  • Add feature flags/runtime flags.
  • Some error conditions from logical DNS.
  • The "legacy" implementations were removed, but we still need the them for a transitional period.
  • Removal of exception throws (required by code checking)
  • Etc.

Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #38483 was opened by grnmeira.

see: more, trace.

const auto& locality_lb_endpoints = load_assignment_.endpoints();
for (const auto& locality_lb_endpoint : locality_lb_endpoints) {
if (!all_addresses_in_single_endpoint_) {
THROW_IF_NOT_OK(validateEndpointsForZoneAwareRouting(locality_lb_endpoint));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an opportunity to not throw an exception here and below and if we don't I think we should avoid throwing exceptions.

Basically the exceptions are thrown when input is somehow incorrect, so we can verify the input before calling constructor. Additionally, constructor only seem to be called from a static create method above, which already returns StatusOr - that is a good place to put input validation in. Though strictly speaking even that should not be necessary, given that constructor already gets creation_status parameter in by reference.

Maybe we can try to return errors through the creation_status parameter instead of throwing an exception, unless it breaks something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed. We actually have a code check that wouldn't allow this specific file to throw exceptions (I did a push --no-verify for now). So it'd be great to keep it exception-free.

case envoy::config::cluster::v3::Cluster::LOGICAL_DNS:
cluster_name = "envoy.cluster.logical_dns";
case envoy::config::cluster::v3::Cluster::STRICT_DNS:
cluster_name = "envoy.cluster.dns";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract the whole logic of finding the right factory into its own function. It would also help to put a comment there providing some context as to why all those manipulations with config type name and cluster names are needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also you should set all_addresses_in_single_endpoint to true/false depending on the the cluster name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ideally like to see this structured by having all cluster types implemented as extensions, and then supporting the legacy types by basically converting the legacy configs to the extension configs. In other words, I think we ideally want something like the following logic (pseudo-code):

Cluster::CustomClusterType cluster_type =
    cluster.has_cluster_type()
        ? cluster.cluster_type()
        : ConvertLegacyConfigToExtensionConfig(cluster);
std::string cluster_config_type_name =
    TypeUtil::typeUrlToDescriptorFullName(cluster_type.typed_config().type_url());
ClusterFactory* factory =
    Registry::FactoryRegistry<ClusterFactory>::getFactoryByType(cluster_config_type_name);

The idea is that ConvertLegacyConfigToExtensionConfig() would contain the code to convert the legacy enum-based config to the corresponding extension config.

I believe that @wbpcode previously did something like this for LB policies, so we probably want to model this work on that same pattern.

});
}

REGISTER_FACTORY(DnsClusterFactory, ClusterFactory);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to move it closer to the factory definition above and before the DnsClusterImpl

const envoy::extensions::clusters::dns::v3::DnsCluster& dns_cluster,
ClusterFactoryContext& context, Network::DnsResolverSharedPtr dns_resolver);

protected:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why constructor has to be protected instead of just private?

I figure that protected only matters when something derives from this class, which should not be the case given that it's new and nothing in this PR inherits it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a copy from the original strict DNS declarations, not sure why protected is used here. Only reason I can think of is derived classes being extended for tests and then friend classes needing that... I'll keep that in mind and see if we can make it private for a full PR.

@grnmeira
Copy link
Member Author

@markdroth if you're interested, as this is a follow up from #36353.

@@ -31,6 +31,11 @@ ClusterFactoryImplBase::create(const envoy::config::cluster::v3::Cluster& cluste
ProtobufWkt::Struct::GetDescriptor()->full_name())) {
cluster_config_type_name =
TypeUtil::typeUrlToDescriptorFullName(cluster.cluster_type().typed_config().type_url());
// This should be behind a runtime flag.
if (cluster_config_type_name == "envoy.cluster.strict_dns" ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be needed, because there should be no configs out there using these legacy names in a custom cluster type.

case envoy::config::cluster::v3::Cluster::LOGICAL_DNS:
cluster_name = "envoy.cluster.logical_dns";
case envoy::config::cluster::v3::Cluster::STRICT_DNS:
cluster_name = "envoy.cluster.dns";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ideally like to see this structured by having all cluster types implemented as extensions, and then supporting the legacy types by basically converting the legacy configs to the extension configs. In other words, I think we ideally want something like the following logic (pseudo-code):

Cluster::CustomClusterType cluster_type =
    cluster.has_cluster_type()
        ? cluster.cluster_type()
        : ConvertLegacyConfigToExtensionConfig(cluster);
std::string cluster_config_type_name =
    TypeUtil::typeUrlToDescriptorFullName(cluster_type.typed_config().type_url());
ClusterFactory* factory =
    Registry::FactoryRegistry<ClusterFactory>::getFactoryByType(cluster_config_type_name);

The idea is that ConvertLegacyConfigToExtensionConfig() would contain the code to convert the legacy enum-based config to the corresponding extension config.

I believe that @wbpcode previously did something like this for LB policies, so we probably want to model this work on that same pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants