@mbissa
Contributor

@mbissa mbissa commented Dec 2, 2025

Addresses: https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md

This PR adds subchannel metrics with the applicable labels, as per the gRFC proposal. disconnection_reason will be added in a follow-up PR.

RELEASE NOTES:

  • stats/otel: add subchannel metrics (without the disconnection reason) to eventually replace the pickfirst metrics.

@mbissa mbissa added this to the 1.78 Release milestone Dec 2, 2025
@mbissa mbissa requested a review from easwars December 2, 2025 19:58
@mbissa mbissa added the Type: Feature and Area: Observability labels Dec 2, 2025
@codecov

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.37%. Comparing base (432bda3) to head (cc6c6e3).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8738      +/-   ##
==========================================
+ Coverage   83.22%   83.37%   +0.15%     
==========================================
  Files         419      418       -1     
  Lines       32454    32425      -29     
==========================================
+ Hits        27009    27034      +25     
+ Misses       4057     4010      -47     
+ Partials     1388     1381       -7     
Files with missing lines Coverage Δ
clientconn.go 90.44% <100.00%> (-0.06%) ⬇️
internal/internal.go 100.00% <ø> (ø)
internal/transport/http2_client.go 92.80% <100.00%> (+0.18%) ⬆️
internal/transport/transport.go 90.86% <ø> (+2.15%) ⬆️
internal/xds/xds.go 80.55% <100.00%> (+5.55%) ⬆️
stream.go 81.88% <100.00%> (+0.04%) ⬆️

... and 36 files with indirect coverage changes

@easwars
Contributor

easwars commented Dec 2, 2025

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

@easwars
Contributor

easwars commented Dec 2, 2025

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed.

@arjan-bal
Contributor

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Here is how pickfirst handles this:

  1. When a subchannel connects, all remaining subchannels are removed from pickfirst's subchannel map.
    if newState.ConnectivityState == connectivity.Ready {
        connectionAttemptsSucceededMetric.Record(b.metricsRecorder, 1, b.target)
        b.shutdownRemainingLocked(sd)
  2. The updateSubConnState method ignores any updates from subchannels that are not in the subchannel map.
    // Previously relevant SubConns can still callback with state updates.
    // To prevent pickers from returning these obsolete SubConns, this logic
    // is included to check if the current list of active SubConns includes this
    // SubConn.
    if !b.isActiveSCData(sd) {
        return
    }
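
The two steps above can be sketched in isolation. This is a minimal, self-contained illustration and not the actual pickfirst code: `balancer`, `subConn`, `active`, and `succeeded` are hypothetical stand-ins, and a plain counter takes the place of the metrics recorder.

```go
package main

import "fmt"

// state is a hypothetical stand-in for connectivity.State.
type state int

const (
	connecting state = iota
	ready
)

// subConn is a hypothetical stand-in for a subchannel handle.
type subConn struct{ addr string }

// balancer tracks the subchannels it still considers active.
type balancer struct {
	active    map[*subConn]bool
	succeeded int // Stand-in for the connection-attempts-succeeded counter.
}

// updateState mirrors the two steps: ignore updates from subchannels that are
// no longer active, and on the first READY record one success and drop the rest.
func (b *balancer) updateState(sc *subConn, s state) {
	if !b.active[sc] {
		return // Obsolete subchannel: nothing is recorded.
	}
	if s == ready {
		b.succeeded++
		for other := range b.active {
			if other != sc {
				delete(b.active, other)
			}
		}
	}
}

func main() {
	first, second := &subConn{"10.0.0.1:443"}, &subConn{"10.0.0.2:443"}
	b := &balancer{active: map[*subConn]bool{first: true, second: true}}

	b.updateState(first, ready)  // The winning attempt.
	b.updateState(second, ready) // A late update from a discarded attempt.

	fmt.Println("successes recorded:", b.succeeded)
}
```

In this sketch the late update from the second subchannel returns early, so the counter is incremented only for the winner.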

@easwars easwars removed their assignment Dec 5, 2025
@mbissa
Contributor Author

mbissa commented Dec 8, 2025

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed.

Issue filed - #8752
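
The transition period quoted from A94 amounts to reporting under both names until the old metrics are removed. A rough sketch of that idea follows, assuming a hypothetical map-backed `recorder` and a hypothetical `reportOld` switch. In the real code the two metrics are recorded in different components, so this only illustrates the dual-reporting window, not how it is wired.

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a metrics recorder.
type recorder map[string]int64

const (
	// Old pick-first metric, to be removed after the transition period.
	oldSucceeded = "grpc.lb.pick_first.connection_attempts_succeeded"
	// New subchannel metric.
	newSucceeded = "grpc.subchannel.connection_attempts_succeeded"
)

// recordSuccess always reports the new metric and, while reportOld is set,
// the old one as well. reportOld is a hypothetical switch for this sketch.
func recordSuccess(r recorder, reportOld bool) {
	r[newSucceeded]++
	if reportOld {
		r[oldSucceeded]++
	}
}

func main() {
	r := recorder{}
	recordSuccess(r, true)  // During the transition: both metrics.
	recordSuccess(r, false) // After removal: only the new metric.
	fmt.Println(r[oldSucceeded], r[newSucceeded])
}
```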

@mbissa
Contributor Author

mbissa commented Dec 8, 2025

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Handled as per PR

t.Errorf("Unexpected data for metric %v, got: %v, want: %v", "grpc.lb.pick_first.disconnections", got, 0)
}

//Checking for subchannel metrics as well

Contributor

Nit: Space between the // and the start of the comment, and please terminate comment sentences with a period.

Contributor Author

Done.

Contributor

Here and elsewhere. There are still a bunch of places with comments that are not terminated with periods. See: go/go-style/decisions#comment-sentences

Comment on lines +2172 to +2178
// Wait for the SUCCESS metric to ensure recording logic has processed.
waitForMetric(ctx, t, tmr, "grpc.subchannel.connection_attempts_succeeded")

// Verify Success: Exactly 1 (The Winner).
if got, _ := tmr.Metric("grpc.subchannel.connection_attempts_succeeded"); got != 1 {
	t.Errorf("Unexpected data for metric %v, got: %v, want: 1", "grpc.subchannel.connection_attempts_succeeded", got)
}

Contributor

How does this actually ensure that we check the value of the metric only after the first connection attempt has been completely processed? We do call holds[0].Resume(), but does that guarantee that the subchannel code sees the connection succeed and then drops it because the subchannel has been deleted by the LB policy?

Contributor Author

@mbissa mbissa Dec 9, 2025

We wait for the metric to be emitted. The connection-attempt-succeeded metric is emitted only when a connection is actually established. If the attempt is cancelled, it is not counted as successful; if the connection is lost after being established, that is recorded as a disconnection. In both scenarios, the number of attempts succeeded is always 1.

@easwars easwars removed their assignment Dec 9, 2025
Contributor

@easwars easwars left a comment

Looks good mostly. Just some minor nits.

Can you also please update the PR description to note that the disconnection_reason will be plumbed in a follow-up PR. Thanks.

Comment on lines 2027 to 2028
_, ok := tmr.Metric(metricName)
if ok {

Contributor

Nit: The assignment and the conditional can be moved into a single line:

	if _, ok := tmr.Metric(metricName); ok {
	   ...
	}

Contributor Author

done.

@easwars easwars removed their assignment Dec 10, 2025
@mbissa
Contributor Author

mbissa commented Dec 10, 2025

Looks good mostly. Just some minor nits.

Can you also please update the PR description to note that the disconnection_reason will be plumbed in a follow-up PR. Thanks.

done.

@easwars easwars removed their assignment Dec 10, 2025
@mbissa mbissa merged commit 6553ea1 into grpc:master Dec 10, 2025
15 checks passed
@mbissa mbissa deleted the a94-subchannel-metrics-final branch December 10, 2025 20:57
3 participants