CPU usage/efficiency #7894

Open

bolshoytoster opened this issue Feb 27, 2025 · 5 comments

Comments

bolshoytoster commented Feb 27, 2025

I'm a user of qBittorrent, and I've been experiencing high CPU usage (30-50%) recently, mostly from libtorrent's network thread.

I decided to investigate this with callgrind, and I found that ~17% of the total CPU usage was in parse_url_components, specifically in the calls to std::find, which, it seems, aren't being properly inlined/optimized (yes, I'm using -O3).

I modified this to use raw pointers and memchr instead of iterators and std::find, and I've been able to get a 1.6x speed improvement (according to my crude benchmarking):

std::tuple<std::string, std::string, std::string, int, std::string>
	parse_url_components(std::string url, error_code& ec)
{
	std::string hostname; // hostname only
	std::string auth; // user:pass
	std::string protocol; // http or https for instance
	int port = -1;

	const char *at;
	const char *colon;
	const char *slash;
	const char *question_mark;
	const char *hash;
	const char *port_pos;

	// PARSE URL
	const char *start = url.data();
	const char *str_end = start + url.length();
	// remove white spaces in front of the url
	while (start != str_end && is_space(*start))
		++start;
	const char *end = (const char *) memchr(start, ':', str_end - start);
	// check for a missing ':' before using end as an iterator
	if (!end)
	{
		ec = errors::unsupported_url_protocol;
		goto exit;
	}
	protocol.assign(start, end);
	++end;
	if (end == str_end || *end != '/')
	{
		ec = errors::unsupported_url_protocol;
		goto exit;
	}
	++end;
	if (end == str_end || *end != '/')
	{
		ec = errors::unsupported_url_protocol;
		goto exit;
	}
	++end;
	start = end;

	colon = (const char *) memchr(start, ':', str_end - start);
	at = colon ? (const char *) memchr(colon, '@', str_end - colon) : nullptr;
	slash = (const char *) memchr(start, '/', str_end - start);
	question_mark = (const char *) memchr(start, '?', str_end - start);
	hash = (const char *) memchr(start, '#', str_end - start);
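	// the end of the "authority" part of the URL is the earliest of
	// '/', '?' and '#' that was actually found, or the end of the string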
	end = str_end;
	if (slash) end = std::min(end, slash);
	if (question_mark) end = std::min(end, question_mark);
	if (hash) end = std::min(end, hash);

	if (at && at < end)
	{
		auth.assign(start, at);
		start = at;
		++start;
	}

	// this is for IPv6 addresses
	if (start != str_end && *start == '[')
	{
		port_pos = (const char *) memchr(start, ']', str_end - start);
		if (!port_pos)
		{
			ec = errors::expected_close_bracket_in_address;
			goto exit;
		}
		// strip the brackets
		hostname.assign(start + 1, port_pos);
		port_pos = (const char *) memchr(port_pos, ':', str_end - port_pos);
	}
	else
	{
		port_pos = (const char *) memchr(start, ':', str_end - start);
		if (port_pos && port_pos < end) hostname.assign(start, port_pos);
		else hostname.assign(start, end);
	}

	if (port_pos && port_pos < end)
	{
		++port_pos;
		for (auto i = port_pos; i < end; ++i)
		{
			if (is_digit(*i)) continue;
			ec = errors::invalid_port;
			goto exit;
		}
		port = std::atoi(std::string(port_pos, end).c_str());
	}

	start = end;
exit:
	std::string path_component(start, str_end);
	if (path_component.empty()
		|| path_component.front() == '?'
		|| path_component.front() == '#')
	{
		path_component.insert(path_component.begin(), '/');
	}

	return std::make_tuple(std::move(protocol)
		, std::move(auth)
		, std::move(hostname)
		, port
		, std::move(path_component));
}

There are probably faster implementations than this, but it brings parse_url_components down to ~6% of the total (still room for improvement).

Another thing I noticed is that parse_url_components is called hundreds of times per second (mostly by is_i2p_url) with the same URLs, so would it be worth caching the result and/or optimizing is_i2p_url, since it doesn't actually require parsing the full URL?
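
For illustration, the cache could be as simple as this sketch (is_i2p_url_cached is a hypothetical wrapper around the existing check; thread-safety and cache eviction are deliberately ignored here):

#include <string>
#include <unordered_map>

// hypothetical memoizing wrapper; is_i2p_url() is the existing,
// uncached check, and locking/eviction are skipped for brevity
bool is_i2p_url_cached(std::string const& url)
{
	static std::unordered_map<std::string, bool> cache;
	auto const it = cache.find(url);
	if (it != cache.end()) return it->second;
	bool const result = is_i2p_url(url);
	cache.emplace(url, result);
	return result;
}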

I'd like to know if this would be welcome as a PR, perhaps along with an optimization to is_i2p_url.

Another performance issue I'm looking into is the digest32 comparisons in the std::find_if in dht::routing_table::node_failed, which looks like it could be fixed if routing_table_node::live_nodes were a hashmap instead of a std::vector.

Chocobo1 commented Feb 28, 2025

> I decided to investigate this with callgrind, and I found that ~17% of the total CPU usage was in parse_url_components, specifically in the calls to std::find, which, it seems, aren't being properly inlined/optimized (yes, I'm using -O3).

Interesting! If I were to improve it, I would try these:

  1. Try std::string::find() rather than std::find().
     IMO raw pointers are a bit too low-level. They should be the last option after trying all other alternatives.
  2. If C++17 is available, use std::string_view (see the sketch after this list).
  3. Use regex. But I've heard std::regex is horrible, and I have no experience with boost::regex.
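
For reference, a minimal sketch of the std::string_view approach for just the scheme, assuming C++17 (find_protocol is a hypothetical helper, error handling elided):

#include <string_view>

// extract the scheme of a URL without copies or raw pointers
std::string_view find_protocol(std::string_view url)
{
	auto const pos = url.find(':');
	if (pos == std::string_view::npos) return {}; // no scheme
	return url.substr(0, pos);
}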

bolshoytoster commented Feb 28, 2025

@Chocobo1

> Try std::string::find() rather than std::find()

I just tried that, and I must've done something wrong, because somehow my std::string::find() implementation is actually ~40% slower than the original version. Maybe that's because it uses indices instead of iterators.

> If C++17 is available, use std::string_view

I also tried this, but the performance gain is negligible.

> Use regex. But I heard std::regex is horrible and I have no experience with boost::regex.

Regex seems a bit overkill for this.

I have, however, been able to speed up is_i2p_url by ~25% by manually inlining parse_url_components and removing unnecessary bits. But the real performance improvement comes from caching the results of is_i2p_url, giving a ~70% improvement for calls with cached URLs (most of them).

I would look into why is_i2p_url is called so often, but 70% is good enough for me.

I could open a PR with this caching, and it would remove the need to speed up parse_url_components.

Also, is there a reason the nodes in routing_table_node can't/shouldn't be hashmaps? I suppose it would make iteration slightly slower, and it would no longer be ordered. But I've just changed bucket_t to a std::unordered_map<node_id, node_entry> and updated everything relevant, and it seems fine (although I think my implementation is a bit buggy).
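
For illustration, the change I mean is roughly this shape (it assumes a std::hash<node_id> specialization, which would have to be provided):

#include <unordered_map>

// hypothetical: key buckets by node id instead of scanning a vector;
// requires a std::hash<node_id> specialization somewhere
using bucket_t = std::unordered_map<node_id, node_entry>;

// node_failed() could then replace its std::find_if scan with a
// constant-time lookup:
// auto const it = bucket.find(id);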

bolshoytoster commented Feb 28, 2025

After optimizing those, another issue has appeared: ~20% of the remaining total CPU usage is in session_settings::get_bool, specifically in the locking it does. The locking in itself should be fine, but session_settings::get_bool is called several times within three loops in update_tracker_timer. There's some low-hanging fruit here: those calls can simply be hoisted out of the loops.
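
The hoisting itself is mechanical; roughly this shape (sett, announce_all and the loop are illustrative, not the actual update_tracker_timer code):

// before: one locked settings lookup per iteration
for (auto const& t : trackers)
{
	if (sett.get_bool(settings_pack::announce_to_all_trackers)) { /* ... */ }
}

// after: a single locked lookup, hoisted out of the loop
bool const announce_all = sett.get_bool(settings_pack::announce_to_all_trackers);
for (auto const& t : trackers)
{
	if (announce_all) { /* ... */ }
}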

Chocobo1 commented Mar 1, 2025

I don't have much to add; I just found this interesting post: https://gms.tf/stdfind-and-memchr-optimizations.html#is-stdfind-faster-as-memchr

@bolshoytoster
Could you share your compiler version?

> I'd like to know if this would be welcome as a PR

IMO feel free to submit it. The important bits can be discussed in the PR.

bolshoytoster commented Mar 1, 2025

> Could you share your compiler version?

$ g++ --version
g++ (GCC) 14.2.1 20250207

And I build libtorrent with b2 optimization=speed inlining=on

I've opened a PR with a few more optimizations: #7898
