[Refactor][Gitlab] Single account collection for on-prem instance #8283

Open
hippojay opened this issue Feb 4, 2025 · 5 comments
Labels
component/plugins (This issue or PR relates to plugins), improvement, type/refactor (This issue is to refactor existing code)

Comments


hippojay commented Feb 4, 2025

What and why to refactor

The GitLab plugin collects account information relating to users (account_collector.go). For GitLab.com this is done on a per-project basis, and it runs for each GitLab repository collection.

For on-premise instances, there is a check that lets the plugin use the global /users API endpoint. However, this results in duplicated operations for each repository (data scope) in the project. For example, for a DevLake project with 20 data scopes, the account information is gathered, extracted and converted 20 times, and 19 of those runs repeat the same data.

User collection on a large user base (7000 users) takes 3 min 30 s for collection, extraction and conversion per stage.
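To make the check concrete, here is a minimal sketch of how the collection path could be chosen from the endpoint. It is not the plugin's actual code: `accountsPath` is a hypothetical helper, and the per-project member path is an assumption based on the GitLab REST API rather than something confirmed from account_collector.go.

```go
package main

import (
	"fmt"
	"strings"
)

// accountsPath returns the API path to collect accounts from.
// Hypothetical helper: the name and the per-project path are assumptions.
func accountsPath(endpoint string, projectID int) string {
	if strings.HasPrefix(endpoint, "https://gitlab.com") {
		// GitLab.com: accounts are collected per project.
		return fmt.Sprintf("/projects/%d/members/all", projectID)
	}
	// On-premise: the global listing does not depend on the project,
	// so every data scope ends up fetching the same data.
	return "/users"
}

func main() {
	fmt.Println(accountsPath("https://gitlab.com/api/v4", 42))
	fmt.Println(accountsPath("https://gitlab.example.internal/api/v4", 42))
}
```

For an on-premise endpoint the result is the same for every project ID, which is why repeating it per data scope is redundant.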
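As a rough back-of-the-envelope calculation, taking the 3 min 30 s figure as the cost of one full pass and the 20 data scopes from the example above, the duplicated work can be estimated as follows. `duplicatedTime` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// duplicatedTime returns the total time spent on user stages across all
// data scopes, and the part of it that is redundant when one pass would do.
func duplicatedTime(scopes int, perPass time.Duration) (total, redundant time.Duration) {
	if scopes <= 0 {
		return 0, 0
	}
	total = time.Duration(scopes) * perPass
	redundant = total - perPass
	return total, redundant
}

func main() {
	total, redundant := duplicatedTime(20, 3*time.Minute+30*time.Second)
	fmt.Println("total:", total, "redundant:", redundant)
}
```

With these numbers, 70 minutes are spent on user data in total, of which 66.5 minutes repeat work already done.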

Describe the solution you'd like

Ideally, account collection for on-premise instances should be a single GitLab stage that is added to the pipeline. Alternatively, it could be added as a subtask to the first collection stage only and omitted from the other stages, but that makes the collection less visible.

@hippojay hippojay added the type/refactor This issue is to refactor existing code label Feb 4, 2025
Author

hippojay commented Feb 4, 2025

I've already looked into this, but my Go skills are limited, which makes the code base flow hard to follow. If anyone has pointers, I can look into implementing it.

@dosubot dosubot bot added component/plugins This issue or PR relates to plugins improvement labels Feb 4, 2025
Author

hippojay commented Feb 4, 2025

I've implemented this by altering the pipeline plan:

  • Replicate the test in account_collector.go to check whether the endpoint string starts with https://gitlab.com
  • Insert a first stage that runs the Collect Users | Extract Users | Convert Users subtasks directly
  • Remove any subsequent subtasks that match Collect Users | Extract Users | Convert Users
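A rough sketch of those three steps, for anyone following along. This is not the code from the linked commit: `Task`, `Stage` and `Plan` are simplified stand-ins for DevLake's pipeline plan types, `hoistUserCollection` is a hypothetical name, and the sketch assumes every GitLab task lists its subtasks explicitly.

```go
package main

import (
	"fmt"
	"strings"
)

// Task, Stage and Plan are simplified, hypothetical stand-ins for the
// pipeline plan types; the real DevLake types may differ.
type Task struct {
	Plugin   string
	Subtasks []string
	Options  map[string]interface{}
}
type Stage []*Task
type Plan []Stage

var userSubtasks = []string{"Collect Users", "Extract Users", "Convert Users"}

func isUserSubtask(name string) bool {
	for _, u := range userSubtasks {
		if name == u {
			return true
		}
	}
	return false
}

// hoistUserCollection strips the user subtasks from every gitlab task and
// prepends one stage that runs them once. It edits the tasks in place.
// Plans for gitlab.com endpoints are returned unchanged.
func hoistUserCollection(plan Plan, endpoint string) Plan {
	if strings.HasPrefix(endpoint, "https://gitlab.com") {
		return plan
	}
	var first *Task
	for _, stage := range plan {
		for _, task := range stage {
			if task.Plugin != "gitlab" {
				continue
			}
			kept := make([]string, 0, len(task.Subtasks))
			found := false
			for _, s := range task.Subtasks {
				if isUserSubtask(s) {
					found = true
					continue
				}
				kept = append(kept, s)
			}
			if found && first == nil {
				// Reuse the options of the first scope that had user subtasks.
				first = &Task{
					Plugin:   "gitlab",
					Subtasks: append([]string(nil), userSubtasks...),
					Options:  task.Options,
				}
			}
			task.Subtasks = kept
		}
	}
	if first == nil {
		return plan
	}
	return append(Plan{Stage{first}}, plan...)
}

func main() {
	subtasks := func() []string {
		return []string{"Collect Users", "Extract Users", "Convert Users", "Collect Issues"}
	}
	plan := Plan{
		{{Plugin: "gitlab", Subtasks: subtasks(), Options: map[string]interface{}{"projectId": 1}}},
		{{Plugin: "gitlab", Subtasks: subtasks(), Options: map[string]interface{}{"projectId": 2}}},
	}
	for i, stage := range hoistUserCollection(plan, "https://gitlab.example.internal/api/v4") {
		for _, task := range stage {
			fmt.Println(i, task.Subtasks)
		}
	}
}
```

The new first stage borrows the options of one existing scope, since the user subtasks still need a connection and a project to run against.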

It should allow normal functionality when using gitlab.com. I've not submitted this as a pull request yet, because it's the first time I've written more than a few lines of Go:

hippojay@66286d1

I've done some simple testing and have seen an improvement.

Image

However, the UI elements are not clean, because of the requirements of the plugins.

Image

Author

hippojay commented Feb 4, 2025

This also stops the _raw_gitlab_api_users table from growing to an alarming size (in my case ~11M rows for 7000 users).

Contributor

klesh commented Feb 6, 2025

Nice work, the code looks solid to me, looking forward to your PR.

Author

hippojay commented Feb 6, 2025

Thanks. However, I've been digging a little deeper: whilst my code works, there is a link with getapiprojects that may cause it to fail for others. I'm looking at other ways of implementing this.
