@FlomoN FlomoN commented Oct 17, 2025

⚠️ Pre Checklist

Please complete ALL items in this checklist, and remove before submitting

  • I have read through the Contributing Documentation.
  • I have added relevant tests.
  • I have added relevant documentation.
  • I will add labels to the PR, such as pr-type/bug-fix, pr-type/feature-development, etc.

Summary

This PR fixes the pagination bug from #8615 and adds the ability to adjust the collection parameters of the GitHub GraphQL job collector through environment variables.

The bug was caused by the pagination implementation for workflow jobs in job_collector.go, which did not work together with the simultaneous batching of multiple workflow runs. In getPageInfo, the first workflow run with HasNextPage set to true returned its EndCursor, which was then used to paginate all workflow runs in the batch. That cursor is only valid for the run it came from, so every further page of the other workflow runs in the same batch was missed.

func getPageInfo(query interface{}, args *helper.GraphqlCollectorArgs) (*helper.GraphqlQueryPageInfo, error) {
	queryWrapper := query.(*GraphqlQueryCheckRunWrapper)
	hasNextPage := false
	endCursor := ""
	for _, node := range queryWrapper.Node {
		if node.CheckSuite.CheckRuns.PageInfo.HasNextPage {
			hasNextPage = true
			endCursor = node.CheckSuite.CheckRuns.PageInfo.EndCursor // <- This cursor will be used for all workflow runs, since only one skipCursor variable exists.
			break // <- Then stops after the first was found
		}
	}
	return &helper.GraphqlQueryPageInfo{
		EndCursor:   endCursor,
		HasNextPage: hasNextPage,
	}, nil
}
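To make the failure mode concrete, here is a minimal sketch using simplified, hypothetical types (pageInfo, firstCursor and runsLeftBehind are illustrative names, not the plugin's real structs or functions). It mirrors the "first cursor wins" logic above and lists which runs in a batch would still have unread pages:

```go
package main

// pageInfo is a simplified, hypothetical stand-in for the per-run page info
// that the collector reads from the GraphQL response.
type pageInfo struct {
	HasNextPage bool
	EndCursor   string
}

// firstCursor mirrors the logic of getPageInfo above: it returns the cursor of
// the first run that reports another page and ignores all later runs.
func firstCursor(batch []pageInfo) (string, bool) {
	for _, p := range batch {
		if p.HasNextPage {
			return p.EndCursor, true
		}
	}
	return "", false
}

// runsLeftBehind returns the indexes of runs that also have more pages but
// whose cursor was not the one picked, so their remaining jobs are never fetched.
func runsLeftBehind(batch []pageInfo) []int {
	var missed []int
	picked := false
	for i, p := range batch {
		if !p.HasNextPage {
			continue
		}
		if !picked {
			picked = true // this run's cursor is the one firstCursor returns
			continue
		}
		missed = append(missed, i)
	}
	return missed
}
```

With a batch where runs 1, 2 and 4 each have another page, only run 1's cursor is returned, and runs 2 and 4 are the ones left behind.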

Since some people (@ClaudioMascaro and @robaca) reported timeout issues with this plugin for large repositories with large workflows, I didn't want to take away either option. But combining pagination and batching at the same time didn't seem feasible without adding a lot of hard-to-maintain complexity (like keeping track of all produced EndCursors and submitting them to the following collection, but only for those workflow runs that had more pages...).

I ended up implementing a configurable mode switch:

  • Either you choose BATCHING mode, which gives slightly better performance by reducing GraphQL API calls, as long as your workflow runs don't have more than 20 jobs (this number is adjustable, however).
  • Or you choose PAGINATING mode, if your workflow runs typically have more than 20 jobs and fetching all jobs at once could cause timeouts of the GraphQL API.

I think that rate-limit-wise it doesn't make a difference (if I understood the point system correctly), since both ways would consume about the same number of points in total.

Since it was suggested in #8469 to make the batch size and page size configurable via environment variables, I also added this to give users more flexibility in finding the sweet spot for their setup.

Does this close any open issues?

Closes #8615

Screenshots

Include any relevant screenshots here.

Other Information

Maybe it would also make sense to add these configuration options to Scope Config and thus make them configurable on a per scope config basis, but this would also need migration scripts and so on. So maybe an improvement for the future?

I will add tests as well in the following days, if this approach seems appropriate to you.

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 17, 2025

FlomoN commented Oct 20, 2025

@klesh here it is :)


@klesh klesh left a comment

Neat.
Could you kindly update the docs as well?
It is hosted at https://devlake.apache.org, the repository is https://github.com/apache/incubator-devlake-website

@klesh klesh merged commit 4b418b9 into apache:main Oct 20, 2025
11 checks passed

klesh commented Oct 20, 2025

The configurable page_size should help #8614 as well.

Development

Successfully merging this pull request may close these issues.

[Bug][Github_graphql] Job_collector not paginating correctly