fix: allow Actions runner to recover tasks lost during fetching from intermittent errors #11401

Merged
mfenniak merged 1 commit from mfenniak/forgejo:idempotent-FetchTask into forgejo 2026-02-22 23:24:40 +01:00
Member

Probably fixes (or improves, at least) https://code.forgejo.org/forgejo/runner/issues/1391, paired with the runner implementation https://code.forgejo.org/forgejo/runner/pulls/1393.

When the FetchTask() API is invoked to create a task, unpreventable environmental errors may occur; for example, network disconnects and timeouts. It's possible that these errors occur after the server-side has assigned a task to the runner during the API call, in which case the error would cause that task to be lost between the two systems -- the server will think it's assigned to the runner, and the runner never received it. This can cause jobs to appear stuck at "Set up job".

The solution implemented here is idempotency in the FetchTask() API call, which means that the "same" FetchTask() API call is expected to return the same values. Specifically, the runner creates a unique identifier which is transmitted to the server as a header x-runner-request-key with each FetchTask() invocation which defines the sameness of the call, and the runner retains the value until the API call receives a successful response. The server implementation returns the same tasks back if a second (or Nth) call is received with the same x-runner-request-key header. In order to accomplish this is records the x-runner-request-key value that is used with each request that assigns tasks.

As a complication, the Forgejo server is unable to return the same ${{ secrets.forgejo_token }} for the task because the server stores that value in a one-way hash in the database. To resolve this, the server regenerates the token when retrieving tasks for a second time.

Checklist

The contributor guide contains information that will be helpful to first time contributors. There also are a few conditions for merging Pull Requests in Forgejo repositories. You are also welcome to join the Forgejo development chatroom.

Tests for Go changes

(can be removed for JavaScript changes)

  • I added test coverage for Go changes...
    • in their respective *_test.go for unit tests.
    • in the tests/integration directory if it involves interactions with a live Forgejo server.
  • I ran...
    • make pr-go before pushing

Documentation

  • I created a pull request to the documentation to explain to Forgejo users how to use this change.
  • I did not document these changes and I do not expect someone else to do it.

Release notes

  • This change will be noticed by a Forgejo user or admin (feature, bug fix, performance, etc.). I suggest to include a release note for this change.
  • This change is not visible to a Forgejo user or admin (refactor, dependency upgrade, etc.). I think there is no need to add a release note for this change.

The decision if the pull request will be shown in the release notes is up to the mergers / release team.

The content of the release-notes/<pull request number>.md file will serve as the basis for the release notes. If the file does not exist, the title of the pull request will be used instead.

Probably fixes (or improves, at least) https://code.forgejo.org/forgejo/runner/issues/1391, paired with the runner implementation https://code.forgejo.org/forgejo/runner/pulls/1393. When the FetchTask() API is invoked to create a task, unpreventable environmental errors may occur; for example, network disconnects and timeouts. It's possible that these errors occur after the server-side has assigned a task to the runner during the API call, in which case the error would cause that task to be lost between the two systems -- the server will think it's assigned to the runner, and the runner never received it. This can cause jobs to appear stuck at "Set up job". The solution implemented here is idempotency in the FetchTask() API call, which means that the "same" FetchTask() API call is expected to return the same values. Specifically, the runner creates a unique identifier which is transmitted to the server as a header `x-runner-request-key` with each FetchTask() invocation which defines the sameness of the call, and the runner retains the value until the API call receives a successful response. The server implementation returns the same tasks back if a second (or Nth) call is received with the same `x-runner-request-key` header. In order to accomplish this is records the `x-runner-request-key` value that is used with each request that assigns tasks. As a complication, the Forgejo server is unable to return the same `${{ secrets.forgejo_token }}` for the task because the server stores that value in a one-way hash in the database. To resolve this, the server regenerates the token when retrieving tasks for a second time. ## Checklist The [contributor guide](https://forgejo.org/docs/next/contributor/) contains information that will be helpful to first time contributors. There also are a few [conditions for merging Pull Requests in Forgejo repositories](https://codeberg.org/forgejo/governance/src/branch/main/PullRequestsAgreement.md). You are also welcome to join the [Forgejo development chatroom](https://matrix.to/#/#forgejo-development:matrix.org). ### Tests for Go changes (can be removed for JavaScript changes) - I added test coverage for Go changes... - [x] in their respective `*_test.go` for unit tests. - [x] in the `tests/integration` directory if it involves interactions with a live Forgejo server. - I ran... - [x] `make pr-go` before pushing ### Documentation - [ ] I created a pull request [to the documentation](https://codeberg.org/forgejo/docs) to explain to Forgejo users how to use this change. - [x] I did not document these changes and I do not expect someone else to do it. ### Release notes - [x] This change will be noticed by a Forgejo user or admin (feature, bug fix, performance, etc.). I suggest to include a release note for this change. - [ ] This change is not visible to a Forgejo user or admin (refactor, dependency upgrade, etc.). I think there is no need to add a release note for this change. *The decision if the pull request will be shown in the release notes is up to the mergers / release team.* The content of the `release-notes/<pull request number>.md` file will serve as the basis for the release notes. If the file does not exist, the title of the pull request will be used instead.
fix: make FetchTask() idempotent (requires opt-in from runner)
Some checks failed
testing / semgrep/ci (pull_request) Successful in 24s
testing / frontend-checks (pull_request) Successful in 2m4s
testing / backend-checks (pull_request) Successful in 6m3s
Integration tests for the release process / release-simulation (pull_request) Successful in 7m6s
testing / test-unit (pull_request) Successful in 8m34s
testing / test-remote-cacher (redict) (pull_request) Successful in 2m33s
testing / test-remote-cacher (redis) (pull_request) Successful in 2m39s
testing / test-remote-cacher (garnet) (pull_request) Successful in 2m37s
testing / test-remote-cacher (valkey) (pull_request) Successful in 2m38s
testing / test-mysql (pull_request) Successful in 27m45s
testing / test-sqlite (pull_request) Successful in 32m42s
testing / test-e2e (pull_request) Successful in 17m55s
issue-labels / release-notes (pull_request_target) Has been skipped
issue-labels / backporting (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 3s
testing / test-pgsql (pull_request) Successful in 30m21s
issue-labels / cascade (pull_request_target) Failing after 24s
testing / security-check (pull_request) Failing after 1m14s
2de328f06a
mfenniak changed title from fix: make FetchTask() idempotent (requires opt-in from runner) to fix: allow Actions runner to recover tasks lost from intermittent network errors 2026-02-22 01:04:22 +01:00
mfenniak changed title from fix: allow Actions runner to recover tasks lost from intermittent network errors to fix: allow Actions runner to recover tasks lost during fetching from intermittent errors 2026-02-22 01:04:37 +01:00
aahlenst approved these changes 2026-02-22 13:24:23 +01:00
aahlenst left a comment
Member

Thanks a lot. Seems to work well. I simulated a network interruption at the end of FetchTask(). Forgejo Runner and Forgejo managed to recover the situation. 🎉

Thanks a lot. Seems to work well. I simulated a network interruption at the end of `FetchTask()`. Forgejo Runner and Forgejo managed to recover the situation. 🎉
@ -143,0 +147,4 @@
// could be used indefinitely to retrieve the associated task(s), so requiring the correctly authenticated
// runner reduces that risk.
//
// Note: ephemeral runners cannot use this recovery model because they'll be deleted.
Member

This comment confuses me. "cannot use" implies that there's some technical means to prevent it, like queries with ephemeral = false. But I don't see that. If I'm not mistaken, ephemeral runners get deleted after the job completed. It's not obvious to me how that influences retrying FetchTask().

This comment confuses me. "cannot use" implies that there's some technical means to prevent it, like queries with `ephemeral = false`. But I don't see that. If I'm not mistaken, ephemeral runners get deleted after the job completed. It's not obvious to me how that influences retrying `FetchTask()`.
Author
Member

Ah, this is an analysis mistake on my part; I'll remove this comment.

Ah, this is an analysis mistake on my part; I'll remove this comment.
mfenniak marked this conversation as resolved
@ -90,0 +112,4 @@
jobs:
job1:
strategy:
# matrix creates 125 different jobs from one push...
Member

Why does this test need 125 jobs? I only see a single job being tested.

Why does this test need 125 jobs? I only see a single job being tested.
Author
Member

Two jobs are tested, as the last thing the test does is reset to a new request key and ensure it doesn't get the same results. But I've trimmed the matrix down -- it was just copied from the concurrency test above originally.

Two jobs are tested, as the last thing the test does is reset to a new request key and ensure it doesn't get the same results. But I've trimmed the matrix down -- it was just copied from the concurrency test above originally.
mfenniak marked this conversation as resolved
mfenniak force-pushed idempotent-FetchTask from 2de328f06a
Some checks failed
testing / semgrep/ci (pull_request) Successful in 24s
testing / frontend-checks (pull_request) Successful in 2m4s
testing / backend-checks (pull_request) Successful in 6m3s
Integration tests for the release process / release-simulation (pull_request) Successful in 7m6s
testing / test-unit (pull_request) Successful in 8m34s
testing / test-remote-cacher (redict) (pull_request) Successful in 2m33s
testing / test-remote-cacher (redis) (pull_request) Successful in 2m39s
testing / test-remote-cacher (garnet) (pull_request) Successful in 2m37s
testing / test-remote-cacher (valkey) (pull_request) Successful in 2m38s
testing / test-mysql (pull_request) Successful in 27m45s
testing / test-sqlite (pull_request) Successful in 32m42s
testing / test-e2e (pull_request) Successful in 17m55s
issue-labels / release-notes (pull_request_target) Has been skipped
issue-labels / backporting (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 3s
testing / test-pgsql (pull_request) Successful in 30m21s
issue-labels / cascade (pull_request_target) Failing after 24s
testing / security-check (pull_request) Failing after 1m14s
to 485c47ce9b
Some checks failed
issue-labels / release-notes (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 2s
testing / semgrep/ci (pull_request) Successful in 13s
testing / frontend-checks (pull_request) Successful in 1m5s
testing / backend-checks (pull_request) Successful in 3m13s
Integration tests for the release process / release-simulation (pull_request) Successful in 5m27s
testing / test-unit (pull_request) Successful in 6m3s
testing / test-remote-cacher (garnet) (pull_request) Successful in 2m0s
testing / test-remote-cacher (redict) (pull_request) Successful in 2m0s
testing / test-remote-cacher (valkey) (pull_request) Successful in 2m1s
testing / test-remote-cacher (redis) (pull_request) Successful in 2m2s
testing / test-e2e (pull_request) Successful in 21m55s
testing / test-mysql (pull_request) Successful in 23m53s
testing / test-sqlite (pull_request) Successful in 29m51s
testing / test-pgsql (pull_request) Successful in 32m56s
testing / security-check (pull_request) Failing after 1m12s
2026-02-22 17:48:12 +01:00
Compare
mfenniak force-pushed idempotent-FetchTask from 485c47ce9b
Some checks failed
issue-labels / release-notes (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 2s
testing / semgrep/ci (pull_request) Successful in 13s
testing / frontend-checks (pull_request) Successful in 1m5s
testing / backend-checks (pull_request) Successful in 3m13s
Integration tests for the release process / release-simulation (pull_request) Successful in 5m27s
testing / test-unit (pull_request) Successful in 6m3s
testing / test-remote-cacher (garnet) (pull_request) Successful in 2m0s
testing / test-remote-cacher (redict) (pull_request) Successful in 2m0s
testing / test-remote-cacher (valkey) (pull_request) Successful in 2m1s
testing / test-remote-cacher (redis) (pull_request) Successful in 2m2s
testing / test-e2e (pull_request) Successful in 21m55s
testing / test-mysql (pull_request) Successful in 23m53s
testing / test-sqlite (pull_request) Successful in 29m51s
testing / test-pgsql (pull_request) Successful in 32m56s
testing / security-check (pull_request) Failing after 1m12s
to c874cce6ea
All checks were successful
issue-labels / release-notes (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 2s
testing / semgrep/ci (pull_request) Successful in 13s
testing / frontend-checks (pull_request) Successful in 1m15s
testing / backend-checks (pull_request) Successful in 3m5s
testing / test-unit (pull_request) Successful in 6m13s
testing / test-remote-cacher (valkey) (pull_request) Successful in 1m49s
testing / test-remote-cacher (redict) (pull_request) Successful in 1m50s
testing / test-remote-cacher (garnet) (pull_request) Successful in 1m50s
testing / test-remote-cacher (redis) (pull_request) Successful in 1m51s
testing / test-e2e (pull_request) Successful in 18m38s
testing / test-mysql (pull_request) Successful in 21m13s
testing / test-sqlite (pull_request) Successful in 26m19s
testing / test-pgsql (pull_request) Successful in 29m31s
testing / security-check (pull_request) Successful in 51s
2026-02-22 18:30:20 +01:00
Compare
mfenniak force-pushed idempotent-FetchTask from c874cce6ea
All checks were successful
issue-labels / release-notes (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 2s
testing / semgrep/ci (pull_request) Successful in 13s
testing / frontend-checks (pull_request) Successful in 1m15s
testing / backend-checks (pull_request) Successful in 3m5s
testing / test-unit (pull_request) Successful in 6m13s
testing / test-remote-cacher (valkey) (pull_request) Successful in 1m49s
testing / test-remote-cacher (redict) (pull_request) Successful in 1m50s
testing / test-remote-cacher (garnet) (pull_request) Successful in 1m50s
testing / test-remote-cacher (redis) (pull_request) Successful in 1m51s
testing / test-e2e (pull_request) Successful in 18m38s
testing / test-mysql (pull_request) Successful in 21m13s
testing / test-sqlite (pull_request) Successful in 26m19s
testing / test-pgsql (pull_request) Successful in 29m31s
testing / security-check (pull_request) Successful in 51s
to c52eaa64fb
All checks were successful
testing / semgrep/ci (pull_request) Successful in 11s
testing / frontend-checks (pull_request) Successful in 1m8s
issue-labels / release-notes (pull_request_target) Has been skipped
requirements / merge-conditions (pull_request) Successful in 3s
testing / backend-checks (pull_request) Successful in 3m35s
testing / test-unit (pull_request) Successful in 7m55s
testing / test-remote-cacher (redis) (pull_request) Successful in 1m41s
testing / test-remote-cacher (valkey) (pull_request) Successful in 1m42s
testing / test-remote-cacher (garnet) (pull_request) Successful in 1m44s
testing / test-remote-cacher (redict) (pull_request) Successful in 1m38s
testing / test-e2e (pull_request) Successful in 23m54s
issue-labels / cascade (pull_request_target) Successful in 27m37s
testing / test-mysql (pull_request) Successful in 25m46s
testing / test-sqlite (pull_request) Successful in 28m17s
testing / test-pgsql (pull_request) Successful in 33m54s
testing / security-check (pull_request) Successful in 49s
issue-labels / backporting (pull_request_target) Has been skipped
milestone / set (pull_request_target) Successful in 13s
2026-02-22 22:32:30 +01:00
Compare
Collaborator
cascading-pr updated at https://code.forgejo.org/forgejo/end-to-end/pulls/1610
mfenniak merged commit 0ae6235386 into forgejo 2026-02-22 23:24:40 +01:00
mfenniak deleted branch idempotent-FetchTask 2026-02-22 23:24:44 +01:00
Owner

should we backport to v14 if it proves to be reliable?

should we backport to v14 if it proves to be reliable?
Author
Member

@viceice wrote in #11401 (comment):

should we backport to v14 if it proves to be reliable?

I could be convinced either way, but I'd lean towards no. The impact of the issue is low -- rare occurrences of "Set up job" stuck actions in the face of network unreliability, no data loss, no blocker to functional usage. We'll end up having an argument over developer confusion in backporting a v15 schema migration into v14. 🤣

On the other hand, in support of a backport, I'm always a fan of getting bugfixes in the hands of users.

I think the only reason to apply this early to Codeberg is that it's where we've observed the problem, the environment has known network outages, and those factors make it a good experimental environment to see if it solves the problem.

@viceice wrote in https://codeberg.org/forgejo/forgejo/pulls/11401#issuecomment-11042576: > should we backport to v14 if it proves to be reliable? I could be convinced either way, but I'd lean towards no. The impact of the issue is low -- rare occurrences of "Set up job" stuck actions in the face of network unreliability, no data loss, no blocker to functional usage. We'll end up having an argument over developer confusion in backporting a v15 schema migration into v14. 🤣 On the other hand, in support of a backport, I'm always a fan of getting bugfixes in the hands of users. I think the only reason to apply this early to Codeberg is that it's where we've observed the problem, the environment has known network outages, and those factors make it a good experimental environment to see if it solves the problem.
Sign in to join this conversation.
No reviewers
No labels
arch
riscv64
backport/v1.19
backport/v1.20
backport/v1.21/forgejo
backport/v10.0/forgejo
backport/v11.0/forgejo
backport/v12.0/forgejo
backport/v13.0/forgejo
backport/v14.0/forgejo
backport/v7.0/forgejo
backport/v8.0/forgejo
backport/v9.0/forgejo
breaking
bug
bug
confirmed
bug
duplicate
bug
needs-more-info
bug
new-report
bug
reported-upstream
code/actions
code/api
code/auth
code/auth/faidp
code/auth/farp
code/email
code/federation
code/git
code/migrations
code/packages
code/wiki
database
MySQL
database
PostgreSQL
database
SQLite
dependency-upgrade
dependency
certmagic
dependency
chart.js
dependency
Chi
dependency
Chroma
dependency
citation.js
dependency
codespell
dependency
css-loader
dependency
devcontainers
dependency
dropzone
dependency
editorconfig-checker
dependency
elasticsearch
dependency
enmime
dependency
F3
dependency
ForgeFed
dependency
garage
dependency
Git
dependency
git-backporting
dependency
Gitea
dependency
gitignore
dependency
go-ap
dependency
go-enry
dependency
go-gitlab
dependency
Go-org
dependency
go-rpmutils
dependency
go-sql-driver mysql
dependency
go-swagger
dependency
go-version
dependency
go-webauthn
dependency
gocron
dependency
Golang
dependency
goldmark
dependency
goquery
dependency
Goth
dependency
grpc-go
dependency
happy-dom
dependency
Helm
dependency
image-spec
dependency
jsonschema
dependency
KaTeX
dependency
lint
dependency
MariaDB
dependency
Mermaid
dependency
minio-go
dependency
misspell
dependency
Monaco
dependency
PDFobject
dependency
playwright
dependency
postcss
dependency
postcss-plugins
dependency
pprof
dependency
prometheus client_golang
dependency
protobuf
dependency
relative-time-element
dependency
renovate
dependency
reply
dependency
ssh
dependency
swagger-ui
dependency
tailwind
dependency
temporal-polyfill
dependency
terminal-to-html
dependency
tests-only
dependency
text-expander-element
dependency
urfave
dependency
vfsgen
dependency
vite
dependency
Woodpecker CI
dependency
x tools
dependency
XORM
Discussion
duplicate
enhancement/feature
forgejo/accessibility
forgejo/branding
forgejo/ci
forgejo/commit-graph
forgejo/documentation
forgejo/furnace cleanup
forgejo/i18n
forgejo/interop
forgejo/moderation
forgejo/privacy
forgejo/release
forgejo/scaling
forgejo/security
forgejo/ui
Gain
High
Gain
Nice to have
Gain
Undefined
Gain
Very High
good first issue
i18n/backport-stable
impact
large
impact
medium
impact
small
impact
unknown
Incompatible license
issue
closed
issue
do-not-exist-yet
issue
open
manual test
Manually tested during feature freeze
OS
FreeBSD
OS
Linux
OS
macOS
OS
Windows
problem
QA
regression
release blocker
Release Cycle
Feature Freeze
release-blocker
v7.0
release-blocker
v7.0.1
release-blocker
v7.0.2
release-blocker
v7.0.3
release-blocker
v7.0.4
release-blocker
v8.0.0
release-blocker/v9.0.0
run-all-playwright-tests
run-end-to-end-tests
test
manual
test
needed
test
needs-help
test
not-needed
test
present
untested
User research - time-tracker
valuable code
worth a release-note
User research - Accessibility
User research - Blocked
User research - Community
User research - Config (instance)
User research - Errors
User research - Filters
User research - Future backlog
User research - Git workflow
User research - Labels
User research - Moderation
User research - Needs input
User research - Notifications/Dashboard
User research - Rendering
User research - Repo creation
User research - Repo units
User research - Security
User research - Settings (in-app)
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
forgejo/forgejo!11401
No description provided.