fix: allow Actions runner to recover tasks lost during fetching from intermittent errors #11401
Labels
No labels
arch
riscv64
backport/v1.19
backport/v1.20
backport/v1.21/forgejo
backport/v10.0/forgejo
backport/v11.0/forgejo
backport/v12.0/forgejo
backport/v13.0/forgejo
backport/v14.0/forgejo
backport/v7.0/forgejo
backport/v8.0/forgejo
backport/v9.0/forgejo
breaking
bug
bug
confirmed
bug
duplicate
bug
needs-more-info
bug
new-report
bug
reported-upstream
code/actions
code/api
code/auth
code/auth/faidp
code/auth/farp
code/email
code/federation
code/git
code/migrations
code/packages
code/wiki
database
MySQL
database
PostgreSQL
database
SQLite
dependency-upgrade
dependency
certmagic
dependency
chart.js
dependency
Chi
dependency
Chroma
dependency
citation.js
dependency
codespell
dependency
css-loader
dependency
devcontainers
dependency
dropzone
dependency
editorconfig-checker
dependency
elasticsearch
dependency
enmime
dependency
F3
dependency
ForgeFed
dependency
garage
dependency
Git
dependency
git-backporting
dependency
Gitea
dependency
gitignore
dependency
go-ap
dependency
go-enry
dependency
go-gitlab
dependency
Go-org
dependency
go-rpmutils
dependency
go-sql-driver mysql
dependency
go-swagger
dependency
go-version
dependency
go-webauthn
dependency
gocron
dependency
Golang
dependency
goldmark
dependency
goquery
dependency
Goth
dependency
grpc-go
dependency
happy-dom
dependency
Helm
dependency
image-spec
dependency
jsonschema
dependency
KaTeX
dependency
lint
dependency
MariaDB
dependency
Mermaid
dependency
minio-go
dependency
misspell
dependency
Monaco
dependency
PDFobject
dependency
playwright
dependency
postcss
dependency
postcss-plugins
dependency
pprof
dependency
prometheus client_golang
dependency
protobuf
dependency
relative-time-element
dependency
renovate
dependency
reply
dependency
ssh
dependency
swagger-ui
dependency
tailwind
dependency
temporal-polyfill
dependency
terminal-to-html
dependency
tests-only
dependency
text-expander-element
dependency
urfave
dependency
vfsgen
dependency
vite
dependency
Woodpecker CI
dependency
x tools
dependency
XORM
Discussion
duplicate
enhancement/feature
forgejo/accessibility
forgejo/branding
forgejo/ci
forgejo/commit-graph
forgejo/documentation
forgejo/furnace cleanup
forgejo/i18n
forgejo/interop
forgejo/moderation
forgejo/privacy
forgejo/release
forgejo/scaling
forgejo/security
forgejo/ui
Gain
High
Gain
Nice to have
Gain
Undefined
Gain
Very High
good first issue
i18n/backport-stable
impact
large
impact
medium
impact
small
impact
unknown
Incompatible license
issue
closed
issue
do-not-exist-yet
issue
open
manual test
Manually tested during feature freeze
OS
FreeBSD
OS
Linux
OS
macOS
OS
Windows
problem
QA
regression
release blocker
Release Cycle
Feature Freeze
release-blocker
v7.0
release-blocker
v7.0.1
release-blocker
v7.0.2
release-blocker
v7.0.3
release-blocker
v7.0.4
release-blocker
v8.0.0
release-blocker/v9.0.0
run-all-playwright-tests
run-end-to-end-tests
test
manual
test
needed
test
needs-help
test
not-needed
test
present
untested
User research - time-tracker
valuable code
worth a release-note
User research - Accessibility
User research - Blocked
User research - Community
User research - Config (instance)
User research - Errors
User research - Filters
User research - Future backlog
User research - Git workflow
User research - Labels
User research - Moderation
User research - Needs input
User research - Notifications/Dashboard
User research - Rendering
User research - Repo creation
User research - Repo units
User research - Security
User research - Settings (in-app)
No milestone
No project
No assignees
4 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo/forgejo!11401
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "mfenniak/forgejo:idempotent-FetchTask"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Probably fixes (or improves, at least) https://code.forgejo.org/forgejo/runner/issues/1391, paired with the runner implementation https://code.forgejo.org/forgejo/runner/pulls/1393.
When the FetchTask() API is invoked to create a task, unpreventable environmental errors may occur; for example, network disconnects and timeouts. It's possible that these errors occur after the server-side has assigned a task to the runner during the API call, in which case the error would cause that task to be lost between the two systems -- the server will think it's assigned to the runner, and the runner never received it. This can cause jobs to appear stuck at "Set up job".
The solution implemented here is idempotency in the FetchTask() API call, which means that the "same" FetchTask() API call is expected to return the same values. Specifically, the runner creates a unique identifier which is transmitted to the server as a header
x-runner-request-keywith each FetchTask() invocation which defines the sameness of the call, and the runner retains the value until the API call receives a successful response. The server implementation returns the same tasks back if a second (or Nth) call is received with the samex-runner-request-keyheader. In order to accomplish this is records thex-runner-request-keyvalue that is used with each request that assigns tasks.As a complication, the Forgejo server is unable to return the same
${{ secrets.forgejo_token }}for the task because the server stores that value in a one-way hash in the database. To resolve this, the server regenerates the token when retrieving tasks for a second time.Checklist
The contributor guide contains information that will be helpful to first time contributors. There also are a few conditions for merging Pull Requests in Forgejo repositories. You are also welcome to join the Forgejo development chatroom.
Tests for Go changes
(can be removed for JavaScript changes)
*_test.gofor unit tests.tests/integrationdirectory if it involves interactions with a live Forgejo server.make pr-gobefore pushingDocumentation
Release notes
The decision if the pull request will be shown in the release notes is up to the mergers / release team.
The content of the
release-notes/<pull request number>.mdfile will serve as the basis for the release notes. If the file does not exist, the title of the pull request will be used instead.fix: make FetchTask() idempotent (requires opt-in from runner)to fix: allow Actions runner to recover tasks lost from intermittent network errorsfix: allow Actions runner to recover tasks lost from intermittent network errorsto fix: allow Actions runner to recover tasks lost during fetching from intermittent errorsThanks a lot. Seems to work well. I simulated a network interruption at the end of
FetchTask(). Forgejo Runner and Forgejo managed to recover the situation. 🎉@ -143,0 +147,4 @@// could be used indefinitely to retrieve the associated task(s), so requiring the correctly authenticated// runner reduces that risk.//// Note: ephemeral runners cannot use this recovery model because they'll be deleted.This comment confuses me. "cannot use" implies that there's some technical means to prevent it, like queries with
ephemeral = false. But I don't see that. If I'm not mistaken, ephemeral runners get deleted after the job completed. It's not obvious to me how that influences retryingFetchTask().Ah, this is an analysis mistake on my part; I'll remove this comment.
@ -90,0 +112,4 @@jobs:job1:strategy:# matrix creates 125 different jobs from one push...Why does this test need 125 jobs? I only see a single job being tested.
Two jobs are tested, as the last thing the test does is reset to a new request key and ensure it doesn't get the same results. But I've trimmed the matrix down -- it was just copied from the concurrency test above originally.
2de328f06a485c47ce9b485c47ce9bc874cce6eac874cce6eac52eaa64fbcascading-pr updated at https://code.forgejo.org/forgejo/end-to-end/pulls/1610
should we backport to v14 if it proves to be reliable?
@viceice wrote in #11401 (comment):
I could be convinced either way, but I'd lean towards no. The impact of the issue is low -- rare occurrences of "Set up job" stuck actions in the face of network unreliability, no data loss, no blocker to functional usage. We'll end up having an argument over developer confusion in backporting a v15 schema migration into v14. 🤣
On the other hand, in support of a backport, I'm always a fan of getting bugfixes in the hands of users.
I think the only reason to apply this early to Codeberg is that it's where we've observed the problem, the environment has known network outages, and those factors make it a good experimental environment to see if it solves the problem.