# Jobs retries and checkpoints best practices

Last updated (UTC): 2025-08-21.

Individual job tasks, or even entire job executions, can fail for a variety of
reasons. This page contains best practices for handling these failures,
centered on task retries and job checkpointing.

Use task retries
----------------

Individual job tasks can fail for a variety of reasons, including issues with
application dependencies, quotas, or even internal system events. Such issues
are often transient, and the task will succeed after a retry.

By default, each task automatically retries up to 3 times. This helps ensure
that a job runs to completion even if it encounters transient task failures.
You can also
[customize the maximum number of retries](/run/docs/configuring/max-retries).
However, if you do change the default, you should specify at least one retry.

### Plan for job task restarts

Make your jobs [idempotent](https://en.wikipedia.org/wiki/Idempotence),
so that a task restart does not result in corrupt or duplicate output.
That is, write repeatable logic that behaves the same for a given set of
inputs no matter how many times it is repeated or when it is executed.

Write your output to a different location than the input data, leaving the
input data intact. This way, if the job runs again, it can repeat the process
from the beginning and get the same result.

Avoid duplicating output data by reusing the same unique identifier for each
write, or by checking whether the output already exists.
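The idempotent-output pattern can be sketched as follows. This is a minimal illustration, not Cloud Run or Cloud Storage API code: a local directory stands in for a storage bucket, and the function name, file-naming scheme, and task ID are all hypothetical.

```python
import os
import tempfile

def write_output_idempotently(out_dir, task_id, data):
    """Write a result keyed by a stable unique identifier (the task ID).

    Returns True if this call produced the output, or False if a previous
    attempt already wrote it, so a retried task does not duplicate work.
    """
    out_path = os.path.join(out_dir, f"result-{task_id}.txt")
    if os.path.exists(out_path):   # a retried task finds the earlier output
        return False               # and skips the duplicate write
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w") as f:  # write to a temporary name, then rename,
        f.write(data)               # so readers never see a partial result
    os.replace(tmp_path, out_path)
    return True

# A retry with the same identifier is a no-op:
work_dir = tempfile.mkdtemp()
first = write_output_idempotently(work_dir, "7", "computed value")
retry = write_output_idempotently(work_dir, "7", "computed value")
```

Here the second call returns `False` and leaves the first result untouched, which is the behavior a restarted task needs.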
Duplicate data represents *collection-level* data corruption.

Use checkpointing
-----------------

Where possible, checkpoint your jobs so that if a task restarts after a
failure, it can pick up where it left off instead of restarting work from the
beginning. Doing this speeds up your jobs and minimizes unnecessary costs.

Periodically write partial results and an indication of progress to a
persistent storage location such as Cloud Storage or a database. When your
task starts, look for partial results. If partial results are found, resume
processing from that point.

If your job does not lend itself to checkpointing, consider breaking it up
into smaller chunks and running a larger number of tasks.

### Checkpointing example 1: calculating Pi

Suppose you have a job that executes a recursive algorithm, such as
calculating Pi to many decimal places, with parallelism set to 1:

- Write your progress every 10 minutes, or at whatever interval your tolerance for lost work allows, to a `pi-progress.txt` Cloud Storage object.
- When a task starts, query the `pi-progress.txt` object and load the value as a starting place. Use that value as the initial input to your function.
- Write your final result to Cloud Storage as an object named `pi-complete.txt` to avoid duplication through parallel or repeated execution, or `pi-complete-DATE.txt` to differentiate results by completion date.

### Checkpointing example 2: processing 10,000 records from Cloud SQL

Suppose you have a job processing 10,000 records in a relational database such
as Cloud SQL:

- Retrieve the records to be processed with a SQL query such as `SELECT * FROM example_table LIMIT 10000`.
- Write out updated records in batches of 100 so that significant processing work is not lost on interruption.
- When records are written, note which ones have been processed.
You might add a boolean `processed` column to the table, set to `1` only after processing is confirmed.
- When a task starts, add the condition `processed = 0` to the query that retrieves items for processing.
- In addition to clean retries, this technique supports breaking the work up into smaller tasks, such as by modifying the query to select 100 records at a time: `LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100`, and running 100 tasks to process all 10,000 records. `CLOUD_RUN_TASK_INDEX` is a built-in environment variable present inside the container running Cloud Run jobs.

Putting all these pieces together, the final query might look like this:
`SELECT * FROM example_table WHERE processed = 0 LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100`

What's next
-----------

- To create a Cloud Run job, see [Create jobs](/run/docs/create-jobs).
- To execute a job, see [Execute jobs](/run/docs/execute/jobs).
- To execute a job on a schedule, see [Execute jobs on a schedule](/run/docs/execute/jobs-on-schedule).
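The checkpoint-and-resume pattern from the examples above can be sketched as follows. This is a minimal illustration under stated assumptions, not Cloud Run API code: a local JSON file stands in for a persistent store such as a Cloud Storage object, and the function name, checkpoint file name, and simulated failure are all hypothetical.

```python
import json
import os
import tempfile

def run_with_checkpoint(checkpoint_path, total, fail_after=None):
    """Process `total` units of work, persisting progress after each unit
    so that a restarted task resumes where the last attempt left off.
    `fail_after` simulates a task failure partway through, for illustration.
    """
    done = 0
    if os.path.exists(checkpoint_path):       # look for partial results
        with open(checkpoint_path) as f:      # on startup
            done = json.load(f)["done"]
    while done < total:
        if fail_after is not None and done >= fail_after:
            raise RuntimeError("simulated task failure")
        done += 1                             # one unit of real work
        with open(checkpoint_path, "w") as f:  # record progress so far
            json.dump({"done": done}, f)
    return done

ckpt = os.path.join(tempfile.mkdtemp(), "pi-progress.json")
try:
    run_with_checkpoint(ckpt, total=10, fail_after=4)  # first attempt fails
except RuntimeError:
    pass
resumed = run_with_checkpoint(ckpt, total=10)  # the retry resumes at unit 4
```

On the retry, only the remaining six units of work are processed, because the checkpoint recorded that four were already complete before the failure.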