# BlueKing Job Platform Deployment and Maintenance Documentation

English | 简体中文

## I. Deployment

The BlueKing Job Platform (Job) is one of the BlueKing atomic platforms. Its underlying dependencies include:

### 1. Strong Dependencies

The following systems are strong dependencies of Job. If they are missing or malfunctioning, the core functionalities of Job will be unavailable:

- PaaS: Job calls its interfaces to integrate with the unified login service;
- User Management: Job calls its interfaces to obtain user information;
- Permission Center: Job calls its interfaces to obtain permission policies for authenticating user operations;
- APIGateway: Job calls underlying platform interfaces through APIGateway and exposes its own interfaces to other upper-level platforms through it;
- CMDB: Job calls its interfaces to obtain business, host, and other resource data;
- GSE: Job calls its interfaces to dispatch script execution and file distribution tasks and to obtain task execution results and logs.

### 2. Weak Dependencies

The following systems are weak dependencies of Job. If they are missing or malfunctioning, some functionalities of Job will be unavailable:

- BlueKing Artifact Repository: Job calls its interfaces to store local files, import/export data, and temporary files generated by execution log export;
- Audit Center: Job outputs audit data to log files for collection, which are used to audit user operations;
- Log Platform: Job's backend log files are collected by its collector, enabling unified search on the log platform;
- Monitoring Platform: Job exposes metric data through the Prometheus protocol, which is collected by its collector, so that metric dashboards and alerts can be configured on the monitoring platform; in addition, Job reports call-chain APM data to its interfaces through the OTEL protocol, enabling call-chain observation and analysis on the monitoring platform;
- Message Notification Center: Job calls its interfaces to obtain notification and announcement information and displays it on the page.

Because of these numerous dependencies, deploying Job on its own will not result in a working system; its underlying platforms must be deployed first. Please refer to the basic package content in the deployment and maintenance section of the BlueKing Documentation Center for deployment instructions.

## II. Maintenance

### 1. Resource Usage Management

After Job has been deployed, usage data accumulates as the system runs, and Job's resource consumption grows accordingly. For cost and performance reasons, resource usage needs to be managed and expired data cleaned up as needed.

#### (1) MySQL Data

Different functional modules of Job use different databases to store data. The job_execute database used by the task execution engine needs special attention: it stores the flow data generated during task execution, and its data volume grows with the number of tasks. Usually, after the platform has been running for a while, job_execute becomes the largest consumer of storage space and requires data archiving and cleanup.

Job supports an automatic archiving and cleanup mechanism for task flow data, configured through the backupConfig.archive properties in the helm chart values. The default policy is to not enable archiving and to retain all data permanently. If expired data needs to be cleaned up, the policy can be set to delete data directly once it passes the retention period (30 days by default) without backing it up; if expired data needs to be preserved, the policy can be set to back up the data before deleting it.
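As a rough sketch, the archive policy could be overridden at install or upgrade time with values overrides. The release name, chart reference, namespace, and the sub-keys under backupConfig.archive used below are illustrative assumptions only; the authoritative property names are in the chart's values.yaml.

```shell
# Hypothetical example: enable archiving of task flow data when upgrading the release.
# Release name, chart, namespace, and the sub-keys under backupConfig.archive are
# assumptions -- check the chart's values.yaml for the actual property names.
helm upgrade bk-job blueking/bk-job \
  --namespace blueking \
  --reuse-values \
  --set backupConfig.archive.enabled=true \
  --set backupConfig.archive.keepDays=30
```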

#### (2) MongoDB Data

Job uses MongoDB to store the business script execution logs and file distribution logs generated during task execution. They are stored in the joblog database by default (configurable through the related properties in the helm chart values), and the log data volume grows with the number of tasks. Usually, after the platform has been running for a while, the joblog database becomes the largest consumer of storage space and requires data archiving and cleanup.

Job organizes and stores log data by task step type (script/file) and by date. Script task logs are stored in the job_log_script_{date} collection and file task logs in the job_log_file_{date} collection, where {date} is the task creation date, for example job_log_script_2024_07_19 or job_log_file_2024_07_20.

Job currently does not provide an automatic archiving and cleanup mechanism for this log data. Users need to manually run commands or write scripts to export/delete the collections that need to be processed, for example as sketched below.
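A minimal sketch of one way to do this with the standard MongoDB tools, assuming direct access to the MongoDB instance; the connection URI, credentials, backup path, and collection date are placeholders to adapt:

```shell
# Export an expired log collection before removing it (URI, credentials, output
# path, and the collection date are placeholders).
mongodump --uri="mongodb://<user>:<password>@<mongodb-host>:27017/joblog" \
  --collection=job_log_script_2024_07_19 --out=/data/backup/joblog

# After verifying the export, drop the collection to free storage space.
mongosh "mongodb://<user>:<password>@<mongodb-host>:27017/joblog" \
  --eval 'db.getCollection("job_log_script_2024_07_19").drop()'
```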

#### (3) BlueKing Artifact Repository Data

Job uses the BlueKing Artifact Repository to store local files uploaded by users, import/export data, and temporary files generated during execution log export. The BlueKing Artifact Repository project used by default is bkjob (configurable through the related properties in the helm chart values), and the repositories used are localupload (local files), backup (import/export temporary files), and filedata (execution log export temporary files). Files in the localupload repository are retained permanently as long as they are referenced by the local file distribution step of a job/execution plan; unreferenced files are cleaned up automatically after the retention period, which is 7 days by default (configurable through the related properties in the helm chart values). Files in the backup and filedata repositories are temporary files; if they are not cleaned up automatically by the system, they can be cleaned up as needed.

#### (4) Other Data

- Backend Logs: Backend logs generated during system operation are written to files inside the containers by default and are cleaned up automatically once they exceed the capacity limit or retention period, so manual handling is usually not required. Capacity and retention can be configured through the log properties in the helm chart values.
- Log Platform Data: If log collection is configured (properties under bkLogConfig in the helm chart values), the log platform collects backend logs through its collector and stores them in its backend storage components for search. Cleanup of this data needs to be configured on the log platform.
- Monitoring Metric Data: If monitoring data collection is configured (properties under serviceMonitor in the helm chart values), the monitoring platform collects backend service metric data through its collector so that it can be viewed on dashboards or used to configure alerts. Cleanup of this data needs to be configured on the monitoring platform.
- APM Data: If APM data collection is configured (properties under job.trace in the helm chart values), Job actively reports backend service call-chain data to the monitoring platform through the OTEL protocol for visual call-chain analysis. Cleanup of this data needs to be configured on the monitoring platform (a combined example of these three switches is sketched after this list).
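For illustration only, the three collection switches above might be toggled together with values overrides; the exact sub-keys (such as enabled) are assumptions, so check the chart's values.yaml for the real structure before applying:

```shell
# Hypothetical example: enable log collection, metric collection, and APM reporting.
# The "enabled" sub-keys below are assumptions; only the top-level property names
# (bkLogConfig, serviceMonitor, job.trace) come from the documentation above.
helm upgrade bk-job blueking/bk-job \
  --namespace blueking \
  --reuse-values \
  --set bkLogConfig.enabled=true \
  --set serviceMonitor.enabled=true \
  --set job.trace.enabled=true
```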

### 2. System Exception Troubleshooting and Recovery

#### (1) Service Status Check

After Job has been deployed, system administrators can check the status of each service module and instance on the "Platform Management - Service Status" page of the web UI. Normally, all instances should be in the "Normal" state; if not, the corresponding instance needs further troubleshooting.

#### (2) Pod Status Check

In the namespace where Job is deployed, you can use the kubectl get pod command together with the "bk-job" keyword to filter out all of Job's Pods. Normally, all of Job's Pods should be in the Running/Completed state; if not, you need to inspect the service logs inside the corresponding Pod or try restarting the Pod.
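For example (the namespace below is a placeholder for the namespace Job is actually deployed in):

```shell
# List all Job Pods and check their status (replace "blueking" with the actual namespace).
kubectl get pod -n blueking | grep bk-job

# Inspect an abnormal Pod in more detail, or restart it by deleting it so its
# Deployment/StatefulSet recreates it.
kubectl describe pod <pod-name> -n blueking
kubectl delete pod <pod-name> -n blueking
```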

#### (3) Service Log Analysis

In a containerized deployment you can enter the containers of each service Pod to view the service logs (see the example after the list below). Each Job service stores its logs under /data/logs/{service_name}/; for example, the execution engine's logs are under /data/logs/job-execute/. The logs in that directory are split into multiple files by category and roll over by time. The purposes of the various log files are described below:
- error.log: Summary of error logs; these entries also appear in the other regular log files and deserve special attention;
- {service_name}.log: Main service log, e.g. execute.log, containing the service's core business logic logs; deserves special attention;
- openapi.log: Call logs of the APIs (ESB, APIGW, etc.) provided by Job;
- schedule.log: Task scheduling logs of the execution engine;
- monitor_task.log: Monitoring logs for large tasks (those involving a large number of hosts);
- sync_app_host.log: Monitoring logs for the tasks and events that synchronize businesses (sets), hosts, and other CMDB resources;
- access.log: Gateway interface access logs;
- esb_access.log: Gateway ESB interface access logs;
- cmdb.log: Logs of calls to CMDB-related interfaces;
- gse.log: Logs of calls to GSE-related interfaces;
- iam.log: Logs of calls to Permission Center (IAM) related interfaces;
- paas.log: Logs of calls to PaaS platform related interfaces (login, user, etc.);
- notice.log: Logs of calls to Message Notification Center related interfaces (notifications, announcements, etc.);
- audit_event.log: Audit logs, collected by the Audit Center through its collector if configured;
- gc.log*: JVM garbage collection logs, which only need attention when JVM-level performance issues occur.
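As an example, you could open a shell inside the execution engine's container and inspect these files directly (the Pod name and namespace are placeholders):

```shell
# Open a shell in the job-execute container (use sh if bash is unavailable).
kubectl exec -it <job-execute-pod-name> -n blueking -- bash

# Inside the container, check recent errors and follow the main service log.
tail -n 200 /data/logs/job-execute/error.log
tail -f /data/logs/job-execute/execute.log
```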

Log Analysis Method:
Most service logs of Job contain traceId and spanId, located after the log level in a log line. Here is an example for illustration:

[2024-07-19 19:00:00.001]  INFO [job-execute,84a2e275a062728b9b54aab16dea4289,4f786ec86329458a] 6 --- [execute-scheduler-3] e.e.r.h.ResultHandleTaskKeepaliveManager : Refresh task keepalive info start...

In the log above, the traceId is 84a2e275a062728b9b54aab16dea4289 and the spanId is 4f786ec86329458a. Using the traceId, you can filter out all logs generated by a single request (this works across services, but does not cover gateway logs); using the spanId, you can filter out the logs of a single processing stage. To troubleshoot an issue quickly, first find the traceId of the relevant core flow through business-specific information (such as taskInstanceId, stepInstanceId, etc.), then use the traceId to filter out all logs of that flow for analysis, as sketched below. It is recommended to configure log collection and cleansing and to use the log platform for efficient log retrieval.
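For instance, inside a service container you could first locate the traceId via a business identifier and then pull every line of the same request. The exact key format of the business identifier in the log is an assumption; the traceId shown is the one from the sample log above.

```shell
# Find the traceId of the request that handled a given step instance
# (the "stepInstanceId" key format and value are placeholders).
grep "stepInstanceId=100" /data/logs/job-execute/execute.log

# Then filter all logs of that request by its traceId across the service's log files.
grep -r "84a2e275a062728b9b54aab16dea4289" /data/logs/job-execute/
```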

In the responses of Job's interfaces (Web, Service, ESB interfaces, etc.), the traceId is returned as a separate field (requestId for Web and Service interfaces, job_request_id for ESB interfaces), so you can also obtain the traceId from an interface response and use it for log retrieval.