As a test, I am using a train_manifest and a validation_manifest that are identical and contain only one file:
{"source-ref": "s3:///bad_ofs/Images_final/Crushing/iO/A_2208040CA2_1430_220804-205516.jpg", "bounding-box": {"annotations": [{"class_id": 0, "top": 750, "left": 7000, "height": 450, "width": 5500}, {"class_id": 0, "top": 3000, "left": 7000, "height": 500, "width": 5500}]}, "bounding-box-metadata": {"objects": [{"confidence": 1.0}, {"confidence": 1.0}], "class-map": {"0": "good"}, "type": "groundtruth/object-detection", "human-annotated": "yes", "creation-date": "2024-11-05T00:00:00", "job-name": "labeling-job/bounding-box"}}
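Before digging into the training job itself, it can help to sanity-check each manifest line locally. The sketch below is only an illustration: `check_manifest_line` and its parameters are hypothetical names, and the set of checks is my assumption about what is worth verifying (a parseable JSON line, an `s3://` URI with a non-empty bucket, the box keys shown in the line above, and `class_id` values inside the configured number of classes). The bucket looks like it was stripped from the posted line, so the empty-bucket check only matters if the real file looks the same.

```python
import json


def check_manifest_line(line, label_attr="bounding-box", num_classes=None):
    """Return a list of human-readable problems found in one manifest line.

    Hypothetical helper: the checks are assumptions, not an official schema.
    """
    problems = []
    record = json.loads(line)  # raises ValueError if the line is not valid JSON

    # The image reference should be an s3:// URI with a non-empty bucket.
    source = record.get("source-ref", "")
    prefix = "s3://"
    if not source.startswith(prefix):
        problems.append("source-ref is not an s3:// URI")
    else:
        bucket = source[len(prefix):].split("/", 1)[0]
        if not bucket:
            problems.append("source-ref has an empty bucket name")

    # The label attribute must match the name passed in AttributeNames.
    label = record.get(label_attr)
    if not isinstance(label, dict):
        problems.append(f"missing '{label_attr}' attribute")
        return problems

    # Each box should carry the keys used in the posted manifest line.
    for i, box in enumerate(label.get("annotations", [])):
        for key in ("class_id", "top", "left", "height", "width"):
            if key not in box:
                problems.append(f"annotation {i} lacks '{key}'")
        if num_classes is not None and box.get("class_id", 0) >= num_classes:
            problems.append(
                f"annotation {i} has class_id outside 0..{num_classes - 1}"
            )
    return problems


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not the bucket from the question.
    sample = (
        '{"source-ref": "s3://example-bucket/img.jpg", "bounding-box": '
        '{"annotations": [{"class_id": 0, "top": 750, "left": 7000, '
        '"height": 450, "width": 5500}]}}'
    )
    print(check_manifest_line(sample, num_classes=2))
```

Running it over every line of the manifest and printing only the non-empty results is usually enough to rule the file itself in or out as the cause.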
When I try to train the model, I get the following error:
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: object-detection-2024-11-06-11-11-07-815
----------------------------------------------
Train Input Config: {'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://agilent-aws-tmp-12-data/bad_ofs/Images_final/Crushing/train_manifest.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'bounding-box']}}, 'ContentType': 'application/x-image', 'InputMode': 'Pipe'}
----------------------------------------------
Validation Input Config: {'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://agilent-aws-tmp-12-data/bad_ofs/Images_final/Crushing/validation_manifest.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'bounding-box']}}, 'ContentType': 'application/x-image', 'InputMode': 'Pipe'}
----------------------------------------------
2024-11-06 11:11:10 Starting - Starting the training job...
2024-11-06 11:11:23 Starting - Preparing the instances for training...
2024-11-06 11:12:08 Downloading - Downloading the training image...............
2024-11-06 11:14:40 Training - Training image download completed. Training in progress...Docker entrypoint called with argument(s): train
Running default environment configuration script
Nvidia gpu devices, drivers and cuda toolkit versions (only available on hosts with GPU):
Wed Nov 6 11:14:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 23C P8 9W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Checking for nvidia driver and cuda compatibility.
CUDA Compatibility driver provided.
Proceeding with compatibility check between driver, cuda-toolkit and cuda-compat.
Detected cuda-toolkit version: 11.1.
Detected cuda-compat version: 455.32.00.
Detected Nvidia driver version: 550.127.05.
Nvidia driver compatible with cuda-toolkit. Disabling cuda-compat.
Running custom environment configuration script
/opt/amazon/lib/python3.8/site-packages/mxnet/model.py:97: SyntaxWarning: "is" with a literal. Did you mean "=="?
if num_device is 1 and 'dist' not in kvstore:
[11/06/2024 11:14:52 INFO 140179896461120] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/default-input.json: {'base_network': 'vgg-16', 'use_pretrained_model': '0', 'num_classes': '', 'mini_batch_size': '32', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'sgd', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
[11/06/2024 11:14:52 INFO 140179896461120] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'base_network': 'resnet-50', 'epochs': '30', 'image_shape': '300', 'learning_rate': '0.001', 'mini_batch_size': '16', 'nms_threshold': '0.45', 'num_classes': '2', 'num_training_samples': '1000', 'optimizer': 'adam', 'overlap_threshold': '0.5', 'use_pretrained_model': '1'}
[11/06/2024 11:14:52 INFO 140179896461120] Final configuration: {'base_network': 'resnet-50', 'use_pretrained_model': '1', 'num_classes': '2', 'mini_batch_size': '16', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'adam', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '1000', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
Process 13 is a worker.
[11/06/2024 11:14:52 INFO 140179896461120] Using default worker.
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator application/x-image for content type ('application/x-image', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator application/x-recordio for content type ('application/x-recordio', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator image/jpeg for content type ('image/jpeg', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator image/png for content type ('image/png', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Checkpoint loading and saving are disabled.
[11/06/2024 11:14:52 INFO 140179896461120] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[11/06/2024 11:14:52 INFO 140179896461120] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[11/06/2024 11:14:52 ERROR 140179896461120] Customer Error: train channel is not specified.
2024-11-06 11:15:04 Uploading - Uploading generated training model
2024-11-06 11:15:04 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
Cell In[118], line 70
54 od_model.set_hyperparameters(
55 base_network="resnet-50",
56 use_pretrained_model=1,
(...)
65 num_training_samples=1000
66 )
68 # Start the training job with both train and validation channels
69 # od_model.fit({"train": train_input, "validation": validation_input})
---> 70 od_model.fit({"train": train_input, "validation": validation_input})
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
342 return context
344 return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/estimator.py:1376, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
1374 forward_to_mlflow_tracking_server = True
1375 if wait:
-> 1376 self.latest_training_job.wait(logs=logs)
1377 if forward_to_mlflow_tracking_server:
1378 log_sagemaker_job_to_mlflow(self.latest_training_job.name)
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/estimator.py:2750, in _TrainingJob.wait(self, logs)
2748 # If logs are requested, call logs_for_jobs.
2749 if logs != "None":
-> 2750 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
2751 else:
2752 self.sagemaker_session.wait_for_job(self.job_name)
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:5945, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
5924 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
5925 """Display logs for a given training job, optionally tailing them until job is complete.
5926
5927 If the output is a tty or a Jupyter cell, it will be color-coded
(...)
5943 exceptions.UnexpectedStatusException: If waiting and the training job fails.
5944 """
-> 5945 _logs_for_job(self, job_name, wait, poll, log_type, timeout)
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:8547, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
8544 last_profiler_rule_statuses = profiler_rule_statuses
8546 if wait:
-> 8547 _check_job_status(job_name, description, "TrainingJobStatus")
8548 if dot:
8549 print()
File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:8611, in _check_job_status(job, desc, status_key_name)
8605 if "CapacityError" in str(reason):
8606 raise exceptions.CapacityError(
8607 message=message,
8608 allowed_statuses=["Completed", "Stopped"],
8609 actual_status=status,
8610 )
-> 8611 raise exceptions.UnexpectedStatusException(
8612 message=message,
8613 allowed_statuses=["Completed", "Stopped"],
8614 actual_status=status,
8615 )
UnexpectedStatusException: Error for Training job object-detection-2024-11-06-11-11-07-815: Failed. Reason: ClientError: train channel is not specified., exit code: 2. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/l ... oting.html
The code looks like this:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve

# Initialize the session and role
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Specify the S3 bucket and manifest file path
bucket_name = ""
train_manifest_s3_key = "bad_ofs/Images_final/Crushing/train_manifest.json"
validation_manifest_s3_key = "bad_ofs/Images_final/Crushing/validation_manifest.json"

# Define TrainingInput for training and validation data
train_input = TrainingInput(
    s3_data=f"s3://{bucket_name}/{train_manifest_s3_key}",
    content_type="application/x-image",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "bounding-box"],
    input_mode="Pipe"
)
validation_input = TrainingInput(
    s3_data=f"s3://{bucket_name}/{validation_manifest_s3_key}",
    content_type="application/x-image",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "bounding-box"],
    input_mode="Pipe"
)
print('----------------------------------------------')
print("Train Input Config:", train_input.config)
print('----------------------------------------------')
print("Validation Input Config:", validation_input.config)
print('----------------------------------------------')

# Retrieve the Docker container for object detection
container = retrieve("object-detection", sagemaker_session.boto_region_name)

# Define the estimator for SageMaker
od_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    volume_size=50,
    max_run=3600,
    output_path=f"s3://{bucket_name}/output",
    sagemaker_session=sagemaker_session
)

# Set hyperparameters for object detection
od_model.set_hyperparameters(
    base_network="resnet-50",
    use_pretrained_model=1,
    num_classes=2,
    mini_batch_size=16,
    epochs=30,
    learning_rate=0.001,
    optimizer="adam",
    overlap_threshold=0.5,
    nms_threshold=0.45,
    image_shape=300,
    num_training_samples=1000
)

# Start the training job with both train and validation channels
# od_model.fit({"train": train_input, "validation": validation_input})
od_model.fit({"train": train_input, "validation": validation_input})
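One thing that may be worth checking, and this is an assumption on my part rather than something confirmed in this thread: the AWS sample for training the built-in object detection algorithm from an augmented manifest in Pipe mode declares the channels as RecordIO-wrapped (`ContentType` of `application/x-recordio` plus `RecordWrapperType` of `RecordIO`), whereas the config printed in the log above uses `application/x-image` with no record wrapper. The sketch below builds such a channel as a plain dict in the shape of a `CreateTrainingJob` `InputDataConfig` entry; the helper name `augmented_manifest_channel` and the bucket are hypothetical.

```python
def augmented_manifest_channel(name, manifest_uri, attribute_names):
    """Build one InputDataConfig entry for an augmented manifest in Pipe mode.

    Hypothetical helper. The RecordIO settings are an assumption about what
    the built-in algorithm expects, not a confirmed fix for this error.
    """
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_uri,
                "S3DataDistributionType": "FullyReplicated",
                # Must match the attribute names used in each manifest line.
                "AttributeNames": list(attribute_names),
            }
        },
        # Assumed settings: RecordIO-wrapped records instead of raw images.
        "ContentType": "application/x-recordio",
        "RecordWrapperType": "RecordIO",
        "CompressionType": "None",
        "InputMode": "Pipe",
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not the bucket from the question.
    channel = augmented_manifest_channel(
        "train",
        "s3://example-bucket/bad_ofs/Images_final/Crushing/train_manifest.json",
        ["source-ref", "bounding-box"],
    )
    print(channel["ContentType"], channel["RecordWrapperType"])
```

With the SageMaker Python SDK, the corresponding change would be to pass `content_type="application/x-recordio"` and `record_wrapping="RecordIO"` to each `TrainingInput` and to leave the other arguments as they are.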
More details here: https://stackoverflow.com/questions/791 ... file-error