AWS SageMaker ClientError: train channel is not specified (manifest file error)

Posted by Anonymous » 06 Nov 2024, 14:43

As a test, I am training with a train_manifest and a validation_manifest that are identical and contain only a single file:
{"source-ref": "s3:///bad_ofs/Images_final/Crushing/iO/A_2208040CA2_1430_220804-205516.jpg", "bounding-box": {"annotations": [{"class_id": 0, "top": 750, "left": 7000, "height": 450, "width": 5500}, {"class_id": 0, "top": 3000, "left": 7000, "height": 500, "width": 5500}]}, "bounding-box-metadata": {"objects": [{"confidence": 1.0}, {"confidence": 1.0}], "class-map": {"0": "good"}, "type": "groundtruth/object-detection", "human-annotated": "yes", "creation-date": "2024-11-05T00:00:00", "job-name": "labeling-job/bounding-box"}}
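Before a training job is submitted, a manifest like the one above can be sanity-checked locally. The following is a minimal sketch, assuming the manifest is a JSON Lines file in which every line carries the two attributes later passed as `attribute_names` (`source-ref` and `bounding-box`). The function names and the specific checks are hypothetical and are not the validation SageMaker itself performs. Note that the line shown above has `s3:///` with no bucket name, which may simply be redaction in the post; the sketch flags that case.

```python
import json

# Attribute names used in the post's TrainingInput; adjust if yours differ.
REQUIRED_ATTRIBUTES = ("source-ref", "bounding-box")
BOX_KEYS = ("class_id", "top", "left", "height", "width")


def check_manifest_line(line, required=REQUIRED_ATTRIBUTES):
    """Return a list of human-readable problems found in one manifest line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = [f"missing attribute {name!r}" for name in required if name not in record]

    # The image reference should be a full S3 URI with a bucket name.
    if "source-ref" in record:
        source = str(record["source-ref"])
        if not source.startswith("s3://"):
            problems.append("source-ref is not an s3:// URI")
        elif not source[len("s3://"):].split("/", 1)[0]:
            problems.append("source-ref has an empty bucket name")

    # Every annotation should carry a class id and a complete box.
    label = record.get("bounding-box")
    if isinstance(label, dict):
        for index, box in enumerate(label.get("annotations", [])):
            for key in BOX_KEYS:
                if key not in box:
                    problems.append(f"annotation {index} is missing {key!r}")
    return problems


def check_manifest(text):
    """Check every non-empty line; return {line_number: problems} for bad lines."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            problems = check_manifest_line(line)
            if problems:
                report[number] = problems
    return report
```

Running `check_manifest` on the downloaded contents of `train_manifest.json` returns an empty dict when no line trips these checks. Counting the non-empty lines the same way also gives a value to compare against the `num_training_samples` hyperparameter.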

When I try to train the model, I get the following error:
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: object-detection-2024-11-06-11-11-07-815
----------------------------------------------
Train Input Config: {'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://agilent-aws-tmp-12-data/bad_ofs/Images_final/Crushing/train_manifest.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'bounding-box']}}, 'ContentType': 'application/x-image', 'InputMode': 'Pipe'}
----------------------------------------------
Validation Input Config: {'DataSource': {'S3DataSource': {'S3DataType': 'AugmentedManifestFile', 'S3Uri': 's3://agilent-aws-tmp-12-data/bad_ofs/Images_final/Crushing/validation_manifest.json', 'S3DataDistributionType': 'FullyReplicated', 'AttributeNames': ['source-ref', 'bounding-box']}}, 'ContentType': 'application/x-image', 'InputMode': 'Pipe'}
----------------------------------------------
2024-11-06 11:11:10 Starting - Starting the training job...
2024-11-06 11:11:23 Starting - Preparing the instances for training...
2024-11-06 11:12:08 Downloading - Downloading the training image...............
2024-11-06 11:14:40 Training - Training image download completed. Training in progress...Docker entrypoint called with argument(s): train
Running default environment configuration script
Nvidia gpu devices, drivers and cuda toolkit versions (only available on hosts with GPU):
Wed Nov 6 11:14:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   23C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Checking for nvidia driver and cuda compatibility.
CUDA Compatibility driver provided.
Proceeding with compatibility check between driver, cuda-toolkit and cuda-compat.
Detected cuda-toolkit version: 11.1.
Detected cuda-compat version: 455.32.00.
Detected Nvidia driver version: 550.127.05.
Nvidia driver compatible with cuda-toolkit. Disabling cuda-compat.
Running custom environment configuration script
/opt/amazon/lib/python3.8/site-packages/mxnet/model.py:97: SyntaxWarning: "is" with a literal. Did you mean "=="?
if num_device is 1 and 'dist' not in kvstore:
[11/06/2024 11:14:52 INFO 140179896461120] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/default-input.json: {'base_network': 'vgg-16', 'use_pretrained_model': '0', 'num_classes': '', 'mini_batch_size': '32', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'sgd', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
[11/06/2024 11:14:52 INFO 140179896461120] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'base_network': 'resnet-50', 'epochs': '30', 'image_shape': '300', 'learning_rate': '0.001', 'mini_batch_size': '16', 'nms_threshold': '0.45', 'num_classes': '2', 'num_training_samples': '1000', 'optimizer': 'adam', 'overlap_threshold': '0.5', 'use_pretrained_model': '1'}
[11/06/2024 11:14:52 INFO 140179896461120] Final configuration: {'base_network': 'resnet-50', 'use_pretrained_model': '1', 'num_classes': '2', 'mini_batch_size': '16', 'epochs': '30', 'learning_rate': '0.001', 'lr_scheduler_step': '', 'lr_scheduler_factor': '0.1', 'optimizer': 'adam', 'momentum': '0.9', 'weight_decay': '0.0005', 'overlap_threshold': '0.5', 'nms_threshold': '0.45', 'num_training_samples': '1000', 'image_shape': '300', '_tuning_objective_metric': '', '_kvstore': 'device', 'kv_store': 'device', '_num_kv_servers': 'auto', 'label_width': '350', 'freeze_layer_pattern': '', 'nms_topk': '400', 'early_stopping': 'False', 'early_stopping_min_epochs': '10', 'early_stopping_patience': '5', 'early_stopping_tolerance': '0.0', '_begin_epoch': '0'}
Process 13 is a worker.
[11/06/2024 11:14:52 INFO 140179896461120] Using default worker.
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator application/x-image for content type ('application/x-image', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator application/x-recordio for content type ('application/x-recordio', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator image/jpeg for content type ('image/jpeg', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Loaded iterator creator image/png for content type ('image/png', '1.0')
[11/06/2024 11:14:52 INFO 140179896461120] Checkpoint loading and saving are disabled.
[11/06/2024 11:14:52 INFO 140179896461120] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[11/06/2024 11:14:52 INFO 140179896461120] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[11/06/2024 11:14:52 ERROR 140179896461120] Customer Error: train channel is not specified.

2024-11-06 11:15:04 Uploading - Uploading generated training model
2024-11-06 11:15:04 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Cell In[118], line 70
54 od_model.set_hyperparameters(
55     base_network="resnet-50",
56     use_pretrained_model=1,
(...)
65     num_training_samples=1000
66 )
68 # Start the training job with both train and validation channels
69 # od_model.fit({"train": train_input, "validation": validation_input})
---> 70 od_model.fit({"train": train_input, "validation": validation_input})

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
342         return context
344     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/estimator.py:1376, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
1374     forward_to_mlflow_tracking_server = True
1375 if wait:
-> 1376     self.latest_training_job.wait(logs=logs)
1377 if forward_to_mlflow_tracking_server:
1378     log_sagemaker_job_to_mlflow(self.latest_training_job.name)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/estimator.py:2750, in _TrainingJob.wait(self, logs)
2748 # If logs are requested, call logs_for_jobs.
2749 if logs != "None":
-> 2750     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
2751 else:
2752     self.sagemaker_session.wait_for_job(self.job_name)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:5945, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
5924 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
5925     """Display logs for a given training job, optionally tailing them until job is complete.
5926
5927     If the output is a tty or a Jupyter cell, it will be color-coded
(...)
5943         exceptions.UnexpectedStatusException: If waiting and the training job fails.
5944     """
-> 5945     _logs_for_job(self, job_name, wait, poll, log_type, timeout)

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:8547, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
8544             last_profiler_rule_statuses = profiler_rule_statuses
8546 if wait:
-> 8547     _check_job_status(job_name, description, "TrainingJobStatus")
8548     if dot:
8549         print()

File ~/anaconda3/envs/tensorflow2_p310/lib/python3.10/site-packages/sagemaker/session.py:8611, in _check_job_status(job, desc, status_key_name)
8605 if "CapacityError" in str(reason):
8606     raise exceptions.CapacityError(
8607         message=message,
8608         allowed_statuses=["Completed", "Stopped"],
8609         actual_status=status,
8610     )
-> 8611 raise exceptions.UnexpectedStatusException(
8612     message=message,
8613     allowed_statuses=["Completed", "Stopped"],
8614     actual_status=status,
8615 )

UnexpectedStatusException: Error for Training job object-detection-2024-11-06-11-11-07-815: Failed. Reason: ClientError: train channel is not specified., exit code: 2. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html

The code looks like this:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve

# Initialize the session and role
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Specify the S3 bucket and manifest file path
bucket_name = ""
train_manifest_s3_key = "bad_ofs/Images_final/Crushing/train_manifest.json"
validation_manifest_s3_key = "bad_ofs/Images_final/Crushing/validation_manifest.json"

# Define TrainingInput for training and validation data
train_input = TrainingInput(
    s3_data=f"s3://{bucket_name}/{train_manifest_s3_key}",
    content_type="application/x-image",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "bounding-box"],
    input_mode="Pipe",
)
validation_input = TrainingInput(
    s3_data=f"s3://{bucket_name}/{validation_manifest_s3_key}",
    content_type="application/x-image",
    s3_data_type="AugmentedManifestFile",
    attribute_names=["source-ref", "bounding-box"],
    input_mode="Pipe",
)

print('----------------------------------------------')
print("Train Input Config:", train_input.config)
print('----------------------------------------------')
print("Validation Input Config:", validation_input.config)
print('----------------------------------------------')

# Retrieve the Docker container for object detection
container = retrieve("object-detection", sagemaker_session.boto_region_name)

# Define the estimator for SageMaker
od_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    volume_size=50,
    max_run=3600,
    output_path=f"s3://{bucket_name}/output",
    sagemaker_session=sagemaker_session,
)

# Set hyperparameters for object detection
od_model.set_hyperparameters(
    base_network="resnet-50",
    use_pretrained_model=1,
    num_classes=2,
    mini_batch_size=16,
    epochs=30,
    learning_rate=0.001,
    optimizer="adam",
    overlap_threshold=0.5,
    nms_threshold=0.45,
    image_shape=300,
    num_training_samples=1000,
)

# Start the training job with both train and validation channels
# od_model.fit({"train": train_input, "validation": validation_input})
od_model.fit({"train": train_input, "validation": validation_input})
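One possibility worth ruling out, which the post itself does not confirm: AWS sample notebooks that train the built-in object detection algorithm from an augmented manifest in Pipe mode set the channel content type to `application/x-recordio` with RecordIO record wrapping, whereas the configuration printed above uses `application/x-image` and sets no record wrapper. The sketch below builds a channel dictionary in the shape of the `InputDataConfig` entries of the `CreateTrainingJob` API and compares a printed config against those settings. The helper names and the example bucket are hypothetical, and the expected values are an assumption about the likely cause rather than a verified fix.

```python
# Settings assumed (not confirmed by the post) for augmented manifest input.
EXPECTED_SETTINGS = {
    "ContentType": "application/x-recordio",
    "RecordWrapperType": "RecordIO",
    "InputMode": "Pipe",
}


def augmented_manifest_channel(channel_name, manifest_uri, attribute_names):
    """Build one InputDataConfig entry for an augmented manifest channel."""
    channel = {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_uri,
                "S3DataDistributionType": "FullyReplicated",
                "AttributeNames": list(attribute_names),
            }
        },
    }
    channel.update(EXPECTED_SETTINGS)
    return channel


def find_config_gaps(config, expected=EXPECTED_SETTINGS):
    """List the settings in a channel config that differ from the expected ones."""
    gaps = []
    for key, wanted in expected.items():
        actual = config.get(key)
        if actual != wanted:
            gaps.append(f"{key}: expected {wanted!r}, found {actual!r}")
    return gaps


# The train channel config exactly as printed in the post's log output.
posted_config = {
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "AugmentedManifestFile",
            "S3Uri": "s3://agilent-aws-tmp-12-data/bad_ofs/Images_final/Crushing/train_manifest.json",
            "S3DataDistributionType": "FullyReplicated",
            "AttributeNames": ["source-ref", "bounding-box"],
        }
    },
    "ContentType": "application/x-image",
    "InputMode": "Pipe",
}

for gap in find_config_gaps(posted_config):
    print(gap)
```

With the SageMaker Python SDK, the equivalent change would be passing `content_type="application/x-recordio"` and `record_wrapping="RecordIO"` to `TrainingInput` for both channels, leaving the other arguments as they are.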

More details here: https://stackoverflow.com/questions/79162377/aws-sagemaker-clienterror-train-channel-is-not-specified-manifest-file-error
