
【BUG】can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset' #311

Closed
Carpdong opened this issue Jan 7, 2025 · 9 comments
Labels
bug Something isn't working

Comments


Carpdong commented Jan 7, 2025

Describe the bug

data_path="./train_dataset" # replace to your data path
save_dir="./save_finetune_mp" # replace to your save path
n_gpu=1
MASTER_PORT=10086
dict_name="dict.txt"
weight_path="./mol_pre_no_h_220816.pt" # replace to your ckpt path
task_name="mp" # molecular property prediction task name
task_num=1
loss_func="finetune_smooth_mae"
lr=1e-4
batch_size=32
epoch=500
dropout=0
warmup=0.06
local_batch_size=32
only_polar=0
conf_size=11
seed=0

if [ "$task_name" == "qm7dft" ] || [ "$task_name" == "qm8dft" ] || [ "$task_name" == "qm9dft" ] || [ "$task_name" == "ep" ] || [ "$task_name" == "lipo" ] || [ "$task_name" == "mp" ] || [ "$task_name" == "bp" ] || [ "$task_name" == "fp" ]; then
metric="valid_agg_mae"
elif [ "$task_name" == "esol" ] || [ "$task_name" == "freesolv" ]; then
metric="valid_agg_rmse"
else
metric="valid_agg_auc"
fi

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
update_freq=$(expr $batch_size / $local_batch_size)
python -m torch.distributed.launch --nproc_per_node=$n_gpu --master_port=$MASTER_PORT $(which unicore-train) $data_path --task-name $task_name --user-dir ./unimol --train-subset train --valid-subset valid \
       --conf-size $conf_size \
       --num-workers 8 --ddp-backend=c10d \
       --dict-name $dict_name \
       --task mol_finetune --loss $loss_func --arch unimol_base \
       --classification-head-name $task_name --num-classes $task_num \
       --optimizer adam --adam-betas "(0.9, 0.99)" --adam-eps 1e-6 --clip-norm 1.0 \
       --lr-scheduler polynomial_decay --lr $lr --max-epoch $epoch --batch-size $local_batch_size --pooler-dropout $dropout \
       --update-freq $update_freq --seed $seed \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
       --log-interval 100 --log-format simple \
       --validate-interval 1 \
       --finetune-from-model $weight_path \
       --best-checkpoint-metric $metric --patience 20 \
       --save-dir $save_dir --only-polar $only_polar \
       --reg

This is my finetune.sh file; when I run it, it fails with:
can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'

Uni-Mol Version

Uni-Mol

Expected behavior

Fine-tuning runs to completion.

To Reproduce

No response

Environment

No response

Additional Context

No response

@Carpdong Carpdong added the bug Something isn't working label Jan 7, 2025
@ZhouGengmo (Collaborator)

The script itself looks fine.
Is the stray dot in front of data_path an artifact of copy-pasting, or is it actually in the script?

`data_path="./train_dataset" # replace to your data path`

Also, each line can be continued with a trailing `\`, for example:
--dict-name $dict_name \
--task mol_finetune --loss $loss_func --arch unimol_base \

Carpdong (Author) commented Jan 7, 2025

Thanks for the reply!
The current error is: can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'
The dot in front of data_path was an extra character from copying; it is not in the script.
The train_dataset directory contains train.lmdb.

@ZhouGengmo (Collaborator)

> This is my finetune.sh file; when I run it, it fails with:
> can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'

The path /scripts/./train_dataset here looks suspicious; try replacing data_path with an absolute path.

Also, if you want to fine-tune, we recommend unimol_tools, which is more user-friendly.
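The absolute-path suggestion can be sketched directly in the script; the `./train_dataset` directory and the `$(cd … && pwd)` idiom below are illustrative placeholders, not anything the project mandates:

```shell
# Minimal sketch: resolve data_path to an absolute path before launching,
# so the launcher never sees a relative "scripts/./train_dataset" path.
mkdir -p ./train_dataset                    # placeholder dir for the example
data_path="$(cd ./train_dataset && pwd)"    # yields an absolute path
echo "$data_path"
```

`realpath ./train_dataset` achieves the same where GNU coreutils is available; the `cd … && pwd` form is the portable fallback.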

Carpdong (Author) commented Jan 7, 2025

Thanks for the reply.
It may be an environment problem. After switching to an absolute path, running in my own environment still fails with:
can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'
I also tried the Docker container you provide, which fails with:
FileNotFoundError: /UniMol/Prediction/scripts/unimol

@ZhouGengmo (Collaborator)

> I also tried the Docker container you provide, which fails with:
> FileNotFoundError: /UniMol/Prediction/scripts/unimol

For this error, replace the unimol directory in `--user-dir ./unimol` with an absolute path.
The corresponding directory in the repository is:
https://github.com/deepmodeling/Uni-Mol/tree/main/unimol/unimol

Carpdong (Author) commented Jan 9, 2025

Thanks for the reply.
The Docker container you provide now fails with: ModuleNotFoundError: No module named 'numpy._core'
Neither upgrading nor downgrading numpy helps.

@ZhouGengmo (Collaborator)

You can try the unicore images; they are largely general-purpose, though you will need to install rdkit:
docker pull dptechnology/unicore:latest-pytorch1.11.0-cuda11.3
docker pull dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma

@Carpdong (Author)

The cause of `ModuleNotFoundError: No module named 'numpy._core'` is the train.lmdb, test.lmdb, and valid.lmdb files produced during data processing. Other pickle-related errors have the same cause:
the environment used to generate the lmdb files must be the same environment used for fine-tuning.
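A quick way to see why mismatched environments break unpickling is to inspect which module path a pickled array references. This is a sketch that assumes numpy is installed; arrays pickled under numpy >= 2 embed `numpy._core`, which numpy 1.x cannot import, while numpy 2.x keeps a `numpy.core` shim so it can still load the older pickles:

```shell
# Print which numpy module path ends up inside a pickled array.
# If the lmdb files were written where this prints "numpy._core" and
# fine-tuning runs on numpy 1.x, unpickling raises ModuleNotFoundError.
python -c '
import pickle
import numpy as np
payload = pickle.dumps(np.arange(3))
print("numpy._core" if b"numpy._core" in payload else "numpy.core")
'
```

Running this in both the data-processing and fine-tuning environments and comparing the output is a cheap consistency check; pinning the same numpy version in both avoids the problem entirely.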

Carpdong (Author) commented Jan 14, 2025

In finetune.sh, `python -m torch.distributed.launch` needs to be changed to `torchrun`; this avoids one error:
raise ChildFailedError (torch.distributed.elastic.multiprocessing.errors.ChildFailedError)

In inference.sh, changing `python` to `torchrun` avoids one error:
KeyError: 'mol_finetune'
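The swap amounts to replacing only the launcher and keeping the rest of the command; the fragment below is a sketch based on the script earlier in this thread, not a verified full command:

```shell
# Sketch: torchrun replaces the deprecated torch.distributed.launch.
# All flags after $(which unicore-train) stay exactly as in finetune.sh.
torchrun --nproc_per_node=$n_gpu --master_port=$MASTER_PORT \
    $(which unicore-train) $data_path --task-name $task_name --user-dir ./unimol \
    ...   # remaining arguments unchanged
```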
