
【BUG】can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset' #311

Closed
Carpdong opened this issue Jan 7, 2025 · 9 comments
Labels
bug Something isn't working

Comments


Carpdong commented Jan 7, 2025

Describe the bug

data_path="./train_dataset" # replace to your data path
save_dir="./save_finetune_mp" # replace to your save path
n_gpu=1
MASTER_PORT=10086
dict_name="dict.txt"
weight_path="./mol_pre_no_h_220816.pt" # replace to your ckpt path
task_name="mp" # molecular property prediction task name
task_num=1
loss_func="finetune_smooth_mae"
lr=1e-4
batch_size=32
epoch=500
dropout=0
warmup=0.06
local_batch_size=32
only_polar=0
conf_size=11
seed=0

if [ "$task_name" == "qm7dft" ] || [ "$task_name" == "qm8dft" ] || [ "$task_name" == "qm9dft" ] || [ "$task_name" == "ep" ] || [ "$task_name" == "lipo" ] || [ "$task_name" == "mp" ] || [ "$task_name" == "bp" ] || [ "$task_name" == "fp" ]; then
metric="valid_agg_mae"
elif [ "$task_name" == "esol" ] || [ "$task_name" == "freesolv" ]; then
metric="valid_agg_rmse"
else
metric="valid_agg_auc"
fi

export NCCL_ASYNC_ERROR_HANDLING=1
export OMP_NUM_THREADS=1
update_freq=$(expr $batch_size / $local_batch_size)
python -m torch.distributed.launch --nproc_per_node=$n_gpu --master_port=$MASTER_PORT $(which unicore-train) $data_path --task-name $task_name --user-dir ./unimol --train-subset train --valid-subset valid \
       --conf-size $conf_size \
       --num-workers 8 --ddp-backend=c10d \
       --dict-name $dict_name \
       --task mol_finetune --loss $loss_func --arch unimol_base \
       --classification-head-name $task_name --num-classes $task_num \
       --optimizer adam --adam-betas "(0.9, 0.99)" --adam-eps 1e-6 --clip-norm 1.0 \
       --lr-scheduler polynomial_decay --lr $lr --max-epoch $epoch --batch-size $local_batch_size --pooler-dropout $dropout \
       --update-freq $update_freq --seed $seed \
       --fp16 --fp16-init-scale 4 --fp16-scale-window 256 \
       --log-interval 100 --log-format simple \
       --validate-interval 1 \
       --finetune-from-model $weight_path \
       --best-checkpoint-metric $metric --patience 20 \
       --save-dir $save_dir --only-polar $only_polar \
       --reg

This is my finetune.sh file; when I run it, it fails with:
can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'

Uni-Mol Version

Uni-Mol

Expected behavior

Fine-tuning runs to completion.

To Reproduce

No response

Environment

No response

Additional Context

No response

@Carpdong Carpdong added the bug Something isn't working label Jan 7, 2025
@ZhouGengmo (Collaborator)

The script itself looks fine.
Is the stray dot in front of data_path an artifact of copy-pasting, or is it actually in the script?

`data_path="./train_dataset" # replace to your data path`

Also, each line can be continued with a trailing `\`, for example:
--dict-name $dict_name \
--task mol_finetune --loss $loss_func --arch unimol_base \

Carpdong (Author) commented Jan 7, 2025

Thanks for the reply!
The current error is: can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'
The dot in front of data_path was an extra character from copying; it is not in the script.
The train_dataset directory contains train.lmdb.

@ZhouGengmo (Collaborator)

> This is my finetune.sh file; when I run it, it fails with:
> can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'

The path /scripts/./train_dataset here looks suspicious; try replacing data_path with an absolute path.

Also, if you want to fine-tune, we recommend unimol_tools, which is more user-friendly.
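The absolute-path suggestion can be sketched directly in the script; the `./train_dataset` directory and the `$(cd … && pwd)` idiom below are illustrative placeholders, not anything the project mandates:

```shell
# Minimal sketch: resolve data_path to an absolute path before launching,
# so the launcher never sees a relative "scripts/./train_dataset" path.
mkdir -p ./train_dataset                    # placeholder dir for the example
data_path="$(cd ./train_dataset && pwd)"    # yields an absolute path
echo "$data_path"
```

`realpath ./train_dataset` achieves the same where GNU coreutils is available; the `cd … && pwd` form is the portable fallback.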

Carpdong (Author) commented Jan 7, 2025

Thanks for the reply.
It may be an environment problem. After switching to an absolute path, running in my own environment still fails with:
can't find '__main__' module in '/root/lyd/unimol/Prediction/scripts/./train_dataset'
I also tried the Docker container you provide, which fails with:
FileNotFoundError: /UniMol/Prediction/scripts/unimol

@ZhouGengmo (Collaborator)

> I also tried the Docker container you provide, which fails with:
> FileNotFoundError: /UniMol/Prediction/scripts/unimol

For this error, replace the unimol directory in `--user-dir ./unimol` with an absolute path.
The corresponding directory in the repository is:
https://github.com/deepmodeling/Uni-Mol/tree/main/unimol/unimol

Carpdong (Author) commented Jan 9, 2025

Thanks for the reply.
The Docker container you provide now fails with: ModuleNotFoundError: No module named 'numpy._core'
Neither upgrading nor downgrading numpy helps.

@ZhouGengmo (Collaborator)

You can try the unicore images; they are largely general-purpose, though you will need to install rdkit:
docker pull dptechnology/unicore:latest-pytorch1.11.0-cuda11.3
docker pull dptechnology/unicore:latest-pytorch1.12.1-cuda11.6-rdma

@Carpdong (Author)

The cause of `ModuleNotFoundError: No module named 'numpy._core'` is the train.lmdb, test.lmdb, and valid.lmdb files produced during data processing. Other pickle-related errors have the same cause:
the environment used to generate the lmdb files must be the same environment used for fine-tuning.
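A quick way to see why mismatched environments break unpickling is to inspect which module path a pickled array references. This is a sketch that assumes numpy is installed; arrays pickled under numpy >= 2 embed `numpy._core`, which numpy 1.x cannot import, while numpy 2.x keeps a `numpy.core` shim so it can still load the older pickles:

```shell
# Print which numpy module path ends up inside a pickled array.
# If the lmdb files were written where this prints "numpy._core" and
# fine-tuning runs on numpy 1.x, unpickling raises ModuleNotFoundError.
python -c '
import pickle
import numpy as np
payload = pickle.dumps(np.arange(3))
print("numpy._core" if b"numpy._core" in payload else "numpy.core")
'
```

Running this in both the data-processing and fine-tuning environments and comparing the output is a cheap consistency check; pinning the same numpy version in both avoids the problem entirely.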

Carpdong (Author) commented Jan 14, 2025

In finetune.sh, `python -m torch.distributed.launch` needs to be changed to `torchrun`; this avoids one error:
raise ChildFailedError (torch.distributed.elastic.multiprocessing.errors.ChildFailedError)

In inference.sh, changing `python` to `torchrun` avoids one error:
KeyError: 'mol_finetune'
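The swap amounts to replacing only the launcher and keeping the rest of the command; the fragment below is a sketch based on the script earlier in this thread, not a verified full command:

```shell
# Sketch: torchrun replaces the deprecated torch.distributed.launch.
# All flags after $(which unicore-train) stay exactly as in finetune.sh.
torchrun --nproc_per_node=$n_gpu --master_port=$MASTER_PORT \
    $(which unicore-train) $data_path --task-name $task_name --user-dir ./unimol \
    ...   # remaining arguments unchanged
```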
