Use `langcodes` to match `to_lang` to `chat_sample` name. #872

SamuelWN · 2025-03-07T16:36:30Z

Change `chat_sample` handling:

- Add function `get_chat_sample(to_lang)` to return the `chat_sample` for the requested language
- Use `langcodes` to match `to_lang` to the closest available sample language
  - Currently implemented with a narrow margin (`max_distance=5`)
    - e.g. both `en-US` vs `en-GB` and `pt-BR` vs `pt-PT` have a `langcodes` distance of 5
- Cache the `chat_sample[to_lang]` match as a variable (the `langcodes` matching only needs to run once)

Provides additional flexibility & resilience for handling `chat_sample` language IDs (a user-provided value).

Additional changes:

- Consolidate `_LANGUAGE_CODE_MAP` into `GPTConfig`
- Clean up unused imports
- `chatgpt.py`: add English comments below Chinese comments

Change `chat_sample` handling:

- Add function `get_chat_sample(to_lang)` to return chat_sample for requested language
- Use `langcodes` to match `to_lang` to the closest available sample languages (within narrow margin)
  * `max_difference = 5` = `en-US` vs `en-GB` or `pt-BR` vs `pt-PT`
- Cache `chat_sample[to_lang]` match as variable (only need to do `langcodes` matching once)


SamuelWN and others added 7 commits March 3, 2025 14:11

Append English translation for comments

1887579

Rename ollama.py to custom_openai.py

bdef6f9

For zyddnys@b964df3

Merge branch 'main' into langcode_temp

189ff1c

Merge langcode_temp

2cc1103

Merge `ollama` --> `custom_openai` migration

Add DEEPSEEK_MODEL key.

9d2894e

- Default: `deepseek-chat`
- Option: `deepseek-reasoner`

If `reasoning_content` is provided: print to `debug` logger.

Add `ConfigGPT` setup.

Use tokenizer to count tokens accurately.

ebb4d4c

- https://api-docs.deepseek.com/quick_start/token_usage#calculate-token-usage-offline

Use `_assemble_prompts` function from current ChatGPT script (zyddnys@c3bd2e9)
- Modified to use true token count

SamuelWN changed the title ~~Change chat_sample handling~~ Use langcodes to match to_lang to chat_sample name. Mar 7, 2025

SamuelWN closed this Mar 7, 2025
