Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct-based API / Hide raw types of decoding tables & decoding hash maps #14

Open
tats-u opened this issue Jun 6, 2024 · 6 comments
Open
Labels
breaking change good first issue Good for newcomers help wanted Extra attention is needed
Milestone

Comments

@tats-u
Copy link
Owner

tats-u commented Jun 6, 2024

I want to switch this crate to struct-based API:

let encoder = get_cp_encoder(OEMCodePage::Cp437);
// let encoder = try_get_cp_encoder_from_number(437).unwrap(); // For those who want to pass a number directly
let cp437_bytes = encoder.encode_string_lossy("½ ± 1");

The current API looks ugly and exposes too much.

Especially [char; 128] and [Option<char>; 128] should not be exposed.

@tats-u tats-u added help wanted Extra attention is needed good first issue Good for newcomers breaking change labels Jun 6, 2024
@tats-u tats-u added this to the v4+ milestone Jun 6, 2024
@mkroening
Copy link
Contributor

I think we can base this on the char-based API and provide both in a nice API.

One question for clarification: Is it common to dynamically change the code page at runtime? Or would something like get_cp_encoder::<437>() be okay too?

@tats-u tats-u changed the title Struct-based API / Hide raw decoding tables & decoding hash maps Struct-based API / Hide raw types of decoding tables & decoding hash maps Jun 9, 2024
@tats-u
Copy link
Owner Author

tats-u commented Jun 9, 2024

Is it common to dynamically change the code page at runtime?

Not just common but mandatory if you're going to develop apps for worldwide use.
Only for an app for a single country (e.g. one for internal use in your company), something like CP437_ENCODER is useful.

OEM code pages (= encoding for ZIP files and the terminal) depends on the language and region in Windows; CP437 in US English and Shift_JIS (CP932; not covered by this crate) in Japanese.

@tats-u
Copy link
Owner Author

tats-u commented Jun 11, 2024

Ah I got it. the enum-based is more preferred. There's no reason to prefer to use a raw code page number (like 437 instead of OEMCodePage::Cp437).
Enum can be easily completed and combined with match.

@mkroening
Copy link
Contributor

Ah, makes sense. This is a draft of the static API I am imagining:

#[derive(Clone, Copy)]
pub struct Cp<const N: u16>(u8);

impl Cp<437> {
    pub const fn new(cp: u8) -> Option<Self>;
    pub const fn get(self) -> u8;
    pub const fn from_char(c: char) -> Option<Self>;
    pub const fn to_char(self) -> char;
}

pub type CpStr<N> = [Cp<N>];

impl CpStr<437> {
    pub const fn from_cp(bytes: &[u8]) -> Result<&Self, CpError>;
    pub const fn to_bytes(&self) -> &[u8];
}

pub struct CpString<const N: u16>(Box<CpStr<N>>);

impl CpString<437> {
    pub fn from_cp_lossy(bytes: &[u8]) -> Cow<'_, CpStr<N>>;
}

This API does not have structural differences between complete and incomplete tables: For complete tables, some operations just never fail.

If I understood you correctly, we also need to find a replacement for accessing DECODING_TABLE_CP_MAP with dynamic numbers, right? The dynamic API could look like this:

pub struct CpNumber(u16);

impl CpNumber {
    pub const fn new(n: u16) -> Option<Self>;
}

impl str {
    pub fn to_cp(&self, n: CpNumber) -> Result<&[u8], CpError>;
    pub fn to_cp_lossy(&self, n: CpNumber) -> Cow<'_, [u8]>;
}

impl [u8] {
    pub fn cp_to_str(&self, n: CpNumber) -> Result<&str, CpError>;
    pub fn cp_to_str_lossy(&self, n: CpNumber) -> Cow<'_, str>;
}

What do you think of this? Would this work for all use-cases? I can work on this and open a PR (or change the existing one) if you want.

@tats-u
Copy link
Owner Author

tats-u commented Jun 13, 2024

https://crates.io/crates/enum-map

This crate looks better than CpNumber(u16), which can't prevent the typo of the code page number by users who want to use a single code page.
Also I think it allows invalid objects like let foo = CpNumber(1252); (not CpNumber::new(1252)) by a user.

Your suggested API looks suitable for one-shot or casual/informal/simple usage because the library can't easily cache the reference of a encoding/decoding table. An additional static hash map (recent code page numbers → tables) object will be required.

@mkroening
Copy link
Contributor

This crate looks better than CpNumber(u16)

I thought we need a way to go from u16 to code page at runtime, no?

If not, we can just use enums and don't need enum-map or what exactly were you thinking of?

Also I think it allows invalid objects like let foo = CpNumber(1252); (not CpNumber::new(1252)) by a user.

This is not possible, as the inner field is not public. Creation can only happen through new.

library can't easily cache the reference of a encoding/decoding table

How about putting the reference into the number to allow reuse? This makes the struct bigger, though. I am not sure if a phf lookup in exchange for smaller structs might be beneficial here, but we can do it if you want:

pub struct CpNumber {
    encode_table: &'static Map,
    decode_table: &'static Map,
}

impl CpNumber {
    pub fn new(n: u16) -> Self {
        /// look up the table
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants