How was the `tokenizer.json` created?

#3 · opened by VaishalBusiness

Hi Xenova team,

I'm trying to understand how you generated the tokenizer.json file used in your models. Was it directly exported from a SentencePiece model, converted via Hugging Face’s transformers tools, or created through a custom process?

Specifically, I'm interested in reproducing the same structure for a custom SentencePiece model so it works with your ONNX/transformers.js pipelines.

Could you please share how you built or converted it — and which tools or scripts were used?

Thanks in advance!

VaishalBusiness changed discussion status to closed
VaishalBusiness changed discussion status to open

Hi Xenova team,

I hope you’re doing well. I sent the message above on October 22 regarding how the tokenizer.json file was generated for your models, but I haven’t heard back yet.

I’m still very interested in understanding whether it was exported directly from a SentencePiece model, converted via Hugging Face tools, or created through a custom process — and any guidance for reproducing the same structure for a custom SentencePiece model to work with your ONNX/transformers.js pipelines.

I’d greatly appreciate any insight or pointers whenever you have a chance.

Thank you very much!

Hi Xenova,
I have been working on this for quite a long time; it would be great if you could answer my question. Thank you.

Hi there. Sorry for the delayed response. I wrote a very simple conversion script a couple of years ago for it: https://github.com/huggingface/transformers.js/blob/b125e82b86f62cf9f0f77217601e04a5ff7c6e7f/scripts/extra/marian.py

I haven't tested or used it in a long time, so hopefully it still works 😅
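For anyone trying to reproduce the structure by hand: a `tokenizer.json` file is just the Hugging Face `tokenizers` library's JSON serialization of a fast tokenizer, and for a SentencePiece model the `model` section is typically a Unigram model holding `(piece, score)` pairs. The sketch below is *not* the `marian.py` script linked above; it is a minimal illustration with hypothetical vocab entries (in a real conversion you would read the pieces and scores out of the `.spm` model's protobuf), and the exact field names in the `normalizer`/`pre_tokenizer` sections vary between `tokenizers` versions, so treat it as a starting point rather than a spec.

```python
import json

# Hedged sketch of a minimal tokenizer.json for a SentencePiece Unigram
# model, in (approximately) the Hugging Face `tokenizers` serialization
# format. The vocab entries here are hypothetical placeholders; a real
# converter would extract them from the SentencePiece .spm file.
vocab = [
    ["<unk>", 0.0],     # unk_id below points at this entry
    ["▁Hello", -2.5],   # "▁" (U+2581) marks a word boundary in SentencePiece
    ["▁world", -3.1],
]

tokenizer_json = {
    "version": "1.0",
    "added_tokens": [],
    # Field names in these sections differ across tokenizers versions.
    "pre_tokenizer": {"type": "Metaspace", "replacement": "▁"},
    "model": {"type": "Unigram", "unk_id": 0, "vocab": vocab},
}

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_json, f, ensure_ascii=False, indent=2)

# Round-trip check: the file parses and the model section is intact.
with open("tokenizer.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["model"]["type"])  # Unigram
```

The quickest way to see what the format should actually look like for your model family is to open an existing `tokenizer.json` from a similar model on the Hub and mirror its sections.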

Hi Xenova,

Thank you so much for sharing the script — I tried it out and it worked perfectly for generating the tokenizer.json.

Best regards,
Vaishal

VaishalBusiness changed discussion status to closed
