How was the `tokenizer.json` created?
Hi Xenova team,
I'm trying to understand how you generated the tokenizer.json file used in your models. Was it directly exported from a SentencePiece model, converted via Hugging Face’s transformers tools, or created through a custom process?
Specifically, I'm interested in reproducing the same structure for a custom SentencePiece model so it works with your ONNX/transformers.js pipelines.
Could you please share how you built or converted it — and which tools or scripts were used?
Thanks in advance!
Hi Xenova team,
I hope you’re doing well. I sent the message above on October 22 regarding how the tokenizer.json file was generated for your models, but I haven’t heard back yet.
I’m still very interested in understanding whether it was exported directly from a SentencePiece model, converted via Hugging Face tools, or created through a custom process — and any guidance for reproducing the same structure for a custom SentencePiece model to work with your ONNX/transformers.js pipelines.
I’d greatly appreciate any insight or pointers whenever you have a chance.
Thank you very much!
Hi Xenova,
I have been working on this for quite a long time, so it would be great if you could answer my question. Thank you.
Hi there. Sorry for the delayed response. I wrote a very simple conversion script a couple of years ago for it: https://github.com/huggingface/transformers.js/blob/b125e82b86f62cf9f0f77217601e04a5ff7c6e7f/scripts/extra/marian.py
I haven't tested or used it in a long time, so hopefully it still works 😅
Hi Xenova,
Thank you so much for sharing the script — I tried it out and it worked perfectly for generating the tokenizer.json.
Best regards,
Vaishal