French resources (datasets & models) I developed to empower use cases in French
Loïck BOURDOIS
lbourdois
AI & ML interests
👀
Recent Activity
- Updated a model 13 days ago: Bretagne/whisper-large-v3-turbo-audio_breton-transcription_breton
- Updated a model 13 days ago: Bretagne/whisper-large-v3-turbo-audio_breton-transcription_francais
- Updated a dataset 13 days ago: lbourdois/VQA-neulab-CulturalGround-clean
Organizations
FAT5
Flash Attention T5 (FAT5) models developed when I worked at CATIE (https://hf.co/CATIE-AQ).
French NER
NER models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 170,000 downloads.
- CATIE-AQ/Moderncamembert_3entities: Token Classification • 0.1B • Updated • 43
- CATIE-AQ/NERmemberta-3entities: Token Classification • 0.1B • Updated • 189 • 1
- CATIE-AQ/NERmembert-base-3entities: Token Classification • 0.1B • Updated • 13 • 2
- CATIE-AQ/NERmembert-large-3entities: Token Classification • 0.3B • Updated • 473 • 2
French prompts datasets
French prompts dataset developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 30,000 downloads.
French VQA datasets
VQA datasets I cleaned. Each sample has an image, a question and an answer.
They can be used to train VLMs.
French OCR datasets
Datasets I cleaned. Each sample has an image, a prompt question (like "transcribe the text in this image") and an answer.
They can be used to train VLMs.
French table-to-text datasets
In 2021, before the release of LoRA, I was interested in prefix-tuning, which I wanted to apply to French, so I had to translate table-to-text data.
French Translations
Things I've translated: courses, blog posts, guides. More on my personal blog (https://lbourdois.github.io/blog/).
- Free online AI courses in French: 📚 French translations of four AI courses
- lbourdois/en-fr-nyu-dl-course-corpus: Viewer • Updated • 3.13k • 90
- SSM Blog Posts: 📝 Blog posts about State Space Models (SSM)
- Guide sur l'évaluation des LLM: ⚖ Translation of Clémentine Fourrier's guide on LLM evaluation
Breton packs
Breton resources (datasets & models) I developed to empower use cases in Breton
French QA
QA models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 150,000 downloads.
French embedding datasets
French datasets for training or evaluating embedding models.
French caption datasets
Datasets I cleaned. Each sample has an image, a prompt question (like "describe this image") and an answer.
They can be used to train VLMs.
- lbourdois/caption-maya-multimodal-pretrain-clean: Viewer • Updated • 551k • 187
- CATIE-AQ/caption-vidore-vdsid_french-clean: Viewer • Updated • 5k • 17
- CATIE-AQ/caption-vidore-tabfquad_test_subsampled-clean: Viewer • Updated • 280 • 11
- CATIE-AQ/caption-floschne-xm3600-clean: Viewer • Updated • 8.56k • 21
French retriever datasets
Datasets I cleaned. Each sample has an image and a question.
They can be used to train visual retrievers (ColPali and similar models).
- CATIE-AQ/retriever-vidore-vdsid_french-clean: Viewer • Updated • 5k • 17
- CATIE-AQ/retriever-vidore-tabfquad_test_subsampled-clean: Viewer • Updated • 280 • 15
- CATIE-AQ/retriever-manu-tabfquad_retrieving-clean: Viewer • Updated • 1.83k • 11
- CATIE-AQ/retriever-princeton-nlp-CharXiv-clean: Viewer • Updated • 1.32k • 15
French audio datasets (pretraining)
Around 117K hours of French audio for research purposes.
French packs
French resources (datasets & models) I developed to empower use cases in French