Dataset Video Action videos Common Use Cases: Movement detection, Human Body Movement, Action Classification Recording Device: Camera Unit: 300 videos Add Dataset to Quote HUMAN_BODY_VID003 Appen Global Human Body Movement N/A United States Mixed lighting conditions Available upon request N/A N/A N/A N/A mp4 Participants videoed themselves completing an action from a given prompt, e.g. “zip up a jacket”, “drink a beverage” Action videos
Dataset Text Albanian (Albania) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 12,000 words Add Dataset to Quote sqi_ALB_PHON Appen Global Pronunciation Dictionary Albanian Albania N/A N/A N/A N/A 12,000 N/A text Albanian (Albania) Pronunciation Dictionary
Dataset Text Amharic (Ethiopia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 49,000 words Add Dataset to Quote amh_ETH_PHON Appen Global Pronunciation Dictionary Amharic Ethiopia N/A N/A N/A N/A 49,000 N/A text Amharic (Ethiopia) Pronunciation Dictionary
Dataset Text Arabic (Algeria) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 11,000 words Add Dataset to Quote ara_DZA_PHON Appen Global Pronunciation Dictionary Arabic Algeria N/A N/A N/A N/A 11,000 N/A text Arabic (Algeria) Pronunciation Dictionary
Dataset Audio Arabic (Eastern Algeria) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 29 hours Add Dataset to Quote EAR_ASR001 Appen Global Conversational Speech Arabic Algeria Low background noise (home/office) 496 2 32,899 15,314 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
8% landline, 92% mobile
Arabic (Eastern Algeria) conversational telephony
Dataset Text Arabic (Egypt) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 40,000 words Add Dataset to Quote ara_EGY_PHON Appen Global Pronunciation Dictionary Arabic Egypt N/A N/A N/A N/A 40,000 N/A text Arabic (Egypt) Pronunciation Dictionary
Dataset Audio Arabic (Egypt) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 352 hours Add Dataset to Quote ARE_ASR001_CN Appen China Scripted Speech Arabic Egypt Low background noise (home/office) 627 1 128,908 207,576 16 wav Dataset contains audio with corresponding text prompts
Text prompts are not vowelised
Arabic (Egypt) scripted smartphone
Dataset Text Arabic (Iraq) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 13,000 words Add Dataset to Quote ara_IRQ_POS Appen Global Part of Speech Dictionary Arabic Iraq N/A N/A N/A N/A 13,000 N/A text Arabic (Iraq) Part of Speech Dictionary
Dataset Text Arabic (Iraq) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 19,000 words Add Dataset to Quote ara_IRQ_PHON Appen Global Pronunciation Dictionary Arabic Iraq N/A N/A N/A N/A 19,000 N/A text Person names Arabic (Iraq) Pronunciation Dictionary
Dataset Audio Arabic (Levantine) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 32 hours Add Dataset to Quote ARU_ASR002 Appen Global Scripted Speech Arabic United Arab Emirates Low background noise (studio) 100 1 Available upon request Available upon request 48 wav Audio with corresponding text prompts. Transcription can be developed upon request. Arabic (Levantine) scripted microphone
Dataset Text Arabic (Libya) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 48,000 words Add Dataset to Quote ara_LBY_PHON Appen Global Pronunciation Dictionary Arabic Libya N/A N/A N/A N/A 48,000 N/A text Arabic (Libya) Pronunciation Dictionary
Dataset Audio Arabic (Modern Standard Arabic) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 12 hours Add Dataset to Quote MSA_ASR001 GlobalPhone Scripted Speech Arabic Tunisia Mixed (quiet home/office, public, outdoor) 78 1 4,908 40,000 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Arabic (Modern Standard Arabic) scripted microphone
Dataset Audio Arabic (Morocco) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 33 hours Add Dataset to Quote ARY_ASR001 Appen Global Conversational Speech Arabic Morocco Low background noise 180 2 80,430 23,836 8 alaw Each speaker participated in 1 to 4 conversations. Speakers are identified by a unique 4-digit speaker ID which is recorded in the demographic file
Transcription is available in original script and fully reversible Romanised version with accompanying pronunciation lexicon
English translation of product transcription is available (ARY_MT001, ARY_ASRMT001)
Arabic (Morocco) conversational telephony
Dataset Text Arabic (Morocco) conversational telephony translation Common Use Cases: MT, Chatbot, Conversational AI Recording Device: N/A Unit: 80,430 utterances Add Dataset to Quote ARY_MT001 Appen Global Conversational Translation Arabic Morocco N/A 180 N/A 80,430 23,836 N/A text Corresponding audio, transcription, fully reversible romanised transcription and pronunciation lexicon data are available (ARY_ASR001, ARY_ASRMT001) Arabic (Morocco) conversational telephony translation
Dataset Text Arabic (Morocco) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 60,000 words Add Dataset to Quote ara_MAR_PHON Appen Global Pronunciation Dictionary Arabic Morocco N/A N/A N/A N/A 60,000 N/A text Arabic (Morocco) Pronunciation Dictionary
Dataset Text Arabic (MSA) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 40,000 words Add Dataset to Quote arb_MSA_PHON Appen Global Pronunciation Dictionary Arabic (Standard) N/A N/A N/A N/A N/A 40,000 N/A text Arabic (MSA) Pronunciation Dictionary
Dataset Audio Arabic (Saudi Arabia) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 322 hours Add Dataset to Quote ARS_ASR001_CN Appen China Scripted Speech Arabic Saudi Arabia Low background noise (home/office) 227 1 104,574 156,282 16 wav Dataset contains audio with corresponding text prompts
Text prompts are not vowelised
300-1000 prompts per speaker covering general content including education, sports, entertainment, travel, culture and technology
Arabic (Saudi Arabia) scripted smartphone
Dataset Text Arabic (Sudan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 17,000 words Add Dataset to Quote ara_SDN_PHON Appen Global Pronunciation Dictionary Arabic Sudan N/A N/A N/A N/A 17,000 N/A text Arabic (Sudan) Pronunciation Dictionary
Dataset Image Arabic (UAE) printed text annotated OCR Common Use Cases: Document Processing, Document Search, Text detection Recording Device: Mobile phone Unit: 20000 images Add Dataset to Quote IMG_OCR_ARU002_CN Appen China Document OCR Arabic United Arab Emirates Mixed lighting conditions N/A N/A N/A N/A N/A jpg + json Images containing text, such as slogans, advertisements, maps, store names, menus, product outer packaging, indication board. Includes bounding box annotations, 50 boxes per image, with all text annotated (Arabic, non-Arabic characters, special characters, numbers) Arabic (UAE) printed text annotated OCR
Dataset Text Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 75,000 words Add Dataset to Quote ara_ARE_PHON Appen Global Pronunciation Dictionary Arabic United Arab Emirates (UAE) N/A N/A N/A N/A 75,000 N/A text Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary
Dataset Audio Arabic (United Arab Emirates (UAE)) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 170 hours Add Dataset to Quote ARU_ASR001_CN Appen China Scripted Speech Arabic United Arab Emirates (UAE) Low background noise (home/office) 133 1 42,352 85,775 16 wav Dataset contains audio with corresponding text prompts
Text prompts are not vowelised
Arabic (United Arab Emirates (UAE)) scripted smartphone
Dataset Audio Arabic (United Arab Emirates (UAE)) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 48 hours Add Dataset to Quote OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) Nuance Scripted Speech Arabic United Arab Emirates (UAE) Low background noise 880 1 43,000 22197 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
Arabic (United Arab Emirates (UAE)) scripted telephony
Dataset Audio Arabic (United Arab Emirates (UAE)) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 31 hours Add Dataset to Quote OrienTel United Arab Emirates MSA (Modern Standard Arabic) Nuance Scripted Speech Arabic United Arab Emirates (UAE) Low background noise 500 1 24,500 13348 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
Arabic (United Arab Emirates (UAE)) scripted telephony
Dataset Audio Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 86 hours Add Dataset to Quote CGA_ASR001 Appen Global Scripted Speech Arabic United Arab Emirates (UAE) – Saudi Arabia Low background noise (home/office) 150 4 42,000 19,245 16 raw PCM Fully transcribed with acoustic event tagging derived from the SpeechDAT conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
All transcriptions are fully vowelized
280 prompts per speaker including 30 Person names (first name and family name) from a set of 15, 10 single isolated digits 0-10, 8-digit sequences (randomly generated), 200 phonetically balanced sentences, 30 x 10-word phonetically balanced word strings
Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone
Dataset Text Arabic NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 20,774 sentences Add Dataset to Quote ARB_NER001 Appen Global News NER Arabic (Standard) N/A N/A N/A N/A 20,774 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Arabic NER news text
Dataset Text Assamese (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 40,000 words Add Dataset to Quote asm_IND_PHON Appen Global Pronunciation Dictionary Assamese India N/A N/A N/A N/A 40,000 N/A text Assamese (India) Pronunciation Dictionary
Dataset Audio Baby crying audio Common Use Cases: Baby Monitor, Security & Other Consumer Applications Recording Device: Mobile phone Unit: 70 hours Add Dataset to Quote CRY_ASR001_CN Appen China Human Sound N/A China Low background noise (home/office) 566 1 N/A N/A 16 wav Crying sound of babies 0-3 years old, each lasting around 2 minutes. Audio only. Baby crying audio
Dataset Audio Bahasa Indonesia conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 31 hours Add Dataset to Quote BAH_ASR001 Appen Global Conversational Speech Indonesian Indonesia Low background noise 1,002 2 30,695 11,480 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
28% landline, 72% mobile
Bahasa Indonesia conversational telephony
Dataset Image Baking Pictures Common Use Cases: Image recognition Recording Device: N/A Unit: 6000 images Add Dataset to Quote IMG_BAKE_CN Appen China Image recognition N/A China N/A N/A N/A N/A N/A N/A jpg (Data source: website) This dataset includes pictures of baked goods: 2000 images of bread, 2000 images of cakes, and 2000 images of cookies. Image resolution: 640px * 640px. Shooting angle: either vertically downward or slightly offset. Baking Pictures
Dataset Text Basque (Spain) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 10,000 words Add Dataset to Quote eus_ESP_PHON Appen Global Pronunciation Dictionary Basque Spain N/A N/A N/A N/A 10,000 N/A text Basque (Spain) Pronunciation Dictionary
Dataset Audio Bengali (Bangladesh) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 47 hours Add Dataset to Quote BEN_ASR001 Appen Global Conversational Speech Bengali Bangladesh Mixed (in-car, roadside, home/office) 1,000 2 108,923 17,922 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Bengali (Bangladesh) conversational telephony
Dataset Text Bengali (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 29,000 words Add Dataset to Quote ben_IND_PHON Appen Global Pronunciation Dictionary Bengali India N/A N/A N/A N/A 29,000 N/A text Bengali (India) Pronunciation Dictionary
Dataset Audio Bulgarian (Bulgaria) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 38 hours Add Dataset to Quote BUL_ASR001 Appen Global Conversational Speech Bulgarian Bulgaria Low background noise (home/office) 217 2 86,453 22,342 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
49% landline, 51% mobile
Conversations cover a range of topics including: Holiday/Leisure, Movies/TV Shows and Work.
Bulgarian (Bulgaria) conversational telephony
Dataset Text Bulgarian (Bulgaria) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 55,000 words Add Dataset to Quote bul_BGR_PHON Appen Global Pronunciation Dictionary Bulgarian Bulgaria N/A N/A N/A N/A 55,000 N/A text Bulgarian (Bulgaria) Pronunciation Dictionary
Dataset Audio Bulgarian (Bulgaria) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 22 hours Add Dataset to Quote BUL_ASR002 GlobalPhone Scripted Speech Bulgarian Bulgaria Mixed (quiet home/office, public, outdoor) 77 1 8,674 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Bulgarian (Bulgaria) scripted microphone
Dataset Image Business-to-business printed text document OCR Common Use Cases: Document Processing, Document Search, Text detection Recording Device: Camera, scan Unit: 5,838 documents Add Dataset to Quote IMG_OCR_B2B Appen Global Document OCR N/A N/A Mixed lighting conditions N/A N/A N/A N/A N/A png Scans and photographs of business-to-business documents containing printed text. 38% Premium Quality images in 10 languages, 25 countries, including Purchase Order, Payment Advice or Remittance Advice, Order Confirmation and Delivery note. 64% Standard Quality images in various challenging conditions in 11 languages, 34 countries, in a wider range of categories including Complaints or Return, Delivery advice, Delivery note, Dunning, Goods receipt, Invoice, Offer, Order confirmation, Pay slip, Payment Advice or Remittance Advice, Purchase Order, Receipt, and Supplier load Business-to-business printed text document OCR
Dataset Audio Cantonese (China) business dialogues Common Use Cases: ASR, Conversational AI, Speech Analytics, Business Recording Device: Mobile phone Unit: 98.35 hours Add Dataset to Quote YYDH_ASR001_CN Appen China Conversational Speech Cantonese China Low background noise (home/office) 241 2 Available upon request Available upon request 16 wav Business meetings and conversations audio with transcription and timestamping, from a variety of industries.
30% male participants, 70% female
Cantonese (China) business dialogues
Dataset Text Cantonese (China) Simplified Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 37,000 words Add Dataset to Quote yue_CHN_PHON Appen Global Pronunciation Dictionary Cantonese China N/A N/A N/A N/A 37,000 N/A text Simplified Cantonese (China) Simplified Pronunciation Dictionary
Dataset Text Cantonese (China) Traditional Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 10,000 words Add Dataset to Quote yue_HKG_POS Appen Global Part of Speech Dictionary Cantonese China N/A N/A N/A N/A 10,000 N/A text Traditional Cantonese (China) Traditional Part of Speech Dictionary
Dataset Text Cantonese (China) Traditional Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 40,000 words Add Dataset to Quote yue_HKG_PHON Appen Global Pronunciation Dictionary Cantonese China N/A N/A N/A N/A 40,000 N/A text Traditional Cantonese (China) Traditional Pronunciation Dictionary
Dataset Text Catalan (Spain) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 10,000 words Add Dataset to Quote cat_ESP_PHON Appen Global Pronunciation Dictionary Catalan Spain N/A N/A N/A N/A 10,000 N/A text Catalan (Spain) Pronunciation Dictionary
Dataset Text Cebuano (Philippines) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 21,000 words Add Dataset to Quote ceb_PHL_PHON Appen Global Pronunciation Dictionary Cebuano Philippines N/A N/A N/A N/A 21,000 N/A text Cebuano (Philippines) Pronunciation Dictionary
Dataset Audio Chinese (multinational foreigner) scripted smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 200 hours Add Dataset to Quote FOREIGNER_ASR001_CN Appen China Scripted Speech Mandarin Chinese China Low background noise 309 1 16 wav Dataset contains audio with corresponding text prompts.
This database contains 200 hours of foreigners speaking Chinese from the following countries: Argentina, Egypt, Australia, Russia, the Philippines, Kazakhstan, Korea, Kyrgyzstan, Canada, Kuala Lumpur, Kenya, Laos, Malaysia, Mauritius, the United States, Mongolia, South Africa, Japan, Tajikistan, Thailand, Turkey, Hong Kong, Singapore, India, Indonesia, Vietnam
There is no data from South Korea or Brazil, and no data recorded by minors.
Each session lasts about an hour; sentence duration ranges from 3 to 10 seconds.
Each recording is of an individual reading aloud, recorded on a mobile phone in a home/office environment.
Sensitive data and personal information have been scrubbed.
Chinese (multinational foreigner) scripted smartphone
Dataset Text Chinese and English related texts Common Use Cases: LLM training Recording Device: N/A Unit: 400000 Add Dataset to Quote GLWB_CN Appen China LLM training English/Chinese N/A N/A N/A N/A Available upon request Available upon request N/A json This data set contains long article content in English and Chinese, sourced from publicly available books including title, author and language metadata. Chinese and English related texts
Dataset Text Chinese command and control prompt response corpus Common Use Cases: LLM training, Command and Control, TV Player, Device Control Recording Device: N/A Unit: 20000 sentences Add Dataset to Quote DSDH_corpus_CN Appen China LLM training Chinese China N/A N/A N/A N/A N/A N/A txt App Commands, Question & response pairs, tagged with categories and intents, for use with TV player controls, lifestyle services, and device control. Chinese command and control prompt response corpus
Dataset Text Chinese instruction set sentence corpus Common Use Cases: LLM training Recording Device: N/A Unit: 200000 sentences Add Dataset to Quote ZLJ_corpus_CN Appen China LLM training Chinese China N/A N/A N/A N/A N/A N/A txt Sentence corpus containing 10 sections:
Question and answer class instruction set (ZLCWD_corpus_CN);
Multi-turn dialogue instruction set prompt-response pairs (ZLCDH_corpus_CN);
Logical reasoning instruction set prompt (Topic) – response (Reasoning) pairs (ZLCLJ_corpus_CN);
Programming code language instruction set prompt-response pairs, e.g. python (ZLCDM_corpus_CN);
Brainstorming instruction set question-answer pairs (ZLCTN_corpus_CN);
Text rewriting-instruction set original-rewritten pairs (ZLCGX_corpus_CN);
Text to reply to security – command set (ZLCAQ_corpus_CN);
Roleplay instruction set prompt-response pairs (ZLCJS_corpus_CN);
Long text-instruction set prompt-response pairs (ZLCCWB_corpus_CN);
Text generation instruction set prompt-response pairs (ZLCWB_corpus_CN)
Chinese instruction set sentence corpus
Dataset Text Chinese multidisciplinary test questions corpus Common Use Cases: LLM training Recording Device: N/A Unit: 319970 sentences Add Dataset to Quote MTQ_CN Appen China LLM training Chinese China N/A N/A 1 N/A N/A N/A json Corpus containing 8 sections of middle-high school prompt response pairs with metadata Subject, Grade, Knowledge Area, Question Type, Question, Answer, Difficulty. Question categories included are:
Geography – 30k sentences (DLT001_CN);
Chemistry – 40k sentences (HXT001_CN);
History – 40k sentences (LST001_CN);
Biology – 40k sentences (SWT001_CN);
Math – 30k sentences (SXT001_CN);
Physics – 40k sentences (WLT001_CN);
Chinese language – 10k sentences (YWT001_CN);
Political – 40k sentences (ZZT001_CN)
Chinese multidisciplinary test questions corpus
Dataset Text Chinese news text summaries corpus Common Use Cases: LLM training Recording Device: N/A Unit: 20000 summaries Add Dataset to Quote DMXWB_corpus_CN Appen China LLM training Chinese China N/A N/A N/A N/A N/A N/A xls Summaries of main events and themes from news data in 15 domains (Finance and economics, Lottery ticket, House property, Share certificate, Home furnishings, Education, Science & Technology, Society & people’s livelihood, Fashion, Politics, Sports activities, Constellation, Game, Entertainment) Chinese news text summaries corpus
Dataset Text Code Q&A Dataset Common Use Cases: LLM training Recording Device: N/A Unit: 12 million pairs Add Dataset to Quote DM_CNRD Appen China LLM training English N/A N/A N/A N/A Available upon request Available upon request N/A json This is a text dataset of coding questions and answers in English, sourced through web-spidering with subsequent clean up and filtering. Programming languages include: JavaScript, Python, Java, C#, PHP, C++, SQL, R, C, Swift. Topics include: computer, scientific research technology, wholesale and retail, finance, entertainment and other industries Code Q&A Dataset
Dataset Audio Croatian (Croatia) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 39 hours Add Dataset to Quote CRO_ASR001 Appen Global Conversational Speech Croatian Croatia Low background noise (home/office) 200 2 Available on request 23,919 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
53% landline, 47% mobile
Conversations cover a range of topics including: News & Current Affairs, Health and Sport.
Croatian (Croatia) conversational telephony
Dataset Text Croatian (Croatia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 19,000 words Add Dataset to Quote hrv_HRV_PHON Appen Global Pronunciation Dictionary Croatian Croatia N/A N/A N/A N/A 19,000 N/A text Croatian (Croatia) Pronunciation Dictionary
Dataset Audio Croatian (Croatia) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 11 hours Add Dataset to Quote CRO_ASR002 GlobalPhone Scripted Speech Croatian Croatia Mixed (quiet home/office, public, outdoor) 94 1 4,499 23,929 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Croatian (Croatia) scripted microphone
Dataset Audio Croatian (Croatia) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 263 hours Add Dataset to Quote CRO_ASR003_CN Appen China Scripted Speech Croatian Croatia Low background noise (home/office) 243 1 73,467 136,140 16 wav Dataset contains audio with corresponding text prompts Croatian (Croatia) scripted smartphone
Dataset Text Czech (Czech Republic) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 50,000 words Add Dataset to Quote ces_CZE_PHON Appen Global Pronunciation Dictionary Czech Czech Republic N/A N/A N/A N/A 50,000 N/A text Czech (Czech Republic) Pronunciation Dictionary
Dataset Audio Czech (Czech Republic) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 31 hours Add Dataset to Quote CZE_ASR001 GlobalPhone Scripted Speech Czech Czech Republic Mixed (quiet home/office, public, outdoor) 102 1 12,425 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Czech (Czech Republic) scripted microphone
Dataset Audio Czech (Czech Republic) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 93 hours Add Dataset to Quote Czech SpeechDat(E) Dataset Nuance Scripted Speech Czech Czech Republic Low background noise 1,000 1 52,000 Available on request 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich words and sentences
Czech (Czech Republic) scripted telephony
Dataset Text Danish (Denmark) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 100,000 words Add Dataset to Quote dan_DNK_POS Appen Global Part of Speech Dictionary Danish Denmark N/A N/A N/A N/A 100,000 N/A text Danish (Denmark) Part of Speech Dictionary
Dataset Text Danish (Denmark) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 107,000 words Add Dataset to Quote dan_DNK_PHON Appen Global Pronunciation Dictionary Danish Denmark N/A N/A N/A N/A 107,000 N/A text Danish (Denmark) Pronunciation Dictionary
Dataset Audio Danish (Denmark) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 53 hours Add Dataset to Quote Speecon Danish Nuance Scripted Speech Danish Denmark Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Danish (Denmark) scripted microphone
Dataset Audio Dari (Afghanistan) broadcast Common Use Cases: ASR, Automatic Captioning, Keyword Spotting Recording Device: N/A Unit: 49 hours Add Dataset to Quote DAR_BRC001 Appen Global Broadcast Speech Dari Afghanistan Low background noise (studio) N/A 1 Available on request Available on request 16 – 48 wav Dataset is fully transcribed and timestamped
Pronunciation lexicon not currently available but can be developed upon request
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
13% landline, 87% mobile
Dari (Afghanistan) broadcast
Dataset Audio Dari (Afghanistan) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 40 hours Add Dataset to Quote DAR_ASR001 Appen Global Conversational Speech Dari Afghanistan Low background noise 500 2 Available on request 11,168 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Dataset is largely speech only and does not include music or advertisements
13% landline, 87% mobile
Dari (Afghanistan) conversational telephony
Dataset Text Dari (Afghanistan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 31,000 words Add Dataset to Quote prs_AFG_PHON Appen Global Pronunciation Dictionary Dari Afghanistan N/A N/A N/A N/A 31,000 N/A text Dari (Afghanistan) Pronunciation Dictionary
Dataset Text Dholuo (Kenya) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 23,000 words Add Dataset to Quote luo_KEN_PHON Appen Global Pronunciation Dictionary Dholuo Kenya N/A N/A N/A N/A 23,000 N/A text Dholuo (Kenya) Pronunciation Dictionary
Dataset Audio Dongbei dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Recording pen/microphone Unit: 84.6 hours Add Dataset to Quote DONGBEI_ASR001_CN Appen China Conversational Speech Dongbei dialect China Low background noise 268 1 16 wav Audio only; transcription in development for Q1 2025
Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District
Northeast suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Dongbei dialect (China) Conversational Speech
Dataset Audio Dongbei dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 75.2 hours Add Dataset to Quote DONGBEI_ASR002_CN Appen China Conversational Speech Dongbei dialect China Low background noise 185 1 8 wav Audio only; transcription in development for Q1 2025
Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District
Northeast suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Dongbei dialect (China) Conversational Speech
Dataset Audio Dutch (Belgium) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 47 hours Add Dataset to Quote Speecon Dutch from Belgium Nuance Scripted Speech Dutch Belgium Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Dutch (Belgium) scripted microphone
Dataset Audio Dutch (Belgium) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Microphone Unit: 80 hours Add Dataset to Quote Flemish SpeechDat(II) FDB-1000 (FIXED1FL) Nuance Scripted Speech Dutch Belgium Low background noise 1,000 1 52,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
Dutch (Belgium) scripted telephony
Dataset Audio Dutch (Netherlands & Belgium) scripted in-car Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment Recording Device: Microphone and mobile phone Unit: 27 hours Add Dataset to Quote Dutch and Flemish SpeechDat-Car Nuance Scripted Speech Dutch Netherlands – Belgium Mixed (in-car) 302 5 15,100 Available on request 16 and 8 Available on request Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
125 prompts per adult speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
Dutch (Netherlands & Belgium) scripted in-car
Dataset Audio Dutch (Netherlands) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 36 hours Add Dataset to Quote NLD_ASR001 Appen Global Conversational Speech Dutch Netherlands Low background noise 200 2 Available on request 14,964 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
51% landline, 49% mobile
Conversations cover a range of topics including: Holiday/Leisure, Work and Sport.
Dutch (Netherlands) conversational telephony
Dataset Text Dutch (Netherlands) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 45,000 words Add Dataset to Quote nld_NLD_PHON Appen Global Pronunciation Dictionary Dutch Netherlands N/A N/A N/A N/A 45,000 N/A text Dutch (Netherlands) Pronunciation Dictionary
Dataset Audio Dutch (Netherlands) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 68 hours Add Dataset to Quote Speecon Dutch from the Netherlands Nuance Scripted Speech Dutch Netherlands Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Dutch (Netherlands) scripted microphone
Dataset Image East African facial images Common Use Cases: Facial Recognition Recording Device: Camera Unit: 13500 images Add Dataset to Quote IMG_FACE_KEN_CN Appen China Human Face N/A Kenya Mixed background and lighting conditions 99 N/A N/A N/A N/A jpg Images of 99 participants across a variety of conditions (lighting, distance, camera angles, facial expressions, and accessories).
9 different lighting conditions, 2 different distances between the participant’s face and the smartphone, and 7 different camera angles. All combinations of these 3 requirements were completed per participant.
A random 32 images per person include occlusions such as sunglasses, masks, wigs or hats
A random 36 shots include different facial expressions, including stare, open mouth, pout mouth, smile and frown
Lighting conditions: indoor normal light, outdoor normal light, indoor backlight, outdoor backlight, indoor ordinary dark light, full black screen fill light, point light source (white light, street light), neon light (monochromatic red, green and blue, multi-color mixed light), side glare
Distances: 30cm and 50cm
Camera angles: front, left 45°, right 45°, left 15°, right 15°, top 30°, bottom 30°
East African facial images
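For the East African facial images entry above, the statement that all combinations of lighting, distance and camera angle were completed per participant can be illustrated with a minimal Python sketch. The values are copied from the listing; the names `LIGHTING`, `DISTANCES_CM`, `ANGLES` and `capture_grid` are hypothetical and not part of any Appen tooling or file layout.

```python
from itertools import product

# Values taken from the listing; all identifiers here are illustrative assumptions.
LIGHTING = [
    "indoor normal light",
    "outdoor normal light",
    "indoor backlight",
    "outdoor backlight",
    "indoor ordinary dark light",
    "full black screen fill light",
    "point light source",
    "neon light",
    "side glare",
]
DISTANCES_CM = [30, 50]
ANGLES = ["front", "left 45", "right 45", "left 15", "right 15", "top 30", "bottom 30"]


def capture_grid():
    """Return every (lighting, distance, angle) combination for one participant."""
    return list(product(LIGHTING, DISTANCES_CM, ANGLES))


grid = capture_grid()
print(len(grid))  # 9 * 2 * 7 = 126 combinations per participant
```

The sketch covers only the lighting, distance and angle grid; the occlusion and facial-expression shots mentioned in the listing are not modelled here.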
Dataset Image Electric vehicles in elevators Common Use Cases: Image recognition Recording Device: N/A Unit: 17132 images Add Dataset to Quote IMG_DDC_CN Appen China Image recognition N/A China N/A N/A N/A N/A N/A N/A jpg Images of electric vehicles in elevator scenes, with no more than 5 images of the same electric vehicle. All images are annotated (monitoring perspective) with bounding boxes and labels (person, vehicle) Electric vehicles in elevators
Dataset Audio English (Arabic – Levant/Egypt) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 28 hours Add Dataset to Quote ENA_ASR001 Appen Global Conversational Speech English Egypt Low background noise 250 2 33,057 5,619 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
English (Arabic – Levant/Egypt) conversational telephony
Dataset Text English (Australia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 157,000 words Add Dataset to Quote eng_AUS_PHON Appen Global Pronunciation Dictionary English Australia N/A N/A N/A N/A 157,000 N/A text English (Australia) Pronunciation Dictionary
Dataset Audio English (Australia) scripted telephony Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone and landline Unit: 92 hours Add Dataset to Quote AUS_ASR001 Appen Global Scripted Speech English Australia Low background noise (home/office) 500 1 82,500 35,137 8 alaw or wav Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
162 prompts (read speech) per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
English (Australia) scripted telephony
Dataset Audio English (Australia) scripted telephony Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone and landline Unit: 118 hours Add Dataset to Quote AUS_ASR002 Appen Global Scripted Speech English Australia Mixed 1,000 1 75,000 18,952 8 alaw or wav Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
75 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
The prompts are a mixture of ‘read’ and ‘elicited’ items where 5 prompts per script are ‘spontaneous free speech’
English (Australia) scripted telephony
Dataset Text English (Canada) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 3,000 words Add Dataset to Quote eng_CAN_POS Appen Global Part of Speech Dictionary English Canada N/A N/A N/A N/A 3,000 N/A text English (Canada) Part of Speech Dictionary
Dataset Text English (Canada) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 50,000 words Add Dataset to Quote eng_CAN_PHON Appen Global Pronunciation Dictionary English Canada N/A N/A N/A N/A 50,000 N/A text English (Canada) Pronunciation Dictionary
Dataset Audio English (Canada) scripted telephony Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone and landline Unit: 144 hours Add Dataset to Quote ENC_ASR001 Appen Global Scripted Speech English Canada Mixed 1,000 1 99,000 12,483 8 alaw or wav Fully transcribed to SALA II/SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
99 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
English (Canada) scripted telephony
Dataset Text English (Hong Kong) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 18,000 words Add Dataset to Quote eng_HKG_PHON Appen Global Pronunciation Dictionary English Hong Kong N/A N/A N/A N/A 18,000 N/A text English (Hong Kong) Pronunciation Dictionary
Dataset Audio English (India) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 143 hours Add Dataset to Quote ENI_ASR003 Appen Global Conversational Speech English India Mixed (home, car, public place, outdoor) 272 1 145559 20746 48 wav Dataset is fully transcribed and time stamped
Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
English (India) conversational smartphone
Dataset Audio English (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 67 hours Add Dataset to Quote ENI_ASR002 Appen Global Conversational Speech English India Low background noise 540 2 77,565 11,646 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
271 telephony conversations are recorded for this project
English (India) conversational telephony
Dataset Text English (India) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 13,000 words Add Dataset to Quote eng_IND_POS Appen Global Part of Speech Dictionary English India N/A N/A N/A N/A 13,000 N/A text English (India) Part of Speech Dictionary
Dataset Text English (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 60,000 words Add Dataset to Quote eng_IND_PHON Appen Global Pronunciation Dictionary English India N/A N/A N/A N/A 60,000 N/A text English (India) Pronunciation Dictionary
Dataset Audio English (India) scripted telephony Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone and landline Unit: 217 hours Add Dataset to Quote ENI_ASR001 Appen Global Scripted Speech English India Mixed 2,358 1 115,541 9,190 8 alaw or wav Fully transcribed to SpeechDAT type conventions.
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
49 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
English (India) scripted telephony
Dataset Text English (Ireland) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 12,000 words Add Dataset to Quote eng_IRL_PHON Appen Global Pronunciation Dictionary English Ireland N/A N/A N/A N/A 12,000 N/A text English (Ireland) Pronunciation Dictionary
Dataset Text English (NZ) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 28,000 words Add Dataset to Quote eng_NZL_PHON Appen Global Pronunciation Dictionary English NZ N/A N/A N/A N/A 28,000 N/A text English (NZ) Pronunciation Dictionary
Dataset Audio English (Philippines) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 53 hours Add Dataset to Quote ENF_ASR001 Appen Global Conversational Speech English Philippines Low background noise 450 2 41,602 7,272 8 alaw or wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
English (Philippines) conversational telephony
Dataset Text English (Philippines) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 7,000 words Add Dataset to Quote eng_PHL_PHON Appen Global Pronunciation Dictionary English Philippines N/A N/A N/A N/A 7,000 N/A text English (Philippines) Pronunciation Dictionary
Dataset Text English (United Arab Emirates (UAE)) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 5,000 words Add Dataset to Quote eng_ARE_PHON Appen Global Pronunciation Dictionary English United Arab Emirates (UAE) N/A N/A N/A N/A 5,000 N/A text English (United Arab Emirates (UAE)) Pronunciation Dictionary
Dataset Audio English (United Arab Emirates (UAE)) scripted telephony Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone and landline Unit: 33 hours Add Dataset to Quote OrienTel English as spoken in the United Arab Emirates Nuance Scripted Speech English United Arab Emirates (UAE) Low background noise 500 1 25,500 3990 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
English (United Arab Emirates (UAE)) scripted telephony
Dataset Audio English (United Kingdom) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 150 hours Add Dataset to Quote UKE_ASR001 Appen Global Conversational Speech English United Kingdom Low background noise 1,175 2 298,562 24,193 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
This version contains full 15-minute calls – there is a reduced version with 5 min calls named UKE_ASR001B.
English (United Kingdom) conversational telephony
Dataset Audio English (United Kingdom) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 50 hours Add Dataset to Quote UKE_ASR001B Appen Global Conversational Speech English United Kingdom Low background noise 1,150 2 Available on request 13,192 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
This version contains full 5-minute calls – there is an expanded version with 15 min calls named UKE_ASR001.
English (United Kingdom) conversational telephony
Dataset Text English (United Kingdom) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 155,000 words Add Dataset to Quote eng_GBR_POS Appen Global Part of Speech Dictionary English United Kingdom N/A N/A N/A N/A 155,000 N/A text English (United Kingdom) Part of Speech Dictionary
Dataset Text English (United Kingdom) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 195,000 words Add Dataset to Quote eng_GBR_PHON Appen Global Pronunciation Dictionary English United Kingdom N/A N/A N/A N/A 195,000 N/A text English (United Kingdom) Pronunciation Dictionary
Dataset Audio English (United Kingdom) TTS female scripted microphone Common Use Cases: TTS Recording Device: Headset microphone Unit: 11 hours Add Dataset to Quote TC-STAR female baseline voice Laura Nuance TTS Scripted Speech English United Kingdom Low background noise (studio) 1 1 Available on request Available on request 96 Available on request Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
English (United Kingdom) TTS female scripted microphone
Dataset Audio English (United Kingdom) TTS male scripted microphone Common Use Cases: TTS Recording Device: Headset microphone Unit: 7 hours Add Dataset to Quote TC-STAR male baseline voice Ian Nuance TTS Scripted Speech English United Kingdom Low background noise (studio) 1 1 Available on request Available on request 96 Available on request Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked)
Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription
English (United Kingdom) TTS male scripted microphone
Dataset Audio English (United States – African American) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 50 hours Add Dataset to Quote USE_ASR004 Appen Global Conversational Speech English United States Mixed (home, car, public place, outdoor) 94 1 58316 13468 48 wav Dataset is fully transcribed and time stamped
Two person conversations recorded on a smartphone covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
English (United States – African American) conversational smartphone
Dataset Text English (United States) Adversarial prompts for LLM red teaming **in development** Common Use Cases: LLM training, LLM Red teaming Recording Device: N/A Unit: 500 prompts Add Dataset to Quote eng_USA_LLM002 Appen Global LLM training English United States N/A Available upon request N/A 500 Available upon request N/A csv Adversarial prompts in English for LLM red teaming
500 already collected with QA underway; total of 1000 prompts planned for development. Can be prioritized upon request.
Please enquire about our optional benchmarking service to rate harm levels in model responses.
English (United States) Adversarial prompts for LLM red teaming **in development**
Dataset Audio English (United States) answers to questions **in development** Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 65 hours Add Dataset to Quote USE_ASR007 Appen Global Scripted Speech English United States Low background noise 100 1 40000 Available on request 16 wav Participants recorded themselves on a smartphone answering prompted questions, e.g. “What’s your favourite food?”. There were 100 prompts per session, and 1000 unique prompts across the whole collection.
Audio is collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request.
English (United States) answers to questions **in development**
Dataset Text English (United States) Chatbot conversations **in development** Common Use Cases: LLM training, Chatbot, Virtual Assistant Recording Device: N/A Unit: 1800 prompts Add Dataset to Quote eng_USA_LLM003 Appen Global LLM training English United States N/A Available upon request N/A Available upon request Available upon request N/A csv Real-world conversations between a user and a chatbot. Conversations were collected by asking questions of customer service chatbots on websites from a variety of industries to trigger 3 different conversation formats: chatbot input, chatbot solutions and follow up, solution instructions. Domains include: financial, retail, entertainment, IT
Data collected, QA is underway, expected to be ready by EOY 2024. Can be prioritized upon request.
English (United States) Chatbot conversations **in development**
Dataset Audio English (United States) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 1000 hours Add Dataset to Quote USE_ASR003 Appen Global Conversational Speech English United States Low background noise 1,856 1 500,000 52,586 16 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Conversations cover a wide variety of topics including: study/major/work, hometown, living arrangements, weather and seasons, punctuality, TV programs/film
English (United States) conversational smartphone
Dataset Audio English (United States) conversational smartphone **in development** Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 2 hours Add Dataset to Quote USE_ASR008 Appen Global Conversational Speech English United States Low background noise 6 1 Available on request Available on request 16 wav Two participants recorded themselves on a smartphone having a 10-15 minute natural conversation on a selected topic, e.g. Hobbies, History. Includes some AAVE participants and some toxic speech
Audio is collected; QA, transcription and labelling are underway, expected to be ready Q1 2025. Can be prioritized upon request.
English (United States) conversational smartphone **in development**
Dataset Audio English (United States) device commands **in development** Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 40 hours Add Dataset to Quote USE_ASR006 Appen Global Scripted Speech English United States Low background noise 100 1 23000 Available on request 16 wav Participants recorded themselves on a smartphone saying device commands in response to a prompt, e.g. “Tell the device to disable shuffle mode”. There were 94 prompts per session, and 280 unique prompts across the whole collection.
Audio is collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request.
English (United States) device commands **in development**
Dataset Text English (United States) Harmful and harmless prompts and responses **in development** Common Use Cases: LLM training, LLM Red teaming, Chatbot Recording Device: N/A Unit: 300 prompts Add Dataset to Quote eng_USA_LLM001 Appen Global LLM training English United States N/A Available upon request N/A 300 Available upon request N/A csv Prompts and responses annotated for Harm category, Intensity, Voice, and Phrasing.
Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request.
English (United States) Harmful and harmless prompts and responses **in development**
Dataset Text English (United States) Medical Terms Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 8,000 words Add Dataset to Quote eng_USA_Med_PHON Appen Global Pronunciation Dictionary English United States N/A N/A N/A N/A 8,000 N/A text Pronunciation dictionary of medical terms with their associated transcriptions and domain tagging.
The data comprises medical words extracted from PubMed abstracts, as well as pharmaceutical drug names collected by Appen through web-spidering. Pronunciations were processed by native speakers of US English, and domain tagging was done by a team of US English native speakers with medical transcription or other medical qualifications and experience.
Domains include: Anatomy, Biochem/biological, Condition, General, Organisation, Person, Pharmaceutical, Procedure.
English (United States) Medical Terms Pronunciation Dictionary
Dataset Text English (United States) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 263,000 words Add Dataset to Quote eng_USA_POS Appen Global Part of Speech Dictionary English United States N/A N/A N/A N/A 263,000 N/A text English (United States) Part of Speech Dictionary
Dataset Image English (United States) product labels **in development** Common Use Cases: Image recognition, Object recognition, Retail Recording Device: Camera Unit: 60000 images Add Dataset to Quote IMG_OCR_USE_ProductLabels Appen Global Image recognition English United States Mixed lighting conditions Available upon request N/A N/A N/A N/A jpg Photos of various products including the label
Annotated for category e.g. Food, health & beauty, pet supplies.
Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request.
No bounding box or text transcription annotation planned so far, but can be developed upon request.
English (United States) product labels **in development**
Dataset Text English (United States) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 358,000 words Add Dataset to Quote eng_USA_PHON Appen Global Pronunciation Dictionary English United States N/A N/A N/A N/A 358,000 N/A text English (United States) Pronunciation Dictionary
Dataset Image English (United States) receipts **in development** Common Use Cases: Image recognition, Object recognition, OCR, Text detection Recording Device: Camera Unit: 4500 images Add Dataset to Quote IMG_OCR_USE_RECEIPTS Appen Global OCR English United States Mixed lighting conditions Available upon request N/A N/A N/A N/A jpg Photos of receipts, bills or invoices, annotated with bounding boxes and transcribed text. PII redacted.
Data collected, QA is underway, expected to be ready end of Q1 2025. Can be prioritized upon request.
English (United States) receipts **in development**
Dataset Audio English (United States) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 53 hours Add Dataset to Quote Speecon English (USA) database Nuance Scripted Speech English United States Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
English (United States) scripted microphone
Dataset Audio English (United States) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 62 hours Add Dataset to Quote USE_ASR001 Appen Global Scripted Speech English United States Low background noise (studio) 200 2 80,000 18,318 48 raw PCM or wav PCM Dataset is fully transcribed and timestamped
Dataset is formatted according to SALA II/SpeechDAT style conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 400 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
English (United States) scripted microphone
Dataset Audio English (United States) scripted sentences **in development** Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 500 hours Add Dataset to Quote USE_ASR005 Appen Global Scripted Speech English United States Low background noise 250 1 300000 Available on request 16 wav Participants recorded themselves on a smartphone reading out prompted sentences. There were 96 prompts per session, and 9000 unique sentences across the whole collection.
Audio is already collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request.
English (United States) scripted sentences **in development**
Dataset Image English (United States) street signs Common Use Cases: Image recognition, Object recognition, OCR, Text detection Recording Device: Camera Unit: 669 images Add Dataset to Quote IMG_OCR_USE_STREET001 Appen Global OCR English N/A Mixed lighting conditions N/A N/A N/A N/A N/A png Photographs of street signs, 51% traffic signs and 49% other. English from 18 locales. English (United States) street signs
Dataset Image English (United States) street signs **in development** Common Use Cases: Image recognition, Object recognition, OCR, Text detection Recording Device: Camera Unit: 3500 images Add Dataset to Quote IMG_OCR_USE_STREET002 Appen Global OCR English United States Mixed lighting conditions Available upon request N/A N/A Available upon request N/A jpg Photos of US street signs, annotated with bounding boxes, transcribed text, and text description of the sign
Data has been collected and QA is underway; the dataset is expected to be ready in Q1 2025. It can be prioritized upon request.
English (United States) street signs **in development**
Dataset Image English (United States) symbols **in development** Common Use Cases: Image recognition, Object recognition, OCR Recording Device: Camera Unit: 1500 images Add Dataset to Quote IMG_SYMBOLS_US Appen Global OCR English United States Mixed lighting conditions Available upon request N/A N/A N/A N/A jpg Photos of symbols – small pictures that communicate an action or warning (e.g. recycling or laundering instructions) encountered in everyday life, with text descriptions
Data has been collected and QA is underway; the dataset is expected to be ready in Q1 2025. It can be prioritized upon request.
English (United States) symbols **in development**
Dataset Text English (United States) Text message conversations Common Use Cases: Chatbot, Virtual Assistant, Conversational AI Recording Device: N/A Unit: 100 conversations Add Dataset to Quote eng_USA_SMS003 Appen Global Text messages English United States N/A Available upon request N/A Available upon request Available upon request N/A tsv Short WhatsApp and SMS text message conversations (20-400 words), labelled for topic English (United States) Text message conversations
Dataset Audio English (United States) Ultra High-Volume labeled speech Common Use Cases: ASR, Conversational AI, Speech Analytics, Automatic Captioning, In Car HMI & Entertainment, Virtual Assistant Recording Device: N/A Unit: 1196 hours Add Dataset to Quote USE_UHV001 Appen Global Broadcast Speech English United States Low background noise 20472 1 423371 110265 16 wav Customised packaging available
High quality labelled speech datasets of web-sourced licensable broadcast audio data, curated to ensure representative speaker demographic distributions, and filtered through human quality checks.
12.6M total words
Utterance-level labelling includes: speech transcription, accent identification, speaker identification, verification, gender and age-group detection, domain classification.
Domains include: Agriculture & plants, Animals & Pets, Art & Culture, Beauty & Fashion, Career, Clothing, Education, Entertainment, Family & Relationships, Finance & Insurance, Food, Health, History, Hospitality, Legal, Leisure, News & Politics, Religion & Spirituality, Retail, Science & Technology, Social Networks, Sports, Telecom, Travel, Weather, Others
English (United States) Ultra High-Volume labeled speech
Dataset Text English Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 4454 test cases Add Dataset to Quote ENG_ITN001 Appen Global Inverse text normalisation English N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers English Inverse text normalisation
Dataset Text English NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 22,768 sentences Add Dataset to Quote ENG_NER001 Appen Global News NER English N/A N/A N/A N/A 22,768 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity English NER news text
Dataset Image Annotation European License Plate Detection Annotations Common Use Cases: License plate detection for vehicles on the road Recording Device: N/A Unit: 100,000 bounding boxes Add Dataset to Quote LICENSE_ANNO Appen Global Image and Video Bounding Box Annotations N/A Germany, France, Switzerland N/A N/A N/A N/A N/A N/A json This dataset contains 100,000 license plate bounding box annotations of 38,000 images and video frames from the KITTI and Cityscapes datasets, from Germany, France and Switzerland. Metadata on the size and position distributions of the bounding boxes is included.
The source images and video frames cover complex real-world scenes in good/medium weather conditions, captured over several months (spring, summer, fall), with varying scene layouts, backgrounds, and occlusion.
The annotations were carried out by Appen’s in-house workforce.
European License Plate Detection Annotations
Dataset Text Farsi/Persian NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 19,584 sentences Add Dataset to Quote FAR_NER001 Appen Global News NER Iranian Persian Iran N/A N/A N/A 19,584 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Farsi/Persian NER news text
Dataset Text Finnish (Finland) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 10,000 words Add Dataset to Quote fin_FIN_POS Appen Global Part of Speech Dictionary Finnish Finland N/A N/A N/A N/A 10,000 N/A text Finnish (Finland) Part of Speech Dictionary
Dataset Image Finnish (Finland) printed text OCR Common Use Cases: Document Processing, Document Search, Text detection Recording Device: Camera Unit: 7293 images Add Dataset to Quote IMG_OCR_FIN_CN Appen China Document OCR Finnish Finland Mixed lighting conditions 4 N/A N/A N/A N/A jpg Images containing text, such as billboards / outer packaging / signage / magazines / menus, etc. Finnish (Finland) printed text OCR
Dataset Text Finnish (Finland) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 86,000 words Add Dataset to Quote fin_FIN_PHON Appen Global Pronunciation Dictionary Finnish Finland N/A N/A N/A N/A 86,000 N/A text Finnish (Finland) Pronunciation Dictionary
Dataset Text French (Algeria) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 4,000 words Add Dataset to Quote fra_DZA_PHON Appen Global Pronunciation Dictionary French Algeria N/A N/A N/A N/A 4,000 N/A text Arabic script French (Algeria) Pronunciation Dictionary
Dataset Audio French (Belgium) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 76 hours Add Dataset to Quote Belgian French SpeechDat(II) FDB-1000 (FIXED1BF) Nuance Scripted Speech French Belgium Low background noise 1,000 1 53,000 Available on request 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control
French (Belgium) scripted telephony
Dataset Audio French (Canada) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 9 hours Add Dataset to Quote FRC_ASR003 Appen Global Conversational Speech French Canada Mixed 68 2 Available on request 6,022 8 alaw Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Average length of calls: 10-15 mins
For the majority of calls, only one half of the conversation was collected and transcribed; however, for a smaller number of calls, both speakers (in-line/out-line) were collected and transcribed
French (Canada) conversational telephony
Dataset Text French (Canada) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 67,000 words Add Dataset to Quote fra_CAN_PHON Appen Global Pronunciation Dictionary French Canada N/A N/A N/A N/A 67,000 N/A text French (Canada) Pronunciation Dictionary
Dataset Audio French (Canada) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 46 hours Add Dataset to Quote FRC_ASR002 Appen Global Scripted Speech French Canada Low background noise (home/office) 150 1 22,500 10,755 16 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
150 prompts per speaker including digits, digit strings (randomly generated), addresses and phonetically rich sentences and words
French (Canada) scripted microphone
Dataset Audio French (Canada) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone Unit: 131 hours Add Dataset to Quote FRC_ASR001 Appen Global Scripted Speech French Canada Mixed 1,000 1 100,000 11,697 8 mulaw Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
100 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
French (Canada) scripted telephony
Dataset Audio French (France) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 159 hours Add Dataset to Quote FRF_ASR004 Appen Global Conversational Speech French France Mixed (home, car, public place, outdoor) 298 1 Available on request Available on request 48 wav Dataset is fully transcribed and time stamped
Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
French (France) conversational smartphone
Dataset Audio French (France) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 25 hours Add Dataset to Quote FRF_ASR001 Appen Global Conversational Speech French France Low background noise 563 2 Available on request 11,922 8 alaw or wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
French (France) conversational telephony
Dataset Audio French (France) In-Car Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment Recording Device: Microphone and mobile phone Unit: 113 hours Add Dataset to Quote French SpeechDat-Car Nuance Scripted Speech French France Mixed (in-car) 300 5 37,500 Available on request 16 and 8 Available on request Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report
Approximately 125 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech
113.7 hours
French (France) In-Car
Dataset Text French (France) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 95,000 words Add Dataset to Quote fra_FRA_POS Appen Global Part of Speech Dictionary French France N/A N/A N/A N/A 95,000 N/A text French (France) Part of Speech Dictionary
Dataset Text French (France) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 112,000 words Add Dataset to Quote fra_FRA_PHON Appen Global Pronunciation Dictionary French France N/A N/A N/A N/A 112,000 N/A text French (France) Pronunciation Dictionary
Dataset Audio French (France) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 26 hours Add Dataset to Quote FRF_ASR003 GlobalPhone Scripted Speech French France Mixed (quiet home/office, public, outdoor) 98 1 10,273 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
French (France) scripted microphone
Dataset Audio French (France) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 41 hours Add Dataset to Quote French SpeechDat(II) FDB-1000 Nuance Scripted Speech French France Low background noise (home/office) 1,017 1 48,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
French (France) scripted telephony
Dataset Audio French (France) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 305 hours Add Dataset to Quote French SpeechDat(II) FDB-5000 Nuance Scripted Speech French France Low background noise 5,040 1 237,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
47 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
French (France) scripted telephony
Dataset Audio French (Luxembourg) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 45 hours Add Dataset to Quote Luxembourgish French SpeechDat(II) FDB-500 (FIXED1LF) Nuance Scripted Speech French Luxembourg Low background noise 614 1 32,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
French (Luxembourg) telephony
Dataset Text French Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 3274 test cases Add Dataset to Quote FRA_ITN001 Appen Global Inverse text normalisation French N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers French Inverse text normalisation
Dataset Video Garments image and video collection **in development** Common Use Cases: Image recognition, Object recognition, Retail, e-commerce Recording Device: Camera Unit: 300 sessions Add Dataset to Quote IMG_VID_GARMENTS_US Appen Global Image recognition N/A United States Mixed lighting conditions Available upon request 1 N/A N/A N/A jpg, mp4, mov Participants took 2 pictures (front and back) of an item of clothing, and a 60-second video of themselves wearing the garment and moving to various angles. Metadata includes demographics, body measurements, labels for garment category (e.g. t-shirt, trousers) and description.
Data has been collected and QA is underway; the dataset is expected to be ready in Q1 2025. It can be prioritized upon request.
Garments image and video collection **in development**
Dataset Text Georgian (Georgia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 67,000 words Add Dataset to Quote kat_GEO_PHON Appen Global Pronunciation Dictionary Georgian Georgia N/A N/A N/A N/A 67,000 N/A text Georgian (Georgia) Pronunciation Dictionary
Dataset Audio German (Germany) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 104 hours Add Dataset to Quote DEU_ASR004 Appen Global Conversational Speech German Germany Mixed (home, car, public place, outdoor) 198 1 Available on request Available on request 48 wav Dataset is fully transcribed and time stamped
Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
German (Germany) conversational smartphone
Dataset Text German (Germany) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 146,000 words Add Dataset to Quote deu_DEU_PHON Appen Global Pronunciation Dictionary German Germany N/A N/A N/A N/A 146,000 N/A text German (Germany) Pronunciation Dictionary
Dataset Audio German (Germany) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 16 hours Add Dataset to Quote DEU_ASR001 Appen Global Scripted Speech German Germany Low background noise (studio) 127 2 12,700 6,826 48 raw PCM Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Each speaker read 100 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words
German (Germany) scripted microphone
Dataset Audio German (Germany) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 25 hours Add Dataset to Quote DEU_ASR003 GlobalPhone Scripted Speech German Germany Mixed (quiet home/office, public, outdoor) 77 1 10,085 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
German (Germany) scripted microphone
Dataset Audio German (Germany) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 31 hours Add Dataset to Quote German SpeechDat (II) FDB-1000 Nuance Scripted Speech German Germany Low background noise (home/office) 988 1 43,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
German (Germany) telephony
Dataset Audio German (Germany) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 268 hours Add Dataset to Quote German SpeechDat(II) FDB-4000 Nuance Scripted Speech German Germany Low background noise (home/office) 4,000 1 160,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
German (Germany) telephony
Dataset Audio German (Luxembourg) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 33 hours Add Dataset to Quote Luxembourgish German SpeechDat(II) FDB-500 (FIXED1LG) Nuance Scripted Speech German Luxembourg Low background noise 500 1 26,500 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
German (Luxembourg) telephony
Dataset Text German (Switzerland) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 27,000 words Add Dataset to Quote deu_CHE_PHON Appen Global Pronunciation Dictionary German Switzerland N/A N/A N/A N/A 27,000 N/A text German (Switzerland) Pronunciation Dictionary
Dataset Audio German (Switzerland) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 53 hours Add Dataset to Quote Speecon German (Switzerland) database Nuance Scripted Speech German Switzerland Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
German (Switzerland) scripted microphone
Dataset Audio German (Turkey) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 31 hours Add Dataset to Quote OrienTel German Spoken by Turkish Nuance Scripted Speech German Turkey Low background noise 300 1 15,600 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
German (Turkey) telephony
Dataset Text German Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 8001 test cases Add Dataset to Quote DEU_ITN001 Appen Global Inverse text normalisation German N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers German Inverse text normalisation
Dataset Audio GlobalPhone Multilingual Text & Speech Database Common Use Cases: ASR, Language Identification, Multilingual Speech Synthesis, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 450 hours Add Dataset to Quote GLOBALPHONE GlobalPhone Scripted Speech N/A Global coverage Mixed (quiet home/office, public, outdoor) 1942 1 169,755 Available on request 16 wav Global Phone multilingual corpus, languages can be sold separately or in multi-language packages. Tiered package pricing available.
GLOBALPHONE provides multilingual speech and text data in 20 languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Thai, Turkish, and Vietnamese.
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
In each language, news article sentences were read by about 100 native speakers. The articles cover national and international political news, as well as economic news, from 1995-2011. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone; the same recording equipment was used for all languages.
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
GlobalPhone Multilingual Text & Speech Database
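The stated audio format (16-bit, 16 kHz, mono) and the listed 450 hours allow a rough estimate of the uncompressed size of the full corpus. The Python sketch below is illustrative only: the function name is hypothetical, the result ignores wav headers, transcripts and any packaging, and the actual delivery size is not stated in the catalog.

```python
def pcm_bytes(hours: float, sample_rate_hz: int = 16_000,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Approximate size of uncompressed PCM audio, excluding file headers."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(seconds * bytes_per_second)


if __name__ == "__main__":
    # 450 hours is the unit listed for the full GlobalPhone corpus.
    size = pcm_bytes(450)
    print(f"{size / 1e9:.2f} GB")  # prints "51.84 GB" (decimal gigabytes)
```

The same function gives a per-language estimate when called with that language's listed hours, for example `pcm_bytes(26)` for the French (France) GlobalPhone entry.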
Dataset Text Greek (Greece) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 5,000 words Add Dataset to Quote ell_GRC_PHON Appen Global Pronunciation Dictionary Greek Greece N/A N/A N/A N/A 5,000 N/A text Greek (Greece) Pronunciation Dictionary
Dataset Audio Greek (Greece) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 191 hours Add Dataset to Quote GRE_ASR001_CN Appen China Scripted Speech Greek Greece Low background noise (home/office) 287 1 54,113 68,271 16 wav Dataset contains audio with corresponding text prompts Greek (Greece) scripted smartphone
Dataset Text Guarani (Paraguay) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 36,000 words Add Dataset to Quote grn_PRY_PHON Appen Global Pronunciation Dictionary Guarani Paraguay N/A N/A N/A N/A 36,000 N/A text Guarani (Paraguay) Pronunciation Dictionary
Dataset Text Haitian Creole (Haiti) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 18,000 words Add Dataset to Quote hat_HTI_PHON Appen Global Pronunciation Dictionary Haitian Creole Haiti N/A N/A N/A N/A 18,000 N/A text Haitian Creole (Haiti) Pronunciation Dictionary
Dataset Video Hand gesture videos **in development** Common Use Cases: Movement detection, Human Body Movement, Action Classification Recording Device: Camera Unit: 5000 videos Add Dataset to Quote HUMAN_BODY_VID004 Appen Global Human Body Movement N/A United States Mixed lighting conditions Available upon request 1 N/A N/A N/A jpg, mp4, mov Approximately 11 hours of video of participants making hand gestures, e.g. thumbs up, wave. Videos may include the participant's face or only their hand. Metadata describing the type of hand gesture in each video is included.
Data has been collected and QA is underway; the dataset is expected to be ready in Q1 2025. It can be prioritized upon request.
Hand gesture videos **in development**
Dataset Image Handwritten text document OCR Common Use Cases: Document Processing, Document Search, Text detection Recording Device: Camera, scan Unit: 663 images Add Dataset to Quote IMG_OCR_Handwritten Appen Global Document OCR N/A N/A Mixed lighting conditions N/A N/A N/A N/A N/A png Scans and photographs of handwritten forms and handwritten documents. 3 Languages: 11% Arabic, 60% English, 29% Russian Handwritten text document OCR
Dataset Audio Hausa (Nigeria) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 33 hours Add Dataset to Quote HAU_ASR002 Appen Global Conversational Speech Hausa Nigeria Low background noise 200 2 Available on request 7,949 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers each made 2 calls (1 from a landline, 1 from a mobile) to a pool of 100 call receivers
Hausa (Nigeria) conversational telephony
Dataset Text Hausa (Nigeria) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 11,000 words Add Dataset to Quote hau_NGA_PHON Appen Global Pronunciation Dictionary Hausa Nigeria N/A N/A N/A N/A 11,000 N/A text Hausa (Nigeria) Pronunciation Dictionary
Dataset Audio Hausa scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 20 hours Add Dataset to Quote HAU_ASR001 GlobalPhone Scripted Speech Hausa Cameroon Mixed (quiet home/office, public, outdoor) 103 1 7,895 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Hausa scripted microphone
Dataset Audio Hebrew (Israel) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 34 hours Add Dataset to Quote HEB_ASR001 Appen Global Conversational Speech Hebrew Israel Low background noise 200 2 Available on request 19,250 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers each made 2 calls (1 from a landline, 1 from a mobile) to a pool of 100 call receivers
50% landline, 50% mobile
Conversations cover a range of topics, including friends, family and studies.
Hebrew (Israel) conversational telephony
Dataset Text Hebrew (Israel) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 31,000 words Add Dataset to Quote heb_ISR_PHON Appen Global Pronunciation Dictionary Hebrew Israel N/A N/A N/A N/A 31,000 N/A text Hebrew (Israel) Pronunciation Dictionary
Dataset Audio Hindi (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics, TTS Recording Device: Mobile phone and landline Unit: 32 hours Add Dataset to Quote HIN_ASR002 Appen Global Conversational Speech Hindi India Mixed 996 2 Available on request 12,266 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
29% landline, 71% mobile
Hindi (India) conversational telephony
Dataset Text Hindi (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 35,000 words Add Dataset to Quote hin_IND_PHON Appen Global Pronunciation Dictionary Hindi India N/A N/A N/A N/A 35,000 N/A text Hindi (India) Pronunciation Dictionary
Dataset Audio Hindi (India) scripted telephony Common Use Cases: ASR, Virtual Assistant, TTS Recording Device: Mobile phone Unit: 224 hours Add Dataset to Quote HIN_ASR001 Appen Global Scripted Speech Hindi India Low background noise 1,920 1 96,000 9,853 8 alaw or wav Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
50 prompts per speaker including digits, natural numbers, personal, business and place names, web addresses, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
Hindi (India) scripted telephony
Dataset Text Hindi Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 6924 test cases Add Dataset to Quote HIN_ITN001 Appen Global Inverse text normalisation Hindi N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers Hindi Inverse text normalisation
Dataset Image Home environment pictures Common Use Cases: Image recognition Recording Device: N/A Unit: 10000 images Add Dataset to Quote IMG_HOME_CN Appen China Image recognition N/A N/A N/A N/A N/A N/A N/A N/A jpg (Data source: website) 4000 images in the study room; 6000 images in the living room. No annotation. Home environment pictures
Dataset Text Hungarian (Hungary) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 500 words Add Dataset to Quote hun_HUN_PHON Appen Global Pronunciation Dictionary Hungarian Hungary N/A N/A N/A N/A 500 N/A text Hungarian (Hungary) Pronunciation Dictionary
Dataset Audio Hungarian (Hungary) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 286 hours Add Dataset to Quote HUN_ASR001_CN Appen China Scripted Speech Hungarian Hungary Low background noise (home/office) 254 1 94,031 201,921 16 wav Dataset contains audio with corresponding text prompts Hungarian (Hungary) scripted smartphone
Dataset Audio Hungarian (Hungary) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 65 hours Add Dataset to Quote Hungarian SpeechDat(E) Nuance Scripted Speech Hungarian Hungary Low background noise 1,000 1 48,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Hungarian (Hungary) scripted telephony
Dataset Text Icelandic (Iceland) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 3,000 words Add Dataset to Quote isl_ISL_PHON Appen Global Pronunciation Dictionary Icelandic Iceland N/A N/A N/A N/A 3,000 N/A text Icelandic (Iceland) Pronunciation Dictionary
Dataset Text Igbo (Nigeria) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 32,000 words Add Dataset to Quote ibo_NGA_PHON Appen Global Pronunciation Dictionary Igbo Nigeria N/A N/A N/A N/A 32,000 N/A text Igbo (Nigeria) Pronunciation Dictionary
Dataset Audio Indonesian (Indonesia) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 150 hours Add Dataset to Quote IND_DH_ASR001_CN Appen China Conversational Speech Indonesian Indonesia Low background noise 1000 2 N/A N/A 16 wav Audio with transcription and timestamping. Indonesian (Indonesia) conversational telephony
Dataset Text Indonesian (Indonesia) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 10,000 words Add Dataset to Quote ind_IDN_POS Appen Global Part of Speech Dictionary Indonesian Indonesia N/A N/A N/A N/A 10,000 N/A text Indonesian (Indonesia) Part of Speech Dictionary
Dataset Text Indonesian (Indonesia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 95,000 words Add Dataset to Quote ind_IDN_PHON Appen Global Pronunciation Dictionary Indonesian Indonesia N/A N/A N/A N/A 95,000 N/A text Indonesian (Indonesia) Pronunciation Dictionary
Dataset Audio Iranian Persian (Farsi) (Iran) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 30 hours Add Dataset to Quote FAR_ASR002 Appen Global Conversational Speech Iranian Persian (Farsi) Iran Mixed 1,000 2 Available on request 12,358 8 wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Iranian Persian (Farsi) (Iran) conversational telephony
Dataset Audio Iranian Persian (Farsi) (Iran) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 85 hours Add Dataset to Quote FAR_ASR001 Appen Global Scripted Speech Iranian Persian (Farsi) Iran Mixed 789 1 38,400 8,716 8 alaw or wav Fully transcribed to OrienTel type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
48 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words
Iranian Persian (Farsi) (Iran) scripted telephony
Dataset Text Iranian Persian (Iran) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 1,400,000 words Add Dataset to Quote pes_IRN_POS Appen Global Part of Speech Dictionary Iranian Persian Iran N/A N/A N/A N/A 1,400,000 N/A text Iranian Persian (Iran) Part of Speech Dictionary
Dataset Text Iranian Persian (Iran) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 85,000 words Add Dataset to Quote pes_IRN_PHON Appen Global Pronunciation Dictionary Iranian Persian Iran N/A N/A N/A N/A 85,000 N/A text Iranian Persian (Iran) Pronunciation Dictionary
Dataset Audio Italian (Italy) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 256 hours Add Dataset to Quote ITA_ASR005 Appen Global Conversational Speech Italian Italy Mixed (home, car, public place, outdoor) 482 1 Available on request Available on request 48 wav Dataset is fully transcribed and time stamped
Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
Italian (Italy) conversational smartphone
Dataset Audio Italian (Italy) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 36 hours Add Dataset to Quote ITA_ASR003 Appen Global Conversational Speech Italian Italy Low background noise 200 2 Available on request 18,974 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
50% landline, 50% mobile
Conversations cover a range of topics including: Travel, Family and Holidays.
Italian (Italy) conversational telephony
Dataset Text Italian (Italy) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 171,000 words Add Dataset to Quote ita_ITA_POS Appen Global Part of Speech Dictionary Italian Italy N/A N/A N/A N/A 171,000 N/A text Italian (Italy) Part of Speech Dictionary
Dataset Text Italian (Italy) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 197,000 words Add Dataset to Quote ita_ITA_PHON Appen Global Pronunciation Dictionary Italian Italy N/A N/A N/A N/A 197,000 N/A text Italian (Italy) Pronunciation Dictionary
Dataset Audio Italian (Italy) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 44 hours Add Dataset to Quote ITA_ASR001 Appen Global Scripted Speech Italian Italy Mixed 200 4 40,000 7,316 22 raw PCM Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
Italian (Italy) scripted microphone
Dataset Audio Italian (Italy) scripted microphone in-car Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment Recording Device: Microphone Unit: 47 hours Add Dataset to Quote ITA_ASR002 Appen Global Scripted Speech Italian Italy Mixed (in-car) 205 4 35,875 10,366 48 raw PCM Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
350 prompts per speaker including digits, street names, generic command and control items, phonetically rich sentences and words
Each speaker recorded 1 or 2 sessions: Session 1 in a parked vehicle with the engine running and Session 2 in a vehicle travelling at 60 mph (100 km/h)
Italian (Italy) scripted microphone in-car
Dataset Audio Italian (Italy) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 38 hours Add Dataset to Quote Italian Fixed Network Speech SpeechDat(M) Corpus Nuance Scripted Speech Italian Italy Low background noise (home/office) 1,000 1 39,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
39 prompts per speaker including isolated and connected digits, natural numbers, money amounts, spelled words, time and date phrases, yes/no questions, city names, common application words, application words in phrases and phonetically rich sentences
Italian (Italy) telephony
Dataset Audio Italian (Italy) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 228 hours Add Dataset to Quote Italian SpeechDat(II) FDB-3000 Nuance Scripted Speech Italian Italy Low background noise (home/office) 3,040 1 134,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Italian (Italy) telephony
Dataset Audio Italian (Italy) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone Unit: 103 hours Add Dataset to Quote Italian SpeechDat(II) MDB-250 Nuance Scripted Speech Italian Italy Low background noise (home/office) 375 1 19,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Italian (Italy) telephony
Dataset Audio Italian (Italy) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone Unit: 13 hours Add Dataset to Quote SpeechDat(M) Italian Mobile Network Speech Database Nuance Scripted Speech Italian Italy Low background noise (home/office) 342 1 13,500 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Italian (Italy) telephony
Dataset Audio Italian (Italy) TTS male scripted microphone Common Use Cases: TTS Recording Device: Microphone Unit: 3 hours Add Dataset to Quote ITA_TTS001 Appen Global TTS Scripted Speech Italian Italy Low background noise (studio) 1 1 3,300 Available on request 22 raw PCM Dataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
3,300 prompts per speaker including phonetically rich sentences
Italian (Italy) TTS male scripted microphone
Dataset Text Japanese (Japan) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 269,000 words Add Dataset to Quote jpn_JPN_POS Appen Global Part of Speech Dictionary Japanese Japan N/A N/A N/A N/A 269,000 N/A text Japanese (Japan) Part of Speech Dictionary
Dataset Text Japanese (Japan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 262,000 words Add Dataset to Quote jpn_JPN_PHON Appen Global Pronunciation Dictionary Japanese Japan N/A N/A N/A N/A 262,000 N/A text Japanese (Japan) Pronunciation Dictionary
Dataset Audio Japanese (Japan) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 33 hours Add Dataset to Quote JPN_ASR001 GlobalPhone Scripted Speech Japanese Japan Mixed (quiet home/office, public, outdoor) 144 1 13,067 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Japanese (Japan) scripted microphone
Dataset Audio Japanese (Japan) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 57 hours Add Dataset to Quote Speecon Japanese Nuance Scripted Speech Japanese Japan Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Japanese (Japan) scripted microphone
Dataset Text Japanese Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 5363 test cases Add Dataset to Quote JPN_ITN001 Appen Global Inverse text normalisation Japanese N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers Japanese Inverse text normalisation
Dataset Text Japanese NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 20,629 sentences Add Dataset to Quote JPY_NER001 Appen Global News NER Japanese Japan N/A N/A N/A 20,629 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Japanese NER news text
Dataset Text Javanese (Indonesia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 22,000 words Add Dataset to Quote jav_IDN_PHON Appen Global Pronunciation Dictionary Javanese Indonesia N/A N/A N/A N/A 22,000 N/A text Javanese (Indonesia) Pronunciation Dictionary
Dataset Audio Kannada (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 15 hours Add Dataset to Quote KAN_ASR001 Appen Global Conversational Speech Kannada India Mixed 178 2 Available on request 15,660 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
15% landline, 85% mobile
Kannada (India) conversational telephony
Dataset Audio Kannada (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 57 hours Add Dataset to Quote KAN_ASR001A Appen Global Conversational Speech Kannada India Mixed 1,000 2 Available on request 15,660 8 alaw Approx. 25% of the dataset sessions are transcribed and time stamped – full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
16% Hands-Free car, 16% Landline quiet, 15% Mobile quiet, 17% Moving vehicle, 19% Public place, 17% Roadside
Kannada (India) conversational telephony
Dataset Text Kannada (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 49,000 words Add Dataset to Quote kan_IND_PHON Appen Global Pronunciation Dictionary Kannada India N/A N/A N/A N/A 49,000 N/A text Kannada (India) Pronunciation Dictionary
Dataset Text Kazakh (Kazakhstan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 31,000 words Add Dataset to Quote kaz_KAZ_PHON Appen Global Pronunciation Dictionary Kazakh Kazakhstan N/A N/A N/A N/A 31,000 N/A text Kazakh (Kazakhstan) Pronunciation Dictionary
Dataset Text Korean (South Korea) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 100,000 words Add Dataset to Quote kor_KOR_POS Appen Global Part of Speech Dictionary Korean South Korea N/A N/A N/A N/A 100,000 N/A text Korean (South Korea) Part of Speech Dictionary
Dataset Text Korean (South Korea) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 105,000 words Add Dataset to Quote kor_KOR_PHON Appen Global Pronunciation Dictionary Korean South Korea N/A N/A N/A N/A 105,000 N/A text Korean (South Korea) Pronunciation Dictionary
Dataset Audio Korean (South Korea) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 20 hours Add Dataset to Quote KOR_ASR001 GlobalPhone Scripted Speech Korean South Korea Mixed (quiet home/office, public, outdoor) 100 1 8,107 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Korean (South Korea) scripted microphone
Dataset Text Korean NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 25,830 sentences Add Dataset to Quote KOR_NER001 Appen Global News NER Korean South Korea N/A N/A N/A 25,830 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Korean NER news text
Dataset Text Kurmanji (Turkey) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 60,000 words Add Dataset to Quote kur_TUR_PHON Appen Global Pronunciation Dictionary Kurmanji Turkey N/A N/A N/A N/A 60,000 N/A text Kurmanji (Turkey) Pronunciation Dictionary
Dataset Text Lao (Laos) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 9,000 words Add Dataset to Quote lao_LAO_PHON Appen Global Pronunciation Dictionary Lao Laos N/A N/A N/A N/A 9,000 N/A text Lao (Laos) Pronunciation Dictionary
Dataset Text Latin American Spanish Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 3795 test cases Add Dataset to Quote SPA_ITN001 Appen Global Inverse text normalisation Spanish N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers Latin American Spanish Inverse text normalisation
Dataset Text Lithuanian (Lithuania) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 71,000 words Add Dataset to Quote lit_LTU_PHON Appen Global Pronunciation Dictionary Lithuanian Lithuania N/A N/A N/A N/A 71,000 N/A text Lithuanian (Lithuania) Pronunciation Dictionary
Dataset Video Location entrance human body movement videos Common Use Cases: Security, Movement detection, Human body movement recognition Recording Device: Camera Unit: 130 videos Add Dataset to Quote HUMAN_BODY_VID002 Appen Global Human Body Movement N/A United Kingdom, Philippines Mixed background and lighting conditions 100 3 N/A N/A N/A mp4 This dataset contains 130 sessions of approximately 1-minute videos of groups of 3-10 people (52 sessions with 3-5 people, 41 with 6-8 people, 37 with 9-10 people) walking towards and through entrances in one location in Exeter, UK (2 camera views: front, top; 20 sessions) and 2 locations in Cavite, Philippines (3 camera views each: front, top, side for location 1, 85 sessions; front, top, top2 for location 2, 25 sessions).
2.85 hours of video footage.
Varying scenes (e.g. weather conditions, time of day) and participants’ appearance (e.g. wearing masks, hat, glasses, clothes) and actions (e.g. looking at phone, talking, bowing head).
No annotation.
2048p resolution, 30 fps, synchronized camera streams.
Location entrance human body movement videos
Dataset Text Malayalam (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 19,000 words Add Dataset to Quote mal_IND_PHON Appen Global Pronunciation Dictionary Malayalam India N/A N/A N/A N/A 19,000 N/A text Malayalam (India) Pronunciation Dictionary
Dataset Text Malaysian (Malaysia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 26,000 words Add Dataset to Quote msa_MYS_PHON Appen Global Pronunciation Dictionary Malaysian Malaysia N/A N/A N/A N/A 26,000 N/A text Malaysian (Malaysia) Pronunciation Dictionary
Dataset Text Mandarin (Simplified) (China) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 35,000 words Add Dataset to Quote zho_CHN_PHON Appen Global Pronunciation Dictionary Mandarin (Simplified) China N/A N/A N/A N/A 35,000 N/A text Mandarin (Simplified) (China) Pronunciation Dictionary
Dataset Text Mandarin (Traditional) (Taiwan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 50,000 words Add Dataset to Quote zho_TWN_PHON Appen Global Pronunciation Dictionary Mandarin (Traditional) Taiwan N/A N/A N/A N/A 50,000 N/A text Mandarin (Traditional) (Taiwan) Pronunciation Dictionary
Dataset Audio Mandarin Chinese (China) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 26 hours Add Dataset to Quote MAC_ASR002 GlobalPhone Scripted Speech Mandarin Chinese China Mixed (quiet home/office, public, outdoor) 132 1 10,225 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Mandarin Chinese (China) scripted microphone
Dataset Audio Mandarin Chinese (China) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 323 hours Add Dataset to Quote MAC_ASR001 Appen Global Scripted Speech Mandarin Chinese China Mixed 2,000 1 200,000 7,145 8 alaw Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words
98 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words
100% mobile
Mandarin Chinese (China) scripted telephony
Dataset Text Mandarin Chinese Inverse text normalisation Common Use Cases: ASR, Language Modelling, Closed Captioning Recording Device: N/A Unit: 4230 test cases Add Dataset to Quote CMN_ITN001 Appen Global Inverse text normalisation Mandarin Chinese N/A N/A N/A N/A N/A N/A N/A text Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers Mandarin Chinese Inverse text normalisation
Dataset Text Mandarin NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 17,313 sentences Add Dataset to Quote MAC_NER001 Appen Global News NER Mandarin Chinese China N/A N/A N/A 17,313 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Mandarin NER news text
Dataset Audio Marathi (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 15 hours Add Dataset to Quote MAR_ASR001 Appen Global Conversational Speech Marathi India Mixed 180 2 Available on request 11,908 8 alaw Approx. 29% of the dataset sessions are transcribed and time stamped – full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
17% Hands-Free car, 16% Landline quiet, 19% Mobile quiet, 16% Moving vehicle, 16% Public place, 17% Roadside
Marathi (India) conversational telephony
Dataset Audio Marathi (India) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 52 hours Add Dataset to Quote MAR_ASR001A Appen Global Conversational Speech Marathi India Mixed 1,000 2 Available on request 11,908 8 alaw Portion of the dataset sessions are transcribed and time stamped – full transcripts can be made available
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
16% landline, 84% mobile
Marathi (India) conversational telephony
Dataset Text Marathi (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 30,000 words Add Dataset to Quote mar_IND_PHON Appen Global Pronunciation Dictionary Marathi India N/A N/A N/A N/A 30,000 N/A text Marathi (India) Pronunciation Dictionary
Dataset Location Data Mobile Location Data Common Use Cases: AI Platforms, Advertising and Marketing, Business Intelligence, Financial Modeling, FMCG, Footfall and Attribution, Healthcare, Human Mobility Insights, Location Analytics, OOH and DOOH, Retail Planning and Site Selection, Retail, Research and Academia, Smart Cities and Urban Planning, Supply Chain, Travel and Tourism, Transportation Planning and Logistics Recording Device: Mobile device Unit: 20 billion+ location events per day Add Dataset to Quote LOCATION_MOBILE_GLOBAL Quadrant Mobile GPS Location Data N/A Global coverage N/A N/A N/A N/A N/A N/A CSV, XLS, JSON, Parquet
For enquiries relating to this data, please contact:
Tim Solt, Quadrant VP Sales
tim@quadrant.io
Book a meeting: https://meetings.hubspot.com/tim321

Quadrant (an Appen Company) is a global leader in the compilation and delivery of compliant, mobile (GPS) location data. Our global location data panel is of the highest authenticity and quality, allowing you to easily integrate and perform location-based activities to support your business initiatives and solve a myriad of real-world problems.

The Quadrant location data panel contains 16 core metadata attributes, including all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and non-standard attributes such as Geohash and H3. Our historical data spans as far back as 2021, and data can be selected specific to your requirements (e.g., geography, timeframe, delivery cadence).

Country- or region-specific requests can be accommodated.

Please book a meeting to discuss your requirements and obtain a sample dataset to evaluate for your unique use case.
Mobile Location Data
Dataset Text Mongolian (Mongolia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 32,000 words Add Dataset to Quote mon_MNG_PHON Appen Global Pronunciation Dictionary Mongolian Mongolia N/A N/A N/A N/A 32,000 N/A text Mongolian (Mongolia) Pronunciation Dictionary
Dataset Text Norwegian (Norway) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 3,000 words Add Dataset to Quote nor_NOR_POS Appen Global Part of Speech Dictionary Norwegian Norway N/A N/A N/A N/A 3,000 N/A text Norwegian (Norway) Part of Speech Dictionary
Dataset Text Norwegian (Norway) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 117,000 words Add Dataset to Quote nor_NOR_PHON Appen Global Pronunciation Dictionary Norwegian Norway N/A N/A N/A N/A 117,000 N/A text Norwegian (Norway) Pronunciation Dictionary
Dataset Image Object Image Collection **text descriptions in development** Common Use Cases: Image label recognition training, Accessibility, LLM image generation Recording Device: Mobile phone and camera Unit: 2000 images Add Dataset to Quote IMG_TAG_CN Appen China Image recognition N/A N/A Mixed lighting conditions N/A N/A N/A N/A N/A jpg Multi-scene picture sample library of approximately 2000 images. English text descriptions in development. Categories: Airport: 65; Beach: 95; Car: 50; Clothing store: 53; Crowd: 67; Department store: 56; Desert: 73; Electrical equipment: 55; Gym: 47; Handbag: 35; KTV: 50; Market: 55; Mountain area: 54; Museum: 63; Night view: 132; Office: 100; Pet: 82; Playground: 94; Restaurant: 54; Sandbeach: 68; Scenic spot: 77; Sea: 191; Ship: 50; Sky: 102; Snow Mountain: 53; Snow scene: 71; Sports equipment: 54; Store: 34; Tree: 85; Window scenery: 62; Zoo: 70 Object Image Collection **text descriptions in development**
Dataset Video Object videos **in development** Common Use Cases: Movement detection, Action Classification Recording Device: Camera Unit: 5500 videos Add Dataset to Quote VID_OBJECT_US Appen Global Movement recognition N/A United States Mixed lighting conditions Available upon request 1 N/A N/A N/A mp4, mov Approximately 6 hours of videos of various objects under different angles, distances and lighting conditions. Contributors selected from a list of ~150 everyday objects (e.g. dog, kettle, desk).
Data has been collected and QA is underway; the dataset is expected to be ready in Q1 2025. It can be prioritized upon request.
Object videos **in development**
Dataset Text Oriya (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 19,000 words Add Dataset to Quote ori_IND_PHON Appen Global Pronunciation Dictionary Oriya India N/A N/A N/A N/A 19,000 N/A text Oriya (India) Pronunciation Dictionary
Dataset Audio Panjabi (Pakistan) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 20 hours Add Dataset to Quote PAP_ASR001 Appen Global Conversational Speech Panjabi Pakistan Low background noise 205 2 Available on request 7,298 8 alaw Dataset is fully transcribed and time-stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For 71% of calls, both speakers (in-line/out-line) were collected and transcribed; for the remaining 29% of calls, only one half of the conversation was collected and transcribed
20% landline, 80% mobile
Panjabi (Pakistan) conversational telephony
Dataset Audio Pashto (Afghanistan) broadcast Common Use Cases: ASR, Automatic Captioning, Keyword Spotting Recording Device: N/A Unit: 51 hours Add Dataset to Quote PAS_BRC001 Appen Global Broadcast Speech Northern Pashto – Southern Pashto Afghanistan Low background noise (studio) N/A 1 Available on request Available on request 32 – 44 wav Dataset is fully transcribed and timestamped
Pronunciation lexicon not currently available but can be developed upon request
Dataset is largely speech only and does not include music or advertisements
Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors)
Pashto (Afghanistan) broadcast
Dataset Audio Pashto (Afghanistan) conversational microphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Microphone Unit: 39 hours Add Dataset to Quote PAS_ASR002 Appen Global Conversational Speech Northern Pashto – Southern Pashto Afghanistan Low background noise 40 2 34,860 9,480 16 wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
A full translation of the transcripts into French is also available as an optional additional purchase
Average length of calls: 120 mins; one speaker acts as the interviewer and the other as the interviewee, in scenarios similar to the TransTAC style (e.g. civil affairs, checkpoints)
The interviewer appears in more than one set of dialogues but the interviewee is unique for each set
Pashto (Afghanistan) conversational microphone
Dataset Audio Pashto (Afghanistan) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 55 hours Add Dataset to Quote PAS_ASR001 Appen Global Conversational Speech Northern Pashto – Southern Pashto Afghanistan Low background noise 967 2 Available on request 13,633 8 wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed
25% landline, 75% mobile
Pashto (Afghanistan) conversational telephony
Dataset Text Pashto (Afghanistan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 64,000 words Add Dataset to Quote pus_AFG_PHON Appen Global Pronunciation Dictionary Pashto Afghanistan N/A N/A N/A N/A 64,000 N/A text Pashto (Afghanistan) Pronunciation Dictionary
Dataset Text Polish (Poland) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 4,000 words Add Dataset to Quote pol_POL_POS Appen Global Part of Speech Dictionary Polish Poland N/A N/A N/A N/A 4,000 N/A text Polish (Poland) Part of Speech Dictionary
Dataset Text Polish (Poland) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 42,000 words Add Dataset to Quote pol_POL_PHON Appen Global Pronunciation Dictionary Polish Poland N/A N/A N/A N/A 42,000 N/A text Polish (Poland) Pronunciation Dictionary
Dataset Audio Polish (Poland) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 25 hours Add Dataset to Quote POL_ASR001 GlobalPhone Scripted Speech Polish Poland Mixed (quiet home/office, public, outdoor) 99 1 10,130 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Polish (Poland) scripted microphone
Dataset Audio Polish (Poland) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Mobile phone Unit: 293 hours Add Dataset to Quote POL_ASR002_CN Appen China Scripted Speech Polish Poland Low background noise (home/office) 353 1 106,674 168,544 16 wav Dataset contains audio with corresponding text prompts Polish (Poland) scripted smartphone
Dataset Audio Polish (Poland) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 78 hours Add Dataset to Quote Polish SpeechDat(E) Database Nuance Scripted Speech Polish Poland Low background noise 1,000 1 48,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Polish (Poland) scripted telephony
Dataset Audio Portuguese (Brazil) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 33 hours Add Dataset to Quote PTB_ASR002 Appen Global Conversational Speech Portuguese Brazil Low background noise 200 2 33,837 11,287 8 alaw or wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
63% landline, 38% mobile
Portuguese (Brazil) conversational telephony
Dataset Audio Portuguese (Brazil) microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 26 hours Add Dataset to Quote PTB_ASR001 GlobalPhone Scripted Speech Portuguese Brazil Mixed (quiet home/office, public, outdoor) 102 1 10,417 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Portuguese (Brazil) microphone
Dataset Text Portuguese (Brazil) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 98,000 words Add Dataset to Quote por_BRA_POS Appen Global Part of Speech Dictionary Portuguese Brazil N/A N/A N/A N/A 98,000 N/A text Portuguese (Brazil) Part of Speech Dictionary
Dataset Text Portuguese (Brazil) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 102,000 words Add Dataset to Quote por_BRA_PHON Appen Global Pronunciation Dictionary Portuguese Brazil N/A N/A N/A N/A 102,000 N/A text Portuguese (Brazil) Pronunciation Dictionary
Dataset Audio Portuguese (Portugal) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 36 hours Add Dataset to Quote PTP_ASR001 Appen Global Conversational Speech Portuguese Portugal Low background noise 200 2 36,586 16,339 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
Portuguese (Portugal) conversational telephony
Dataset Text Portuguese (Portugal) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 60,000 words Add Dataset to Quote por_PRT_POS Appen Global Part of Speech Dictionary Portuguese Portugal N/A N/A N/A N/A 60,000 N/A text Portuguese (Portugal) Part of Speech Dictionary
Dataset Text Portuguese (Portugal) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 112,000 words Add Dataset to Quote por_PRT_PHON Appen Global Pronunciation Dictionary Portuguese Portugal N/A N/A N/A N/A 112,000 N/A text Portuguese (Portugal) Pronunciation Dictionary
Dataset Audio Romanian (Romania) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 37 hours Add Dataset to Quote ROM_ASR001 Appen Global Conversational Speech Romanian Romania Low background noise 200 2 Available on request 16,658 8 alaw Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
50% landline, 50% mobile
Conversations cover a range of topics including: Leisure, Work and Sport.
Romanian (Romania) conversational telephony
Dataset Text Romanian (Romania) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 16,000 words Add Dataset to Quote ron_ROU_PHON Appen Global Pronunciation Dictionary Romanian Romania N/A N/A N/A N/A 16,000 N/A text Romanian (Romania) Pronunciation Dictionary
Dataset Audio Russian (Russia) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 37 hours Add Dataset to Quote RUS_ASR001 Appen Global Conversational Speech Russian Russia Low background noise 200 2 Available on request 28,284 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
50% landline, 50% mobile
Russian (Russia) conversational telephony
Dataset Text Russian (Russia) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 100,000 words Add Dataset to Quote rus_RUS_POS Appen Global Part of Speech Dictionary Russian Russia N/A N/A N/A N/A 100,000 N/A text Russian (Russia) Part of Speech Dictionary
Dataset Text Russian (Russia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 120,000 words Add Dataset to Quote rus_RUS_PHON Appen Global Pronunciation Dictionary Russian Russia N/A N/A N/A N/A 120,000 N/A text Russian (Russia) Pronunciation Dictionary
Dataset Audio Russian (Russia) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 31 hours Add Dataset to Quote RUS_ASR002 GlobalPhone Scripted Speech Russian Russia Mixed (quiet home/office, public, outdoor) 115 1 12,205 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Russian (Russia) scripted microphone
Dataset Audio Russian (Russia) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 46 hours Add Dataset to Quote Speecon Russian Database Nuance Scripted Speech Russian Russia Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Russian (Russia) scripted microphone
Dataset Audio Russian (Russia) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 180 hours Add Dataset to Quote Russian SpeechDat(E) Database Nuance Scripted Speech Russian Russia Low background noise 2,500 1 112,000 Available on request 8 alaw Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Russian (Russia) scripted telephony
Dataset Audio Russian + German Female TTS Common Use Cases: TTS Recording Device: microphone Unit: 2.32 hours Add Dataset to Quote ED_TTS001_CN Appen China TTS Scripted Speech Russian/German Russia/Germany Low background noise (studio) 1 1 Available upon request Available upon request 48 wav Audio with transcription. Female voice talent recorded in a professional studio on a Neumann U87 microphone; SNR of 40-50dB Russian + German Female TTS
Dataset Text Russian NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 29,888 sentences Add Dataset to Quote RUS_NER001 Appen Global News NER Russian Russia N/A N/A N/A 29,888 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Russian NER news text
Dataset Image, Video Selfie image and video collection Common Use Cases: Facial recognition, Human Body Movement recognition Recording Device: Camera Unit: 1400 sessions Add Dataset to Quote IMG_VID_SELFIE_US Appen Global Human Face N/A United States Mixed lighting conditions Available upon request 1 N/A N/A N/A jpg, mp4, mov Participants took a short video and picture of themselves following a prompt making various facial expressions under different conditions, e.g. “while blinking”, “while wearing a scarf” Selfie image and video collection
Dataset Text Serbian (Serbia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 29,000 words Add Dataset to Quote srp_SRB_PHON Appen Global Pronunciation Dictionary Serbian Serbia N/A N/A N/A N/A 29,000 N/A text Serbian (Serbia) Pronunciation Dictionary
Dataset Audio Shanghai dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Recording pen/microphone Unit: 21 hours Add Dataset to Quote SHANGHAI_ASR001_CN Appen China Conversational Speech Shanghai dialect China Low background noise 51 1 16 wav Audio only, transcription in development for Q1 2025
Audio recordings cover the following districts: Shanghai Huangpu District, Xuhui District, Changning District, Jing'an District, Putuo District, Hongkou District, Yangpu District, Pudong New Area
Shanghai suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Shanghai dialect (China) Conversational Speech
Dataset Audio Shanghai dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 4.5 hours Add Dataset to Quote SHANGHAI_ASR002_CN Appen China Conversational Speech Shanghai dialect China Low background noise 14 1 8 wav Audio only, transcription in development for Q1 2025
Audio recordings cover the following districts: Shanghai Huangpu District, Xuhui District, Changning District, Jing'an District, Putuo District, Hongkou District, Yangpu District, Pudong New Area
Shanghai suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Shanghai dialect (China) Conversational Speech
Dataset Audio Slovak (Slovakia) scripted telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 65 hours Add Dataset to Quote Slovak SpeechDat(E) Database Nuance Scripted Speech Slovak Slovakia Low background noise 1,000 1 48,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Slovak (Slovakia) scripted telephony
Dataset Text Slovenian (Slovenia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 28,000 words Add Dataset to Quote slv_SVN_PHON Appen Global Pronunciation Dictionary Slovenian Slovenia N/A N/A N/A N/A 28,000 N/A text Slovenian (Slovenia) Pronunciation Dictionary
Dataset Audio Slovenian (Slovenia) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Landline only Unit: 76 hours Add Dataset to Quote Slovenian SpeechDat(II) FDB-1000 Nuance Scripted Speech Slovenian Slovenia Low background noise (home/office) 1,000 1 40,000 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
Approximately 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Slovenian (Slovenia) telephony
Dataset Audio Somali (Somalia) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 50 hours Add Dataset to Quote SOM_ASR001 Appen Global Conversational Speech Somali Somalia Low background noise 1,000 2 Available on request 23,217 8 alaw Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
1% landline, 99% mobile
Somali (Somalia) conversational telephony
Dataset Text Somali (Somalia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 76,000 words Add Dataset to Quote som_SOM_PHON Appen Global Pronunciation Dictionary Somali Somalia N/A N/A N/A N/A 76,000 N/A text Somali (Somalia) Pronunciation Dictionary
Dataset Text Sorani (Iraq) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 26,000 words Add Dataset to Quote kur_IRQ_PHON Appen Global Pronunciation Dictionary Sorani Iraq N/A N/A N/A N/A 26,000 N/A text Sorani (Iraq) Pronunciation Dictionary
Dataset Audio Sorani (Kurdish) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 5 hours Add Dataset to Quote SOR_ASR001 Appen Global Conversational Speech Central Kurdish (Iran) Iran Low background noise 170 2 Available on request 7,924 8 alaw or wav Dataset is fully transcribed and time stamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
For a large proportion of calls, only one half of the conversation was collected and transcribed
Sorani (Kurdish) conversational telephony
Dataset Text Spanish (Argentina) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 15,000 words Add Dataset to Quote spa_ARG_PHON Appen Global Pronunciation Dictionary Spanish Argentina N/A N/A N/A N/A 15,000 N/A text Spanish (Argentina) Pronunciation Dictionary
Dataset Text Spanish (Chile) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 15,000 words Add Dataset to Quote spa_CHL_PHON Appen Global Pronunciation Dictionary Spanish Chile N/A N/A N/A N/A 15,000 N/A text Spanish (Chile) Pronunciation Dictionary
Dataset Text Spanish (Colombia) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 15,000 words Add Dataset to Quote spa_COL_PHON Appen Global Pronunciation Dictionary Spanish Colombia N/A N/A N/A N/A 15,000 N/A text Spanish (Colombia) Pronunciation Dictionary
Dataset Audio Spanish (Latin America – Chile and Colombia) conversational telephony Common Use Cases: ASR, Call Centre, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 22 hours Add Dataset to Quote ESL_ASR002 Appen Global Conversational Speech Spanish Chile-Colombia Mixed 84 2 22,098 Available on request 8 wav Dataset is fully transcribed and time-stamped
Pronunciation lexicon not currently available but can be developed upon request
Call Centre style conversations (by 64 customers, 14 agents) in banking and telco domains, primarily using mobile phone
Spanish (Latin America – Chile and Colombia) conversational telephony
Dataset Audio Spanish (Latin America) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 17 hours Add Dataset to Quote ESL_ASR001 GlobalPhone Scripted Speech Spanish Costa Rica Mixed (quiet home/office, public, outdoor) 100 1 6,898 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Spanish (Latin America) scripted microphone
Dataset Text Spanish (Peru) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 15,000 words Add Dataset to Quote spa_PER_PHON Appen Global Pronunciation Dictionary Spanish Peru N/A N/A N/A N/A 15,000 N/A text Spanish (Peru) Pronunciation Dictionary
Dataset Audio Spanish (Spain) conversational smartphone Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 223 hours Add Dataset to Quote ESP_ASR003 Appen Global Conversational Speech Spanish Spain Mixed (home, car, public place, outdoor) 414 1 Available on request Available on request 48 wav Dataset is fully transcribed and time stamped
Two-person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work.
Each speaker participates in up to 12 conversations that are 5-15 minutes long.
Pronunciation lexicon not currently available but can be developed upon request
Spanish (Spain) conversational smartphone
Dataset Text Spanish (Spain) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 100,000 words Add Dataset to Quote spa_ESP_PHON Appen Global Pronunciation Dictionary Spanish Spain N/A N/A N/A N/A 100,000 N/A text Spanish (Spain) Pronunciation Dictionary
Dataset Audio Spanish (Spain) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 39 hours Add Dataset to Quote ESP_ASR001 Appen Global Scripted Speech Spanish Spain Mixed 200 4 40,000 6,367 22 raw PCM Fully transcribed to SpeechDAT type conventions
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences
Spanish (Spain) scripted microphone
Dataset Audio Spanish (Spain) scripted microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 46 hours Add Dataset to Quote Speecon Spanish Database Nuance Scripted Speech Spanish Spain Mixed (office, entertainment, car, public place) 600 (550 adult speakers and 50 child speakers) 4 170,000 Available on request 16 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers
Spanish (Spain) scripted microphone
Dataset Audio Spanish (Spain) TTS male scripted microphone Common Use Cases: TTS Recording Device: Microphone Unit: 1 hour Add Dataset to Quote ESP_TTS001 Appen Global TTS Scripted Speech Spanish Spain Low background noise (studio) 1 1 1,787 3,614 22 wav Dataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset
1,787 prompts per speaker including phonetically rich sentences
Spanish (Spain) TTS male scripted microphone
Dataset Text Spanish (United States) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 90,000 words Add Dataset to Quote spa_USA_PHON Appen Global Pronunciation Dictionary Spanish United States N/A N/A N/A N/A 90,000 N/A text Spanish (United States) Pronunciation Dictionary
Dataset Text Spanish (Venezuela) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 15,000 words Add Dataset to Quote spa_VEN_PHON Appen Global Pronunciation Dictionary Spanish Venezuela N/A N/A N/A N/A 15,000 N/A text Spanish (Venezuela) Pronunciation Dictionary
Dataset Text Swahili (Kenya) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 66,000 words Add Dataset to Quote swa_KEN_PHON Appen Global Pronunciation Dictionary Swahili Kenya N/A N/A N/A N/A 66,000 N/A text Swahili (Kenya) Pronunciation Dictionary
Dataset Text Swedish (Sweden) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 105,000 words Add Dataset to Quote swe_SWE_POS Appen Global Part of Speech Dictionary Swedish Sweden N/A N/A N/A N/A 105,000 N/A text Swedish (Sweden) Part of Speech Dictionary
Dataset Text Swedish (Sweden) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 105,000 words Add Dataset to Quote swe_SWE_PHON Appen Global Pronunciation Dictionary Swedish Sweden N/A N/A N/A N/A 105,000 N/A text Swedish (Sweden) Pronunciation Dictionary
Dataset Audio Swedish (Sweden/ Finland) microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 30 hours Add Dataset to Quote SWE_ASR001 GlobalPhone Scripted Speech Swedish Sweden – Finland Mixed (quiet home/office, public, outdoor) 98 1 11,816 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Swedish (Sweden/ Finland) microphone
Dataset Text Sylheti (Bangladesh – India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 22,000 words Add Dataset to Quote syl_BGD_PHON Appen Global Pronunciation Dictionary Sylheti Bangladesh – India N/A N/A N/A N/A 22,000 N/A text Sylheti (Bangladesh – India) Pronunciation Dictionary
Dataset Text Tagalog (Philippines) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 34,000 words Add Dataset to Quote tgl_PHL_PHON Appen Global Pronunciation Dictionary Tagalog Philippines N/A N/A N/A N/A 34,000 N/A text Tagalog (Philippines) Pronunciation Dictionary
Dataset Text Tamil (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 106,000 words Add Dataset to Quote tam_IND_PHON Appen Global Pronunciation Dictionary Tamil India N/A N/A N/A N/A 106,000 N/A text Tamil (India) Pronunciation Dictionary
Dataset Text Telugu (India) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 51,000 words Add Dataset to Quote tel_IND_PHON Appen Global Pronunciation Dictionary Telugu India N/A N/A N/A N/A 51,000 N/A text Telugu (India) Pronunciation Dictionary
Dataset Audio Thai (Thailand) microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 28 hours Add Dataset to Quote THA_ASR001 GlobalPhone Scripted Speech Thai Thailand Mixed (quiet home/office, public, outdoor) 98 1 14,039 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Thai (Thailand) microphone
Dataset Image Thai (Thailand) printed text OCR Common Use Cases: Document Processing, Document Search, Text detection Recording Device: Camera Unit: 1219 images Add Dataset to Quote IMG_OCR_THA_CN Appen China Document OCR Thai Thailand Mixed lighting conditions 10 N/A N/A N/A N/A jpg Images containing text, such as shopping receipts, tickets, invoices and taxi slips. Thai (Thailand) printed text OCR
Dataset Text Thai (Thailand) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 30,000 words Add Dataset to Quote tha_THA_PHON Appen Global Pronunciation Dictionary Thai Thailand N/A N/A N/A N/A 30,000 N/A text Thai (Thailand) Pronunciation Dictionary
Dataset Text Tok Pisin (Papua New Guinea) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 8,000 words Add Dataset to Quote tpi_PNG_PHON Appen Global Pronunciation Dictionary Tok Pisin Papua New Guinea N/A N/A N/A N/A 8,000 N/A text Tok Pisin (Papua New Guinea) Pronunciation Dictionary
Dataset Audio Turkish (Turkey) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 41 hours Add Dataset to Quote TUR_ASR001 Appen Global Conversational Speech Turkish Turkey Low background noise 200 2 Available on request 32,386 8 alaw or wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
200 telephony conversations were recorded for this project: 100 speakers made 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers
48% landline, 52% mobile
Turkish (Turkey) conversational telephony
Dataset Audio Turkish (Turkey) microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 17 hours Add Dataset to Quote TUR_ASR002 GlobalPhone Scripted Speech Turkish Turkey Mixed (quiet home/office, public, outdoor) 100 1 6,950 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Turkish (Turkey) microphone
Dataset Text Turkish (Turkey) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 257,000 words Add Dataset to Quote tur_TUR_POS Appen Global Part of Speech Dictionary Turkish Turkey N/A N/A N/A N/A 257,000 N/A text Turkish (Turkey) Part of Speech Dictionary
Dataset Text Turkish (Turkey) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 255,000 words Add Dataset to Quote tur_TUR_PHON Appen Global Pronunciation Dictionary Turkish Turkey N/A N/A N/A N/A 255,000 N/A text Turkish (Turkey) Pronunciation Dictionary
Dataset Audio Turkish (Turkey) scripted smartphone Common Use Cases: ASR, Virtual Assistant, Speech Analytics Recording Device: Mobile phone Unit: 738 hours Add Dataset to Quote TUR_ASR003_CN Appen China Scripted Speech Turkish Turkey Low background noise (home/office) 664 1 N/A N/A 16 wav Audio with corresponding text prompts. Participants recorded on mobile phone reading aloud about 40 sentence prompts each. Turkish (Turkey) scripted smartphone
Dataset Audio Turkish (Turkey) telephony Common Use Cases: ASR, Virtual Assistant Recording Device: Mobile phone and landline Unit: 118 hours Add Dataset to Quote OrienTel Turkish Database Nuance Scripted Speech Turkish Turkey Low background noise 1,700 1 76,500 Available on request 8 Available on request Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report
45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words
Turkish (Turkey) telephony
Dataset Text Ukrainian (Ukraine) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 6,000 words Add Dataset to Quote ukr_UKR_PHON Appen Global Pronunciation Dictionary Ukrainian Ukraine N/A N/A N/A N/A 6,000 N/A text Ukrainian (Ukraine) Pronunciation Dictionary
Dataset Location Data United States Mobile Location Data Common Use Cases: AI Platforms, Advertising and Marketing, Business Intelligence, Financial Modeling, FMCG, Footfall and Attribution, Healthcare, Human Mobility Insights, Location Analytics, OOH and DOOH, Retail Planning and Site Selection, Retail, Research and Academia, Smart Cities and Urban Planning, Supply Chain, Travel and Tourism, Transportation Planning and Logistics Recording Device: Mobile device Unit: 5 billion+ location events per day Add Dataset to Quote LOCATION_MOBILE_US Quadrant Mobile GPS Location Data N/A United States N/A N/A N/A N/A N/A N/A CSV, XLS, JSON, Parquet
For enquiries relating to this data, please contact:
Tim Solt, Quadrant VP Sales
tim@quadrant.io
Book a meeting: https://meetings.hubspot.com/tim321

Quadrant (an Appen Company) is a global leader in the compilation and delivery of compliant, mobile (GPS) location data. Our location data panel is of the highest authenticity and quality, allowing you to easily integrate and perform location-based activities to support your business initiatives and solve a myriad of real-world problems.

The Quadrant location data panel contains 16 core metadata attributes, including all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and non-standard attributes such as Geohash and H3. Our historical data spans as far back as 2021, and data can be selected specific to your requirements (e.g., geography, timeframe, delivery cadence).

State or Market specific requests can be accommodated.

Please book a meeting to discuss your requirements and obtain a sample dataset to evaluate for your unique use case.
United States Mobile Location Data
Dataset Audio Urdu (India/ Pakistan) conversational telephony Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone and landline Unit: 47 hours Add Dataset to Quote URD_ASR001 Appen Global Conversational Speech Urdu India – Pakistan Mixed 1,000 2 174,666 10,871 8 wav Dataset is fully transcribed and timestamped
Dataset is accompanied by a pronunciation lexicon containing all transcribed words
Environments: 9% Hands-free car, 7% Landline quiet, 34% mobile quiet, 29% public place, 16% roadside
Urdu (India/ Pakistan) conversational telephony
Dataset Text Urdu (Pakistan) Part of Speech Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 12,000 words Add Dataset to Quote urd_PAK_POS Appen Global Part of Speech Dictionary Urdu Pakistan N/A N/A N/A N/A 12,000 N/A text Urdu (Pakistan) Part of Speech Dictionary
Dataset Text Urdu (Pakistan) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 21,000 words Add Dataset to Quote urd_PAK_PHON Appen Global Pronunciation Dictionary Urdu Pakistan N/A N/A N/A N/A 21,000 N/A text Urdu (Pakistan) Pronunciation Dictionary
Dataset Text Urdu NER news text Common Use Cases: NER, Content Classification, Search Engines Recording Device: N/A Unit: 20,634 sentences Add Dataset to Quote URD_NER001 Appen Global News NER Urdu Pakistan N/A N/A N/A 20,634 Available on request N/A text News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity Urdu NER news text
Dataset Image Vehicle tail light images Common Use Cases: Image label recognition training Recording Device: Mobile phone Unit: 30793 images Add Dataset to Quote IMG_WD_CN Appen China Image recognition N/A N/A Mixed lighting conditions N/A N/A N/A N/A N/A jpg Images of vehicle tail lights, with right turn signal on (55%), left turn signal on (22%), both lights on (23%).
License plates have been redacted.
Vehicle tail light images
Dataset Audio Vietnamese (Vietnam) microphone Common Use Cases: ASR, Virtual Assistant, Chatbot Recording Device: Microphone Unit: 19 hours Add Dataset to Quote VIE_ASR001 GlobalPhone Scripted Speech Vietnamese Vietnam Mixed (quiet home/office, public, outdoor) 129 1 18,842 Available on request 16 wav Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus
Dataset is fully transcribed and the transcription is available both in original script and in Romanized form
Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary
Developed in collaboration with the Karlsruhe Institute of Technology (KIT)
Vietnamese (Vietnam) microphone
Dataset Text Vietnamese (Vietnam) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 8,000 words Add Dataset to Quote vie_VNM_PHON Appen Global Pronunciation Dictionary Vietnamese Vietnam N/A N/A N/A N/A 8,000 N/A text Vietnamese (Vietnam) Pronunciation Dictionary
Dataset Text Wu (China) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 11,000 words Add Dataset to Quote wuu_CHN_PHON Appen Global Pronunciation Dictionary Wu China N/A N/A N/A N/A 11,000 N/A text Wu (China) Pronunciation Dictionary
Dataset Audio Wuhan dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Recording pen/microphone Unit: 44.71 hours Add Dataset to Quote WUHAN_ASR001_CN Appen China Conversational Speech Wuhan dialect China Low background noise 135 1 16 wav Audio only; transcription in development for Q1 2025
Audio recordings cover 5 districts of Wuhan: Jiang'an, Jianghan, Qiaokou, Hanyang and Wuchang
Northeast suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Wuhan dialect (China) Conversational Speech
Dataset Audio Wuhan dialect (China) Conversational Speech Common Use Cases: ASR, Conversational AI, Speech Analytics Recording Device: Mobile phone Unit: 58.6 hours Add Dataset to Quote WUHAN_ASR002_CN Appen China Conversational Speech Wuhan dialect China Low background noise 180 1 8 wav Audio only; transcription in development for Q1 2025
Audio recordings cover 5 districts of Wuhan: Jiang'an, Jianghan, Qiaokou, Hanyang and Wuchang
Northeast suburb accents not included, and no minors were recorded.
Each recording session contains 20-30 minutes of free dialogue between 2-5 people.
Sensitive data and personal information have been scrubbed.
Wuhan dialect (China) Conversational Speech
Dataset Text Xiang (China) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 12,000 words Add Dataset to Quote hsn_CHN_PHON Appen Global Pronunciation Dictionary Xiang China N/A N/A N/A N/A 12,000 N/A text Xiang (China) Pronunciation Dictionary
Dataset Text Zulu (South Africa) Pronunciation Dictionary Common Use Cases: ASR, TTS, Language Modelling Recording Device: N/A Unit: 77,000 words Add Dataset to Quote zul_ZAF_PHON Appen Global Pronunciation Dictionary Zulu South Africa N/A N/A N/A N/A 77,000 N/A text Zulu (South Africa) Pronunciation Dictionary