Dataset Text | Albanian (Albania) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | sqi_ALB_PHON | Appen Global | Pronunciation Dictionary | Albanian | Albania | N/A | N/A | N/A | N/A | 12,000 | N/A | text | Albanian (Albania) Pronunciation Dictionary | ||
Dataset Text | Amharic (Ethiopia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 49,000 words | Add Dataset to Quote | amh_ETH_PHON | Appen Global | Pronunciation Dictionary | Amharic | Ethiopia | N/A | N/A | N/A | N/A | 49,000 | N/A | text | Amharic (Ethiopia) Pronunciation Dictionary | ||
Dataset Text | Arabic (Algeria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 11,000 words | Add Dataset to Quote | ara_DZA_PHON | Appen Global | Pronunciation Dictionary | Arabic | Algeria | N/A | N/A | N/A | N/A | 11,000 | N/A | text | Arabic (Algeria) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Eastern Algeria) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 29 hours | Add Dataset to Quote | EAR_ASR001 | Appen Global | Conversational Speech | Arabic | Algeria | Low background noise (home/office) | 496 | 2 | 32,899 | 15,314 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed. 8% landline, 92% mobile. | Arabic (Eastern Algeria) conversational telephony | |
Dataset Text | Arabic (Egypt) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | ara_EGY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Egypt | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Arabic (Egypt) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Egypt) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 352 hours | Add Dataset to Quote | ARE_ASR001_CN | Appen China | Scripted Speech | Arabic | Egypt | Low background noise (home/office) | 627 | 1 | 128,908 | 207,576 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. | Arabic (Egypt) scripted smartphone | |
Dataset Text | Arabic (Iraq) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 13,000 words | Add Dataset to Quote | ara_IRQ_POS | Appen Global | Part of Speech Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 13,000 | N/A | text | Arabic (Iraq) Part of Speech Dictionary | ||
Dataset Text | Arabic (Iraq) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | ara_IRQ_PHON | Appen Global | Pronunciation Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Person names | Arabic (Iraq) Pronunciation Dictionary | |
Dataset Text | Arabic (Libya) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 48,000 words | Add Dataset to Quote | ara_LBY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Libya | N/A | N/A | N/A | N/A | 48,000 | N/A | text | Arabic (Libya) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Modern Standard Arabic) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 12 hours | Add Dataset to Quote | MSA_ASR001 | GlobalPhone | Scripted Speech | Arabic | Tunisia | Mixed (quiet home/office, public, outdoor) | 78 | 1 | 4,908 | 40,000 | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Arabic (Modern Standard Arabic) scripted microphone | |
Dataset Audio | Arabic (Morocco) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 33 hours | Add Dataset to Quote | ARY_ASR001 | Appen Global | Conversational Speech | Arabic | Morocco | Low background noise | 180 | 2 | 80,430 | 23,836 | 8 | alaw | Each speaker participated in 1 to 4 conversations. Speakers are identified by a unique 4-digit speaker ID which is recorded in the demographic file. Transcription is available in original script and fully reversible Romanised version with accompanying pronunciation lexicon. English translation of product transcription is available (ARY_MT001, ARY_ASRMT001). | Arabic (Morocco) conversational telephony | |
Dataset Text | Arabic (Morocco) conversational telephony translation | Common Use Cases: MT, Chatbot, Conversational AI | Recording Device: N/A | Unit: 80,430 utterances | Add Dataset to Quote | ARY_MT001 | Appen Global | Conversational Translation | Arabic | Morocco | N/A | 180 | N/A | 80,430 | 23,836 | N/A | text | Corresponding audio, transcription, fully reversible romanised transcription and pronunciation lexicon data are available (ARY_ASR001, ARY_ASRMT001) | Arabic (Morocco) conversational telephony translation | |
Dataset Text | Arabic (Morocco) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | ara_MAR_PHON | Appen Global | Pronunciation Dictionary | Arabic | Morocco | N/A | N/A | N/A | N/A | 60,000 | N/A | text | Arabic (Morocco) Pronunciation Dictionary | ||
Dataset Text | Arabic (MSA) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | arb_MSA_PHON | Appen Global | Pronunciation Dictionary | Standard Arabic | N/A | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Arabic (MSA) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Saudi Arabia) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 322 hours | Add Dataset to Quote | ARS_ASR001_CN | Appen China | Scripted Speech | Arabic | Saudi Arabia | Low background noise (home/office) | 227 | 1 | 104,574 | 156,282 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. 300-1000 prompts per speaker covering general content including education, sports, entertainment, travel, culture and technology. | Arabic (Saudi Arabia) scripted smartphone | |
Dataset Text | Arabic (Sudan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 17,000 words | Add Dataset to Quote | ara_SDN_PHON | Appen Global | Pronunciation Dictionary | Arabic | Sudan | N/A | N/A | N/A | N/A | 17,000 | N/A | text | Arabic (Sudan) Pronunciation Dictionary | ||
Dataset Image | Arabic (UAE) printed text annotated OCR | Common Use Cases: Image label recognition training | Recording Device: Mobile phone | Unit: 20000 images | Add Dataset to Quote | IMG_OCR_ARU002_CN | Appen China | Document OCR | Arabic | United Arab Emirates | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | jpg + json | Images containing text, such as slogans, advertisements, maps, store names, menus, product outer packaging, indication board. Includes bounding box annotations, 50 boxes per image, with all text annotated (Arabic, non-Arabic characters, special characters, numbers) | Arabic (UAE) printed text annotated OCR | |
Dataset Text | Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 75,000 words | Add Dataset to Quote | ara_ARE_PHON | Appen Global | Pronunciation Dictionary | Arabic | United Arab Emirates (UAE) | N/A | N/A | N/A | N/A | 75,000 | N/A | text | Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary | ||
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 170 hours | Add Dataset to Quote | ARU_ASR001_CN | Appen China | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise (home/office) | 133 | 1 | 42,352 | 85,775 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. | Arabic (United Arab Emirates (UAE)) scripted smartphone | |
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 48 hours | Add Dataset to Quote | OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise | 880 | 1 | 43,000 | 22,197 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control. | Arabic (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 31 hours | Add Dataset to Quote | OrienTel United Arab Emirates MSA (Modern Standard Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise | 500 | 1 | 24,500 | 13,348 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control. | Arabic (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 86 hours | Add Dataset to Quote | CGA_ASR001 | Appen Global | Scripted Speech | Arabic | United Arab Emirates (UAE) – Saudi Arabia | Low background noise (home/office) | 150 | 4 | 42,000 | 19,245 | 16 | raw PCM | Fully transcribed with acoustic event tagging derived from the SpeechDAT conventions. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. All transcriptions fully vowelized. 280 prompts per speaker including 30 Person names (first name and family name) from a set of 15, 10 single isolated digits 0-10, 8-digit sequences (randomly generated), 200 phonetically balanced sentences, 30 x 10-word phonetically balanced word strings. | Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone | |
Dataset Text | Arabic NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 20,774 sentences | Add Dataset to Quote | ARB_NER001 | Appen Global | News NER | Standard Arabic | N/A | N/A | N/A | N/A | 20,774 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Arabic NER news text | |
Dataset Text | Assamese (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | asm_IND_PHON | Appen Global | Pronunciation Dictionary | Assamese | India | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Assamese (India) Pronunciation Dictionary | ||
Dataset Audio | Baby crying audio | Common Use Cases: Baby Monitor, Security & Other Consumer Applications | Recording Device: Mobile phone | Unit: 70 hours | Add Dataset to Quote | CRY_ASR001_CN | Appen China | Human Sound | N/A | China | Low background noise (home/office) | 566 | 1 | N/A | N/A | 16 | wav | Crying sound of babies 0-3 years old, each lasting around 2 minutes. Audio only. | Baby crying audio | |
Dataset Audio | Bahasa Indonesia conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 31 hours | Add Dataset to Quote | BAH_ASR001 | Appen Global | Conversational Speech | Indonesian | Indonesia | Low background noise | 1,002 | 2 | 30,695 | 11,480 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For a large proportion of calls, only one half of the conversation was collected and transcribed. 28% landline, 72% mobile. | Bahasa Indonesia conversational telephony | |
Dataset Image | Baking Pictures | Common Use Cases: Image recognition | Recording Device: N/A | Unit: 6000 images | Add Dataset to Quote | IMG_Bake_CN | Appen China | Image recognition | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | jpg | This dataset includes pictures of baked goods: 2000 images of bread, 2000 images of cakes, and 2000 images of cookies. Image resolution: 640px * 640px. Shooting angle: either vertically downward or slightly offset. | Baking Pictures | |
Dataset Text | Basque (Spain) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | eus_ESP_PHON | Appen Global | Pronunciation Dictionary | Basque | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Basque (Spain) Pronunciation Dictionary | ||
Dataset Audio | Bengali (Bangladesh) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 47 hours | Add Dataset to Quote | BEN_ASR001 | Appen Global | Conversational Speech | Bengali | Bangladesh | Mixed (in-car, roadside, home/office) | 1,000 | 2 | 108,923 | 17,922 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. | Bengali (Bangladesh) conversational telephony | |
Dataset Text | Bengali (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 29,000 words | Add Dataset to Quote | ben_IND_PHON | Appen Global | Pronunciation Dictionary | Bengali | India | N/A | N/A | N/A | N/A | 29,000 | N/A | text | Bengali (India) Pronunciation Dictionary | ||
Dataset Audio | Bulgarian (Bulgaria) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 38 hours | Add Dataset to Quote | BUL_ASR001 | Appen Global | Conversational Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 217 | 2 | 86,453 | 22,342 | 8 | alaw or wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 49% landline, 51% mobile. Conversations cover a range of topics including: Holiday/Leisure, Movies/TV Shows and Work. | Bulgarian (Bulgaria) conversational telephony | |
Dataset Text | Bulgarian (Bulgaria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 55,000 words | Add Dataset to Quote | bul_BGR_PHON | Appen Global | Pronunciation Dictionary | Bulgarian | Bulgaria | N/A | N/A | N/A | N/A | 55,000 | N/A | text | Bulgarian (Bulgaria) Pronunciation Dictionary | ||
Dataset Audio | Bulgarian (Bulgaria) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 22 hours | Add Dataset to Quote | BUL_ASR002 | GlobalPhone | Scripted Speech | Bulgarian | Bulgaria | Mixed (quiet home/office, public, outdoor) | 77 | 1 | 8,674 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Bulgarian (Bulgaria) scripted microphone | |
Dataset Image | Business-to-business printed text document OCR | Common Use Cases: Document Processing, Document Search | Recording Device: Camera, scan | Unit: 5,838 documents | Add Dataset to Quote | IMG_OCR_B2B | Appen Global | Document OCR | N/A | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | png | Scans and photographs of business-to-business documents containing printed text. 38% Premium Quality images in 10 languages, 25 countries, including Purchase Order, Payment Advice or Remittance Advice, Order Confirmation and Delivery note. 64% Standard Quality images in various challenging conditions in 11 languages, 34 countries, in a wider range of categories including Complaints or Return, Delivery advice, Delivery note, Dunning, Goods receipt, Invoice, Offer, Order confirmation, Pay slip, Payment Advice or Remittance Advice, Purchase Order, Receipt, and Supplier load | Business-to-business printed text document OCR | |
Dataset Text | Cantonese (China) Simplified Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 37,000 words | Add Dataset to Quote | yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 37,000 | N/A | text | Simplified | Cantonese (China) Simplified Pronunciation Dictionary | |
Dataset Text | Cantonese (China) Traditional Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | yue_HKG_POS | Appen Global | Part of Speech Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Traditional | Cantonese (China) Traditional Part of Speech Dictionary | |
Dataset Text | Cantonese (China) Traditional Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | yue_HKG_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Traditional | Cantonese (China) Traditional Pronunciation Dictionary | |
Dataset Text | Catalan (Spain) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | cat_ESP_PHON | Appen Global | Pronunciation Dictionary | Catalan | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Catalan (Spain) Pronunciation Dictionary | ||
Dataset Text | Cebuano (Philippines) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 21,000 words | Add Dataset to Quote | ceb_PHL_PHON | Appen Global | Pronunciation Dictionary | Cebuano | Philippines | N/A | N/A | N/A | N/A | 21,000 | N/A | text | Cebuano (Philippines) Pronunciation Dictionary | ||
Dataset Audio | Chinese (multinational foreigner) scripted smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 200 hours | Add Dataset to Quote | FOREIGNER_ASR001_CN | Appen China | Scripted Speech | Mandarin Chinese | China | Low background noise | 309 | 1 | 16 | wav | Dataset contains audio with corresponding text prompts. This database contains 200 hours of foreigners speaking Chinese from the following countries: Argentina, Egypt, Australia, Russia, the Philippines, Kazakhstan, Korea, Kyrgyzstan, Canada, Kuala Lumpur, Kenya, Laos, Malaysia, Mauritius, the United States, Mongolia, South Africa, Japan, Tajikistan, Thailand, Turkey, Hong Kong, Singapore, India, Indonesia, Vietnam. There is no data from South Korea or Brazil, and no data recorded by minors. Each session lasts about an hour; sentence duration ranges between 3-10 seconds. The content is in the form of an individual reading while being recorded on a mobile phone in a home/office environment. Sensitive data and personal information have been scrubbed. | Chinese (multinational foreigner) scripted smartphone | |||
Dataset Text | Chinese command and control prompt response corpus | Common Use Cases: LLM training, Command and Control, TV Player, Device Control | Recording Device: N/A | Unit: 20000 sentences | Add Dataset to Quote | DSDH_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | txt | App Commands, Question & response pairs, tagged with categories and intents, for use with TV player controls, lifestyle services, and device control. | Chinese command and control prompt response corpus | |
Dataset Text | Chinese instruction set sentence corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 200000 sentences | Add Dataset to Quote | ZLJ_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | txt | Sentence corpus containing 10 sections: Question and answer class instruction set (ZLCWD_corpus_CN); Multi-turn dialogue instruction set prompt-response pairs (ZLCDH_corpus_CN); Logical reasoning instruction set prompt (Topic) – response (Reasoning) pairs (ZLCLJ_corpus_CN); Programming code language instruction set prompt-response pairs, e.g. python (ZLCDM_corpus_CN); Brainstorming instruction set question-answer pairs (ZLCTN_corpus_CN); Text rewriting-instruction set original-rewritten pairs (ZLCGX_corpus_CN); Text to reply to security – command set (ZLCAQ_corpus_CN); Roleplay instruction set prompt-response pairs (ZLCJS_corpus_CN); Long text-instruction set prompt-response pairs (ZLCCWB_corpus_CN); Text generation instruction set prompt-response pairs (ZLCWB_corpus_CN) | Chinese instruction set sentence corpus | |
Dataset Text | Chinese multidisciplinary test questions corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 319970 sentences | Add Dataset to Quote | MTQ_CN | Appen China | LLM training | Chinese | China | N/A | N/A | 1 | N/A | N/A | N/A | json | Corpus containing 8 sections of middle-high school prompt response pairs with metadata: Subject, Grade, Knowledge Area, Question Type, Question, Answer, Difficulty. Question categories included are: Geography – 30k sentences (DLT001_CN); Chemistry – 40k sentences (HXT001_CN); History – 40k sentences (LST001_CN); Biology – 40k sentences (SWT001_CN); Math – 30k sentences (SXT001_CN); Physics – 40k sentences (WLT001_CN); Chinese language – 10k sentences (YWT001_CN); Political – 40k sentences (ZZT001_CN) | Chinese multidisciplinary test questions corpus | |
Dataset Text | Chinese news text summaries corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 20000 summaries | Add Dataset to Quote | DMXWB_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | xls | Summaries of main events and themes from news data in 15 domains (Finance and economics, Lottery ticket, House property, Share certificate, Home furnishings, Education, Science & Technology, Society & people's livelihood, Fashion, Politics, Sports activities, Constellation, Game, Entertainment) | Chinese news text summaries corpus | |
Dataset Audio | Croatian (Croatia) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 39 hours | Add Dataset to Quote | CRO_ASR001 | Appen Global | Conversational Speech | Croatian | Croatia | Low background noise (home/office) | 200 | 2 | Available on request | 23,919 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 53% landline, 47% mobile. Conversations cover a range of topics including: News & Current Affairs, Health and Sport. | Croatian (Croatia) conversational telephony | |
Dataset Text | Croatian (Croatia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | hrv_HRV_PHON | Appen Global | Pronunciation Dictionary | Croatian | Croatia | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Croatian (Croatia) Pronunciation Dictionary | ||
Dataset Audio | Croatian (Croatia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 11 hours | Add Dataset to Quote | CRO_ASR002 | GlobalPhone | Scripted Speech | Croatian | Croatia | Mixed (quiet home/office, public, outdoor) | 94 | 1 | 4,499 | 23,929 | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Croatian (Croatia) scripted microphone | |
Dataset Audio | Croatian (Croatia) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 263 hours | Add Dataset to Quote | CRO_ASR003_CN | Appen China | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 243 | 1 | 73,467 | 136,140 | 16 | wav | Dataset contains audio with corresponding text prompts | Croatian (Croatia) scripted smartphone | |
Dataset Text | Czech (Czech Republic) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 50,000 words | Add Dataset to Quote | ces_CZE_PHON | Appen Global | Pronunciation Dictionary | Czech | Czech Republic | N/A | N/A | N/A | N/A | 50,000 | N/A | text | Czech (Czech Republic) Pronunciation Dictionary | ||
Dataset Audio | Czech (Czech Republic) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 31 hours | Add Dataset to Quote | CZE_ASR001 | GlobalPhone | Scripted Speech | Czech | Czech Republic | Mixed (quiet home/office, public, outdoor) | 102 | 1 | 12,425 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Czech (Czech Republic) scripted microphone | |
Dataset Audio | Czech (Czech Republic) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 93 hours | Add Dataset to Quote | Czech SpeechDat(E) Dataset | Nuance | Scripted Speech | Czech | Czech Republic | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich words and sentences. | Czech (Czech Republic) scripted telephony | |
Dataset Text | Danish (Denmark) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 100,000 words | Add Dataset to Quote | dan_DNK_POS | Appen Global | Part of Speech Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 100,000 | N/A | text | Danish (Denmark) Part of Speech Dictionary | ||
Dataset Text | Danish (Denmark) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 107,000 words | Add Dataset to Quote | dan_DNK_PHON | Appen Global | Pronunciation Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 107,000 | N/A | text | Danish (Denmark) Pronunciation Dictionary | ||
Dataset Audio | Danish (Denmark) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 53 hours | Add Dataset to Quote | Speecon Danish | Nuance | Scripted Speech | Danish | Denmark | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers. | Danish (Denmark) scripted microphone | |
Dataset Audio | Dari (Afghanistan) broadcast | Common Use Cases: ASR, Automatic Captioning, Keyword Spotting | Recording Device: N/A | Unit: 49 hours | Add Dataset to Quote | DAR_BRC001 | Appen Global | Broadcast Speech | Dari | Afghanistan | Low background noise (studio) | N/A | 1 | Available on request | Available on request | 16 – 48 | wav | Dataset is fully transcribed and timestamped. Pronunciation lexicon not currently available but can be developed upon request. Dataset is largely speech only and does not include music or advertisements. Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors). 13% landline, 87% mobile. | Dari (Afghanistan) broadcast | |
Dataset Audio | Dari (Afghanistan) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 40 hours | Add Dataset to Quote | DAR_ASR001 | Appen Global | Conversational Speech | Dari | Afghanistan | Low background noise | 500 | 2 | Available on request | 11,168 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. Dataset is largely speech only and does not include music or advertisements. 13% landline, 87% mobile. | Dari (Afghanistan) conversational telephony | |
Dataset Text | Dari (Afghanistan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 31,000 words | Add Dataset to Quote | prs_AFG_PHON | Appen Global | Pronunciation Dictionary | Dari | Afghanistan | N/A | N/A | N/A | N/A | 31,000 | N/A | text | Dari (Afghanistan) Pronunciation Dictionary | ||
Dataset Text | Dholuo (Kenya) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 23,000 words | Add Dataset to Quote | luo_KEN_PHON | Appen Global | Pronunciation Dictionary | Dholuo | Kenya | N/A | N/A | N/A | N/A | 23,000 | N/A | text | Dholuo (Kenya) Pronunciation Dictionary | ||
Dataset Audio | Dongbei dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Recording pen/microphone | Unit: 84.6 hours | Add Dataset to Quote | DONGBEI_ASR001_CN | Appen China | Conversational Speech | Dongbei dialect | China | Low background noise | 268 | 1 | 16 | wav | Audio only; transcription not included. Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District. Northeast suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Dongbei dialect (China) Conversational Speech | |||
Dataset Audio | Dongbei dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 75.2 hours | Add Dataset to Quote | DONGBEI_ASR002_CN | Appen China | Conversational Speech | Dongbei dialect | China | Low background noise | 185 | 1 | 8 | wav | Audio only; transcription not included. Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District. Northeast suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Dongbei dialect (China) Conversational Speech | |||
Dataset Audio | Dutch (Belgium) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 47 hours | Add Dataset to Quote | Speecon Dutch from Belgium | Nuance | Scripted Speech | Dutch | Belgium | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Dutch (Belgium) scripted microphone | |
Dataset Audio | Dutch (Belgium) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Microphone | Unit: 80 hours | Add Dataset to Quote | Flemish SpeechDat(II) FDB-1000 (FIXED1FL) | Nuance | Scripted Speech | Dutch | Belgium | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control |
Dutch (Belgium) scripted telephony | |
Dataset Audio | Dutch (Netherlands & Belgium) scripted in-car | Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment | Recording Device: Microphone and mobile phone | Unit: 27 hours | Add Dataset to Quote | Dutch and Flemish SpeechDat-Car | Nuance | Scripted Speech | Dutch | Netherlands – Belgium | Mixed (in-car) | 302 | 5 | 15,100 | Available on request | 16 and 8 | Available on request | Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report 125 prompts per adult speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech |
Dutch (Netherlands & Belgium) scripted in-car | |
Dataset Audio | Dutch (Netherlands) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 36 hours | Add Dataset to Quote | NLD_ASR001 | Appen Global | Conversational Speech | Dutch | Netherlands | Low background noise | 200 | 2 | Available on request | 14,964 | 8 | alaw | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers 51% landline, 49% mobile Conversations cover a range of topics including: Holiday/Leisure, Work and Sport. |
Dutch (Netherlands) conversational telephony | |
Dataset Text | Dutch (Netherlands) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 45,000 words | Add Dataset to Quote | nld_NLD_PHON | Appen Global | Pronunciation Dictionary | Dutch | Netherlands | N/A | N/A | N/A | N/A | 45,000 | N/A | text | Dutch (Netherlands) Pronunciation Dictionary | ||
Dataset Audio | Dutch (Netherlands) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 68 hours | Add Dataset to Quote | Speecon Dutch from the Netherlands | Nuance | Scripted Speech | Dutch | Netherlands | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Dutch (Netherlands) scripted microphone | |
Dataset Image | East African facial images | Common Use Cases: Facial Recognition | Recording Device: Camera | Unit: 13,500 images | Add Dataset to Quote | IMG_FACE_KEN_CN | Appen China | Human Face | N/A | Kenya | Mixed background and lighting conditions | 99 | N/A | N/A | N/A | N/A | jpg | Images of 99 participants across a variety of conditions (lighting, distance, camera angles, facial expressions, and accessories). 9 different lighting conditions, 2 different distances between the participant's face and the smartphone, 7 different camera angles. All combinations of these 3 requirements were completed per participant. A random 32 images per person include occlusions such as sunglasses, masks, wigs or hats. A random 36 shots include different facial expressions including stare, open mouth, pout, smile and frown. Lighting conditions: indoor normal light, outdoor normal light, indoor backlight, outdoor backlight, indoor ordinary dark light, full black screen fill light, point light source (white light, street light), neon light (monochromatic red, green and blue, multi-color mixed light), side glare. Distances: 30cm and 50cm. Camera angles: front, left 45°, right 45°, left 15°, right 15°, top 30°, bottom 30° |
East African facial images | |
Dataset Image | Electric vehicles in the elevator room | Common Use Cases: Image recognition | Recording Device: N/A | Unit: 17,132 images | Add Dataset to Quote | IMG_DTDDC_CN | Appen China | Image recognition | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | jpg | Images of electric vehicles in elevator scenes, with no more than 5 images of the same electric vehicle. All images are annotated (monitoring perspective) with bounding boxes and labels (person, vehicle) | Electric vehicles in the elevator room | |
Dataset Audio | English (Arabic – Levant/Egypt) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 28 hours | Add Dataset to Quote | ENA_ASR001 | Appen Global | Conversational Speech | English | Egypt | Low background noise | 250 | 2 | 33,057 | 5,619 | 8 | alaw or wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Average length of calls: 10-15 mins |
English (Arabic – Levant/Egypt) conversational telephony | |
Dataset Text | English (Australia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 157,000 words | Add Dataset to Quote | eng_AUS_PHON | Appen Global | Pronunciation Dictionary | English | Australia | N/A | N/A | N/A | N/A | 157,000 | N/A | text | English (Australia) Pronunciation Dictionary | ||
Dataset Audio | English (Australia) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 92 hours | Add Dataset to Quote | AUS_ASR001 | Appen Global | Scripted Speech | English | Australia | Low background noise (home/office) | 500 | 1 | 82,500 | 35,137 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 162 prompts (read speech) per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words |
English (Australia) scripted telephony | |
Dataset Audio | English (Australia) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 118 hours | Add Dataset to Quote | AUS_ASR002 | Appen Global | Scripted Speech | English | Australia | Mixed | 1,000 | 1 | 75,000 | 18,952 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 75 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words The prompts are a mixture of ‘read’ and ‘elicited’ items where 5 prompts per script are ‘spontaneous free speech’ |
English (Australia) scripted telephony | |
Dataset Text | English (Canada) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 3,000 words | Add Dataset to Quote | eng_CAN_POS | Appen Global | Part of Speech Dictionary | English | Canada | N/A | N/A | N/A | N/A | 3,000 | N/A | text | English (Canada) Part of Speech Dictionary | ||
Dataset Text | English (Canada) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 50,000 words | Add Dataset to Quote | eng_CAN_PHON | Appen Global | Pronunciation Dictionary | English | Canada | N/A | N/A | N/A | N/A | 50,000 | N/A | text | English (Canada) Pronunciation Dictionary | ||
Dataset Audio | English (Canada) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 144 hours | Add Dataset to Quote | ENC_ASR001 | Appen Global | Scripted Speech | English | Canada | Mixed | 1,000 | 1 | 99,000 | 12,483 | 8 | alaw or wav | Fully transcribed to SALA II/SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 99 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words |
English (Canada) scripted telephony | |
Dataset Text | English (Hong Kong) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 18,000 words | Add Dataset to Quote | eng_HKG_PHON | Appen Global | Pronunciation Dictionary | English | Hong Kong | N/A | N/A | N/A | N/A | 18,000 | N/A | text | English (Hong Kong) Pronunciation Dictionary | ||
Dataset Audio | English (India) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 143 hours | Add Dataset to Quote | ENI_ASR003 | Appen Global | Conversational Speech | English | India | Mixed (home, car, public place, outdoor) | 272 | 1 | 145,559 | 20,746 | 48 | wav | Dataset is fully transcribed and timestamped Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
English (India) conversational smartphone | |
Dataset Audio | English (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 67 hours | Add Dataset to Quote | ENI_ASR002 | Appen Global | Conversational Speech | English | India | Low background noise | 540 | 2 | 77,565 | 11,646 | 8 | alaw or wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words 271 telephony conversations are recorded for this project |
English (India) conversational telephony | |
Dataset Text | English (India) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 13,000 words | Add Dataset to Quote | eng_IND_POS | Appen Global | Part of Speech Dictionary | English | India | N/A | N/A | N/A | N/A | 13,000 | N/A | text | English (India) Part of Speech Dictionary | ||
Dataset Text | English (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | eng_IND_PHON | Appen Global | Pronunciation Dictionary | English | India | N/A | N/A | N/A | N/A | 60,000 | N/A | text | English (India) Pronunciation Dictionary | ||
Dataset Audio | English (India) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 217 hours | Add Dataset to Quote | ENI_ASR001 | Appen Global | Scripted Speech | English | India | Mixed | 2,358 | 1 | 115,541 | 9,190 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words 49 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words |
English (India) scripted telephony | |
Dataset Text | English (Ireland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | eng_IRL_PHON | Appen Global | Pronunciation Dictionary | English | Ireland | N/A | N/A | N/A | N/A | 12,000 | N/A | text | English (Ireland) Pronunciation Dictionary | ||
Dataset Text | English (NZ) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 28,000 words | Add Dataset to Quote | eng_NZL_PHON | Appen Global | Pronunciation Dictionary | English | NZ | N/A | N/A | N/A | N/A | 28,000 | N/A | text | English (NZ) Pronunciation Dictionary | ||
Dataset Audio | English (Philippines) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 53 hours | Add Dataset to Quote | ENF_ASR001 | Appen Global | Conversational Speech | English | Philippines | Low background noise | 450 | 2 | 41,602 | 7,272 | 8 | alaw or wav | Dataset is fully transcribed and time stamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Average length of calls: 10-15 mins |
English (Philippines) conversational telephony | |
Dataset Text | English (Philippines) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 7,000 words | Add Dataset to Quote | eng_PHL_PHON | Appen Global | Pronunciation Dictionary | English | Philippines | N/A | N/A | N/A | N/A | 7,000 | N/A | text | English (Philippines) Pronunciation Dictionary | ||
Dataset Text | English (United Arab Emirates (UAE)) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 5,000 words | Add Dataset to Quote | eng_ARE_PHON | Appen Global | Pronunciation Dictionary | English | United Arab Emirates (UAE) | N/A | N/A | N/A | N/A | 5,000 | N/A | text | English (United Arab Emirates (UAE)) Pronunciation Dictionary | ||
Dataset Audio | English (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 33 hours | Add Dataset to Quote | OrienTel English as spoken in the United Arab Emirates | Nuance | Scripted Speech | English | United Arab Emirates (UAE) | Low background noise | 500 | 1 | 25,500 | 3,990 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control |
English (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | English (United Kingdom) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 150 hours | Add Dataset to Quote | UKE_ASR001 | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,175 | 2 | 298,562 | 24,193 | 8 | wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words This version contains full 15-minute calls – there is a reduced version with 5 min calls named UKE_ASR001B. |
English (United Kingdom) conversational telephony | |
Dataset Audio | English (United Kingdom) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 50 hours | Add Dataset to Quote | UKE_ASR001B | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,150 | 2 | Available on request | 13,192 | 8 | wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words This version contains full 5-minute calls – there is an expanded version with 15 min calls named UKE_ASR001. |
English (United Kingdom) conversational telephony | |
Dataset Text | English (United Kingdom) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 155,000 words | Add Dataset to Quote | eng_GBR_POS | Appen Global | Part of Speech Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 155,000 | N/A | text | English (United Kingdom) Part of Speech Dictionary | ||
Dataset Text | English (United Kingdom) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 195,000 words | Add Dataset to Quote | eng_GBR_PHON | Appen Global | Pronunciation Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 195,000 | N/A | text | English (United Kingdom) Pronunciation Dictionary | ||
Dataset Audio | English (United Kingdom) TTS female scripted microphone | Common Use Cases: TTS | Recording Device: Headset microphone | Unit: 11 hours | Add Dataset to Quote | TC-STAR female baseline voice Laura | Nuance | TTS Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked) Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription |
English (United Kingdom) TTS female scripted microphone | |
Dataset Audio | English (United Kingdom) TTS male scripted microphone | Common Use Cases: TTS | Recording Device: Headset microphone | Unit: 7 hours | Add Dataset to Quote | TC-STAR male baseline voice Ian | Nuance | TTS Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked) Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription |
English (United Kingdom) TTS male scripted microphone | |
Dataset Audio | English (United States – African American) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 50 hours | Add Dataset to Quote | USE_ASR004 | Appen Global | Conversational Speech | English | United States | Mixed (home, car, public place, outdoor) | 94 | 1 | 58,316 | 13,468 | 48 | wav | Dataset is fully transcribed and timestamped Two person conversations recorded on a smartphone covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
English (United States – African American) conversational smartphone | |
Dataset Audio | English (United States) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: |