Dataset Video | Action videos | Common Use Cases: Movement detection, Human Body Movement, Action Classification | Recording Device: Camera | Unit: 300 videos | Add Dataset to Quote | HUMAN_BODY_VID003 | Appen Global | Human Body Movement | N/A | United States | Mixed lighting conditions | Available upon request | N/A | N/A | N/A | N/A | mp4 | Participants videoed themselves completing an action from a given prompt, e.g. “zip up a jacket”, “drink a beverage” | Action videos | |
Dataset Text | Albanian (Albania) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | sqi_ALB_PHON | Appen Global | Pronunciation Dictionary | Albanian | Albania | N/A | N/A | N/A | N/A | 12,000 | N/A | text | Albanian (Albania) Pronunciation Dictionary | ||
Dataset Text | Amharic (Ethiopia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 49,000 words | Add Dataset to Quote | amh_ETH_PHON | Appen Global | Pronunciation Dictionary | Amharic | Ethiopia | N/A | N/A | N/A | N/A | 49,000 | N/A | text | Amharic (Ethiopia) Pronunciation Dictionary | ||
Dataset Text | Arabic (Algeria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 11,000 words | Add Dataset to Quote | ara_DZA_PHON | Appen Global | Pronunciation Dictionary | Arabic | Algeria | N/A | N/A | N/A | N/A | 11,000 | N/A | text | Arabic (Algeria) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Eastern Algeria) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 29 hours | Add Dataset to Quote | EAR_ASR001 | Appen Global | Conversational Speech | Arabic | Algeria | Low background noise (home/office) | 496 | 2 | 32,899 | 15,314 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed. 8% landline, 92% mobile. | Arabic (Eastern Algeria) conversational telephony | |
Dataset Text | Arabic (Egypt) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | ara_EGY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Egypt | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Arabic (Egypt) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Egypt) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 352 hours | Add Dataset to Quote | ARE_ASR001_CN | Appen China | Scripted Speech | Arabic | Egypt | Low background noise (home/office) | 627 | 1 | 128,908 | 207,576 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. | Arabic (Egypt) scripted smartphone | |
Dataset Text | Arabic (Iraq) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 13,000 words | Add Dataset to Quote | ara_IRQ_POS | Appen Global | Part of Speech Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 13,000 | N/A | text | Arabic (Iraq) Part of Speech Dictionary | ||
Dataset Text | Arabic (Iraq) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | ara_IRQ_PHON | Appen Global | Pronunciation Dictionary | Arabic | Iraq | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Person names | Arabic (Iraq) Pronunciation Dictionary | |
Dataset Audio | Arabic (Levantine) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 32 hours | Add Dataset to Quote | ARU_ASR002 | Appen Global | Scripted Speech | Arabic | United Arab Emirates | Low background noise (studio) | 100 | 1 | Available upon request | Available upon request | 48 | wav | Audio with corresponding text prompts. Transcription can be developed upon request. | Arabic (Levantine) scripted microphone | |
Dataset Text | Arabic (Libya) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 48,000 words | Add Dataset to Quote | ara_LBY_PHON | Appen Global | Pronunciation Dictionary | Arabic | Libya | N/A | N/A | N/A | N/A | 48,000 | N/A | text | Arabic (Libya) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Modern Standard Arabic) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 12 hours | Add Dataset to Quote | MSA_ASR001 | GlobalPhone | Scripted Speech | Arabic | Tunisia | Mixed (quiet home/office, public, outdoor) | 78 | 1 | 4,908 | 40,000 | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Arabic (Modern Standard Arabic) scripted microphone | |
Dataset Audio | Arabic (Morocco) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 33 hours | Add Dataset to Quote | ARY_ASR001 | Appen Global | Conversational Speech | Arabic | Morocco | Low background noise | 180 | 2 | 80,430 | 23,836 | 8 | alaw | Each speaker participated in 1 to 4 conversations. Speakers are identified by a unique 4-digit speaker ID which is recorded in the demographic file. Transcription is available in original script and fully reversible Romanised version with accompanying pronunciation lexicon. English translation of product transcription is available (ARY_MT001, ARY_ASRMT001). | Arabic (Morocco) conversational telephony | |
Dataset Text | Arabic (Morocco) conversational telephony translation | Common Use Cases: MT, Chatbot, Conversational AI | Recording Device: N/A | Unit: 80,430 utterances | Add Dataset to Quote | ARY_MT001 | Appen Global | Conversational Translation | Arabic | Morocco | N/A | 180 | N/A | 80,430 | 23,836 | N/A | text | Corresponding audio, transcription, fully reversible romanised transcription and pronunciation lexicon data are available (ARY_ASR001, ARY_ASRMT001) | Arabic (Morocco) conversational telephony translation | |
Dataset Text | Arabic (Morocco) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | ara_MAR_PHON | Appen Global | Pronunciation Dictionary | Arabic | Morocco | N/A | N/A | N/A | N/A | 60,000 | N/A | text | Arabic (Morocco) Pronunciation Dictionary | ||
Dataset Text | Arabic (MSA) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | arb_MSA_PHON | Appen Global | Pronunciation Dictionary | Arabic (Standard) | N/A | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Arabic (MSA) Pronunciation Dictionary | ||
Dataset Audio | Arabic (Saudi Arabia) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 322 hours | Add Dataset to Quote | ARS_ASR001_CN | Appen China | Scripted Speech | Arabic | Saudi Arabia | Low background noise (home/office) | 227 | 1 | 104,574 | 156,282 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. 300-1000 prompts per speaker covering general content including education, sports, entertainment, travel, culture and technology. | Arabic (Saudi Arabia) scripted smartphone | |
Dataset Text | Arabic (Sudan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 17,000 words | Add Dataset to Quote | ara_SDN_PHON | Appen Global | Pronunciation Dictionary | Arabic | Sudan | N/A | N/A | N/A | N/A | 17,000 | N/A | text | Arabic (Sudan) Pronunciation Dictionary | ||
Dataset Image | Arabic (UAE) printed text annotated OCR | Common Use Cases: Document Processing, Document Search, Text detection | Recording Device: Mobile phone | Unit: 20,000 images | Add Dataset to Quote | IMG_OCR_ARU002_CN | Appen China | Document OCR | Arabic | United Arab Emirates | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | jpg + json | Images containing text, such as slogans, advertisements, maps, store names, menus, product outer packaging and signboards. Includes bounding box annotations, 50 boxes per image, with all text annotated (Arabic, non-Arabic characters, special characters, numbers) | Arabic (UAE) printed text annotated OCR | |
Dataset Text | Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 75,000 words | Add Dataset to Quote | ara_ARE_PHON | Appen Global | Pronunciation Dictionary | Arabic | United Arab Emirates (UAE) | N/A | N/A | N/A | N/A | 75,000 | N/A | text | Arabic (United Arab Emirates (UAE)) Pronunciation Dictionary | ||
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 170 hours | Add Dataset to Quote | ARU_ASR001_CN | Appen China | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise (home/office) | 133 | 1 | 42,352 | 85,775 | 16 | wav | Dataset contains audio with corresponding text prompts. Text prompts are not vowelised. | Arabic (United Arab Emirates (UAE)) scripted smartphone | |
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 48 hours | Add Dataset to Quote | OrienTel United Arab Emirates MCA (Modern Colloquial Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise | 880 | 1 | 43,000 | 22,197 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control. | Arabic (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | Arabic (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 31 hours | Add Dataset to Quote | OrienTel United Arab Emirates MSA (Modern Standard Arabic) | Nuance | Scripted Speech | Arabic | United Arab Emirates (UAE) | Low background noise | 500 | 1 | 24,500 | 13,348 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 49 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control. | Arabic (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 86 hours | Add Dataset to Quote | CGA_ASR001 | Appen Global | Scripted Speech | Arabic | United Arab Emirates (UAE) – Saudi Arabia | Low background noise (home/office) | 150 | 4 | 42,000 | 19,245 | 16 | raw PCM | Fully transcribed with acoustic event tagging derived from the SpeechDAT conventions. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. All transcriptions fully vowelized. 280 prompts per speaker including 30 person names (first name and family name) from a set of 15, 10 single isolated digits 0-10, 8-digit sequences (randomly generated), 200 phonetically balanced sentences, 30 x 10-word phonetically balanced word strings. | Arabic (United Arab Emirates (UAE)/ Saudi Arabia) scripted microphone | |
Dataset Text | Arabic NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 20,774 sentences | Add Dataset to Quote | ARB_NER001 | Appen Global | News NER | Arabic (Standard) | N/A | N/A | N/A | N/A | 20,774 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Arabic NER news text | |
Dataset Text | Assamese (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | asm_IND_PHON | Appen Global | Pronunciation Dictionary | Assamese | India | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Assamese (India) Pronunciation Dictionary | ||
Dataset Audio | Baby crying audio | Common Use Cases: Baby Monitor, Security & Other Consumer Applications | Recording Device: Mobile phone | Unit: 70 hours | Add Dataset to Quote | CRY_ASR001_CN | Appen China | Human Sound | N/A | China | Low background noise (home/office) | 566 | 1 | N/A | N/A | 16 | wav | Crying sound of babies 0-3 years old, each lasting around 2 minutes. Audio only. | Baby crying audio | |
Dataset Audio | Bahasa Indonesia conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 31 hours | Add Dataset to Quote | BAH_ASR001 | Appen Global | Conversational Speech | Indonesian | Indonesia | Low background noise | 1,002 | 2 | 30,695 | 11,480 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For a large proportion of calls, only one half of the conversation was collected and transcribed. 28% landline, 72% mobile. | Bahasa Indonesia conversational telephony | |
Dataset Image | Baking Pictures | Common Use Cases: Image recognition | Recording Device: N/A | Unit: 6,000 images | Add Dataset to Quote | IMG_BAKE_CN | Appen China | Image recognition | N/A | China | N/A | N/A | N/A | N/A | N/A | N/A | jpg | (Data source: website) This dataset includes pictures of baked goods: 2,000 images of bread, 2,000 images of cakes, and 2,000 images of cookies. Image resolution: 640 × 640 px. Shooting angle: either vertically downward or slightly offset. | Baking Pictures | |
Dataset Text | Basque (Spain) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | eus_ESP_PHON | Appen Global | Pronunciation Dictionary | Basque | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Basque (Spain) Pronunciation Dictionary | ||
Dataset Audio | Bengali (Bangladesh) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 47 hours | Add Dataset to Quote | BEN_ASR001 | Appen Global | Conversational Speech | Bengali | Bangladesh | Mixed (in-car, roadside, home/office) | 1,000 | 2 | 108,923 | 17,922 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. | Bengali (Bangladesh) conversational telephony | |
Dataset Text | Bengali (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 29,000 words | Add Dataset to Quote | ben_IND_PHON | Appen Global | Pronunciation Dictionary | Bengali | India | N/A | N/A | N/A | N/A | 29,000 | N/A | text | Bengali (India) Pronunciation Dictionary | ||
Dataset Audio | Bulgarian (Bulgaria) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 38 hours | Add Dataset to Quote | BUL_ASR001 | Appen Global | Conversational Speech | Bulgarian | Bulgaria | Low background noise (home/office) | 217 | 2 | 86,453 | 22,342 | 8 | alaw or wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 49% landline, 51% mobile. Conversations cover a range of topics including: Holiday/Leisure, Movies/TV Shows and Work. | Bulgarian (Bulgaria) conversational telephony | |
Dataset Text | Bulgarian (Bulgaria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 55,000 words | Add Dataset to Quote | bul_BGR_PHON | Appen Global | Pronunciation Dictionary | Bulgarian | Bulgaria | N/A | N/A | N/A | N/A | 55,000 | N/A | text | Bulgarian (Bulgaria) Pronunciation Dictionary | ||
Dataset Audio | Bulgarian (Bulgaria) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 22 hours | Add Dataset to Quote | BUL_ASR002 | GlobalPhone | Scripted Speech | Bulgarian | Bulgaria | Mixed (quiet home/office, public, outdoor) | 77 | 1 | 8,674 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Bulgarian (Bulgaria) scripted microphone | |
Dataset Image | Business-to-business printed text document OCR | Common Use Cases: Document Processing, Document Search, Text detection | Recording Device: Camera, scan | Unit: 5,838 documents | Add Dataset to Quote | IMG_OCR_B2B | Appen Global | Document OCR | N/A | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | png | Scans and photographs of business-to-business documents containing printed text. 38% Premium Quality images in 10 languages, 25 countries, including Purchase Order, Payment Advice or Remittance Advice, Order Confirmation and Delivery note. 64% Standard Quality images in various challenging conditions in 11 languages, 34 countries, in a wider range of categories including Complaints or Return, Delivery advice, Delivery note, Dunning, Goods receipt, Invoice, Offer, Order confirmation, Pay slip, Payment Advice or Remittance Advice, Purchase Order, Receipt, and Supplier load | Business-to-business printed text document OCR | |
Dataset Audio | Cantonese (China) business dialogues | Common Use Cases: ASR, Conversational AI, Speech Analytics, Business | Recording Device: Mobile phone | Unit: 98.35 hours | Add Dataset to Quote | YYDH_ASR001_CN | Appen China | Conversational Speech | Cantonese | China | Low background noise (home/office) | 241 | 2 | Available upon request | Available upon request | 16 | wav | Business meetings and conversations audio with transcription and timestamping, from a variety of industries. 30% male participants, 70% female. | Cantonese (China) business dialogues | |
Dataset Text | Cantonese (China) Simplified Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 37,000 words | Add Dataset to Quote | yue_CHN_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 37,000 | N/A | text | Simplified | Cantonese (China) Simplified Pronunciation Dictionary | |
Dataset Text | Cantonese (China) Traditional Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | yue_HKG_POS | Appen Global | Part of Speech Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Traditional | Cantonese (China) Traditional Part of Speech Dictionary | |
Dataset Text | Cantonese (China) Traditional Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 40,000 words | Add Dataset to Quote | yue_HKG_PHON | Appen Global | Pronunciation Dictionary | Cantonese | China | N/A | N/A | N/A | N/A | 40,000 | N/A | text | Traditional | Cantonese (China) Traditional Pronunciation Dictionary | |
Dataset Text | Catalan (Spain) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | cat_ESP_PHON | Appen Global | Pronunciation Dictionary | Catalan | Spain | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Catalan (Spain) Pronunciation Dictionary | ||
Dataset Text | Cebuano (Philippines) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 21,000 words | Add Dataset to Quote | ceb_PHL_PHON | Appen Global | Pronunciation Dictionary | Cebuano | Philippines | N/A | N/A | N/A | N/A | 21,000 | N/A | text | Cebuano (Philippines) Pronunciation Dictionary | ||
Dataset Audio | Chinese (multinational foreigner) scripted smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 200 hours | Add Dataset to Quote | FOREIGNER_ASR001_CN | Appen China | Scripted Speech | Mandarin Chinese | China | Low background noise | 309 | 1 | 16 | wav | Dataset contains audio with corresponding text prompts. This database contains 200 hours of foreigners speaking Chinese from the following countries: Argentina, Egypt, Australia, Russia, the Philippines, Kazakhstan, Korea, Kyrgyzstan, Canada, Kuala Lumpur, Kenya, Laos, Malaysia, Mauritius, the United States, Mongolia, South Africa, Japan, Tajikistan, Thailand, Turkey, Hong Kong, Singapore, India, Indonesia, Vietnam. There is no data from South Korea or Brazil, and no data recorded by minors. Each session lasts about an hour; sentence duration ranges from 3 to 10 seconds. The content is in the form of an individual reading while being recorded on a mobile phone in a home/office environment. Sensitive data and personal information have been scrubbed. | Chinese (multinational foreigner) scripted smartphone | |||
Dataset Text | Chinese and English related texts | Common Use Cases: LLM training | Recording Device: N/A | Unit: 400,000 | Add Dataset to Quote | GLWB_CN | Appen China | LLM training | English/Chinese | N/A | N/A | N/A | N/A | Available upon request | Available upon request | N/A | json | This dataset contains long article content in English and Chinese, sourced from publicly available books including title, author and language metadata. | Chinese and English related texts | |
Dataset Text | Chinese command and control prompt response corpus | Common Use Cases: LLM training, Command and Control, TV Player, Device Control | Recording Device: N/A | Unit: 20,000 sentences | Add Dataset to Quote | DSDH_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | txt | App commands and question-response pairs, tagged with categories and intents, for use with TV player controls, lifestyle services, and device control. | Chinese command and control prompt response corpus | |
Dataset Text | Chinese instruction set sentence corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 200,000 sentences | Add Dataset to Quote | ZLJ_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | txt | Sentence corpus containing 10 sections: Question and answer class instruction set (ZLCWD_corpus_CN); Multi-turn dialogue instruction set prompt-response pairs (ZLCDH_corpus_CN); Logical reasoning instruction set prompt (Topic) – response (Reasoning) pairs (ZLCLJ_corpus_CN); Programming code language instruction set prompt-response pairs, e.g. python (ZLCDM_corpus_CN); Brainstorming instruction set question-answer pairs (ZLCTN_corpus_CN); Text rewriting-instruction set original-rewritten pairs (ZLCGX_corpus_CN); Text to reply to security – command set (ZLCAQ_corpus_CN); Roleplay instruction set prompt-response pairs (ZLCJS_corpus_CN); Long text-instruction set prompt-response pairs (ZLCCWB_corpus_CN); Text generation instruction set prompt-response pairs (ZLCWB_corpus_CN) | Chinese instruction set sentence corpus | |
Dataset Text | Chinese multidisciplinary test questions corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 319,970 sentences | Add Dataset to Quote | MTQ_CN | Appen China | LLM training | Chinese | China | N/A | N/A | 1 | N/A | N/A | N/A | json | Corpus containing 8 sections of middle and high school prompt-response pairs with metadata: Subject, Grade, Knowledge Area, Question Type, Question, Answer, Difficulty. Question categories included are: Geography – 30k sentences (DLT001_CN); Chemistry – 40k sentences (HXT001_CN); History – 40k sentences (LST001_CN); Biology – 40k sentences (SWT001_CN); Math – 30k sentences (SXT001_CN); Physics – 40k sentences (WLT001_CN); Chinese language – 10k sentences (YWT001_CN); Political – 40k sentences (ZZT001_CN) | Chinese multidisciplinary test questions corpus | |
Dataset Text | Chinese news text summaries corpus | Common Use Cases: LLM training | Recording Device: N/A | Unit: 20000 summaries | Add Dataset to Quote | DMXWB_corpus_CN | Appen China | LLM training | Chinese | China | N/A | N/A | N/A | N/A | N/A | N/A | xls | Summaries of main events and themes from news data in 15 domains (Finance and economics, Lottery ticket, House property, Share certificate, Home furnishings, Education, Science & Technology, Society & people’s livelihood, Fashion, Politics, Sports activities, Constellation, Game, Entertainment) | Chinese news text summaries corpus | |
Dataset Text | Code Q&A Dataset | Common Use Cases: LLM training | Recording Device: N/A | Unit: 12 million pairs | Add Dataset to Quote | DM_CNRD | Appen China | LLM training | English | N/A | N/A | N/A | N/A | Available upon request | Available upon request | N/A | json | This is a text dataset of coding questions and answers in English, sourced through web-spidering with subsequent cleanup and filtering. Programming languages include: JavaScript, Python, Java, C#, PHP, C++, SQL, R, C, Swift. Topics include: computer, scientific research technology, wholesale and retail, finance, entertainment and other industries | Code Q&A Dataset | |
Dataset Audio | Croatian (Croatia) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 39 hours | Add Dataset to Quote | CRO_ASR001 | Appen Global | Conversational Speech | Croatian | Croatia | Low background noise (home/office) | 200 | 2 | Available on request | 23,919 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 53% landline, 47% mobile. Conversations cover a range of topics including: News & Current Affairs, Health and Sport. | Croatian (Croatia) conversational telephony | |
Dataset Text | Croatian (Croatia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | hrv_HRV_PHON | Appen Global | Pronunciation Dictionary | Croatian | Croatia | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Croatian (Croatia) Pronunciation Dictionary | ||
Dataset Audio | Croatian (Croatia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 11 hours | Add Dataset to Quote | CRO_ASR002 | GlobalPhone | Scripted Speech | Croatian | Croatia | Mixed (quiet home/office, public, outdoor) | 94 | 1 | 4,499 | 23,929 | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Croatian (Croatia) scripted microphone | |
Dataset Audio | Croatian (Croatia) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 263 hours | Add Dataset to Quote | CRO_ASR003_CN | Appen China | Scripted Speech | Croatian | Croatia | Low background noise (home/office) | 243 | 1 | 73,467 | 136,140 | 16 | wav | Dataset contains audio with corresponding text prompts | Croatian (Croatia) scripted smartphone | |
Dataset Text | Czech (Czech Republic) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 50,000 words | Add Dataset to Quote | ces_CZE_PHON | Appen Global | Pronunciation Dictionary | Czech | Czech Republic | N/A | N/A | N/A | N/A | 50,000 | N/A | text | Czech (Czech Republic) Pronunciation Dictionary | ||
Dataset Audio | Czech (Czech Republic) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 31 hours | Add Dataset to Quote | CZE_ASR001 | GlobalPhone | Scripted Speech | Czech | Czech Republic | Mixed (quiet home/office, public, outdoor) | 102 | 1 | 12,425 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Czech (Czech Republic) scripted microphone | |
Dataset Audio | Czech (Czech Republic) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 93 hours | Add Dataset to Quote | Czech SpeechDat(E) Dataset | Nuance | Scripted Speech | Czech | Czech Republic | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, and phonetically rich words and sentences. | Czech (Czech Republic) scripted telephony | |
Dataset Text | Danish (Denmark) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 100,000 words | Add Dataset to Quote | dan_DNK_POS | Appen Global | Part of Speech Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 100,000 | N/A | text | Danish (Denmark) Part of Speech Dictionary | ||
Dataset Text | Danish (Denmark) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 107,000 words | Add Dataset to Quote | dan_DNK_PHON | Appen Global | Pronunciation Dictionary | Danish | Denmark | N/A | N/A | N/A | N/A | 107,000 | N/A | text | Danish (Denmark) Pronunciation Dictionary | ||
Dataset Audio | Danish (Denmark) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 53 hours | Add Dataset to Quote | Speecon Danish | Nuance | Scripted Speech | Danish | Denmark | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Danish (Denmark) scripted microphone | |
Dataset Audio | Dari (Afghanistan) broadcast | Common Use Cases: ASR, Automatic Captioning, Keyword Spotting | Recording Device: N/A | Unit: 49 hours | Add Dataset to Quote | DAR_BRC001 | Appen Global | Broadcast Speech | Dari | Afghanistan | Low background noise (studio) | N/A | 1 | Available on request | Available on request | 16 – 48 | wav | Dataset is fully transcribed and timestamped Pronunciation lexicon not currently available but can be developed upon request Dataset is largely speech only and does not include music or advertisements Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors) 13% landline, 87% mobile |
Dari (Afghanistan) broadcast | |
Dataset Audio | Dari (Afghanistan) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 40 hours | Add Dataset to Quote | DAR_ASR001 | Appen Global | Conversational Speech | Dari | Afghanistan | Low background noise | 500 | 2 | Available on request | 11,168 | 8 | alaw | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Dataset is largely speech only and does not include music or advertisements 13% landline, 87% mobile |
Dari (Afghanistan) conversational telephony | |
Dataset Text | Dari (Afghanistan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 31,000 words | Add Dataset to Quote | prs_AFG_PHON | Appen Global | Pronunciation Dictionary | Dari | Afghanistan | N/A | N/A | N/A | N/A | 31,000 | N/A | text | Dari (Afghanistan) Pronunciation Dictionary | ||
Dataset Text | Dholuo (Kenya) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 23,000 words | Add Dataset to Quote | luo_KEN_PHON | Appen Global | Pronunciation Dictionary | Dholuo | Kenya | N/A | N/A | N/A | N/A | 23,000 | N/A | text | Dholuo (Kenya) Pronunciation Dictionary | ||
Dataset Audio | Dongbei dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Recording pen/microphone | Unit: 84.6 hours | Add Dataset to Quote | DONGBEI_ASR001_CN | Appen China | Conversational Speech | Dongbei dialect | China | Low background noise | 268 | 1 | 16 | wav | Audio only; transcription in development for Q1 2025. Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District. Northeast suburb accents are not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Dongbei dialect (China) Conversational Speech | |||
Dataset Audio | Dongbei dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 75.2 hours | Add Dataset to Quote | DONGBEI_ASR002_CN | Appen China | Conversational Speech | Dongbei dialect | China | Low background noise | 185 | 1 | 8 | wav | Audio only; transcription in development for Q1 2025. Audio recordings cover 19 districts: Shenyang Heping District, Shenhe District, Huanggu District, Dadong District, Tiexi District, Lvyuan District, Chaoyang District, Kuancheng District, Erdao District, Nanguan District, Daoli District, Nangang District, Daowai District, Pingfang District, Songbei District, Xiangfang District, Hulan District, Acheng District and Shuangcheng District. Northeast suburb accents are not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Dongbei dialect (China) Conversational Speech | |||
Dataset Audio | Dutch (Belgium) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 47 hours | Add Dataset to Quote | Speecon Dutch from Belgium | Nuance | Scripted Speech | Dutch | Belgium | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Dutch (Belgium) scripted microphone | |
Dataset Audio | Dutch (Belgium) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Microphone | Unit: 80 hours | Add Dataset to Quote | Flemish SpeechDat(II) FDB-1000 (FIXED1FL) | Nuance | Scripted Speech | Dutch | Belgium | Low background noise | 1,000 | 1 | 52,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control |
Dutch (Belgium) scripted telephony | |
Dataset Audio | Dutch (Netherlands & Belgium) scripted in-car | Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment | Recording Device: Microphone and mobile phone | Unit: 27 hours | Add Dataset to Quote | Dutch and Flemish SpeechDat-Car | Nuance | Scripted Speech | Dutch | Netherlands – Belgium | Mixed (in-car) | 302 | 5 | 15,100 | Available on request | 16 and 8 | Available on request | Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report. 125 prompts per adult speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech |
Dutch (Netherlands & Belgium) scripted in-car | |
Dataset Audio | Dutch (Netherlands) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 36 hours | Add Dataset to Quote | NLD_ASR001 | Appen Global | Conversational Speech | Dutch | Netherlands | Low background noise | 200 | 2 | Available on request | 14,964 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 51% landline, 49% mobile. Conversations cover a range of topics including: Holiday/Leisure, Work and Sport. |
Dutch (Netherlands) conversational telephony | |
Dataset Text | Dutch (Netherlands) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 45,000 words | Add Dataset to Quote | nld_NLD_PHON | Appen Global | Pronunciation Dictionary | Dutch | Netherlands | N/A | N/A | N/A | N/A | 45,000 | N/A | text | Dutch (Netherlands) Pronunciation Dictionary | ||
Dataset Audio | Dutch (Netherlands) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 68 hours | Add Dataset to Quote | Speecon Dutch from the Netherlands | Nuance | Scripted Speech | Dutch | Netherlands | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Dutch (Netherlands) scripted microphone | |
Dataset Image | East African facial images | Common Use Cases: Facial Recognition | Recording Device: Camera | Unit: 13500 images | Add Dataset to Quote | IMG_FACE_KEN_CN | Appen China | Human Face | N/A | Kenya | Mixed background and lighting conditions | 99 | N/A | N/A | N/A | N/A | jpg | Images of 99 participants across a variety of conditions (lighting, distance, camera angles, facial expressions, and accessories). 9 different lighting conditions, 2 different distances between the participant's face and the smartphone, 7 different camera angles. All combinations of these 3 requirements were completed per participant. A random 32 images per person include occlusions such as sunglasses, masks, wigs or hats. A random 36 shots include different facial expressions including stare, open mouth, pout mouth, smile and frown. Lighting conditions: indoor normal light, outdoor normal light, indoor backlight, outdoor backlight, indoor ordinary dark light, full black screen fill light, point light source (white light, street light), neon light (monochromatic red, green and blue, multi-color mixed light), side glare. Distances: 30cm and 50cm. Camera angles: front, left 45°, right 45°, left 15°, right 15°, top 30°, bottom 30° |
East African facial images | |
Dataset Image | Electric vehicles in elevators | Common Use Cases: Image recognition | Recording Device: N/A | Unit: 17132 images | Add Dataset to Quote | IMG_DDC_CN | Appen China | Image recognition | N/A | China | N/A | N/A | N/A | N/A | N/A | N/A | jpg | Images of electric vehicles in elevator scenes, with no more than 5 images of the same electric vehicle. All images are annotated (monitoring perspective) with bounding boxes and labels (person, vehicle) | Electric vehicles in elevators | |
Dataset Audio | English (Arabic – Levant/Egypt) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 28 hours | Add Dataset to Quote | ENA_ASR001 | Appen Global | Conversational Speech | English | Egypt | Low background noise | 250 | 2 | 33,057 | 5,619 | 8 | alaw or wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Average length of calls: 10-15 mins |
English (Arabic – Levant/Egypt) conversational telephony | |
Dataset Text | English (Australia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 157,000 words | Add Dataset to Quote | eng_AUS_PHON | Appen Global | Pronunciation Dictionary | English | Australia | N/A | N/A | N/A | N/A | 157,000 | N/A | text | English (Australia) Pronunciation Dictionary | ||
Dataset Audio | English (Australia) scripted telephony | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone and landline | Unit: 92 hours | Add Dataset to Quote | AUS_ASR001 | Appen Global | Scripted Speech | English | Australia | Low background noise (home/office) | 500 | 1 | 82,500 | 35,137 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 162 prompts (read speech) per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words |
English (Australia) scripted telephony | |
Dataset Audio | English (Australia) scripted telephony | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone and landline | Unit: 118 hours | Add Dataset to Quote | AUS_ASR002 | Appen Global | Scripted Speech | English | Australia | Mixed | 1,000 | 1 | 75,000 | 18,952 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 75 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words The prompts are a mixture of ‘read’ and ‘elicited’ items where 5 prompts per script are ‘spontaneous free speech’ |
English (Australia) scripted telephony | |
Dataset Text | English (Canada) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 3,000 words | Add Dataset to Quote | eng_CAN_POS | Appen Global | Part of Speech Dictionary | English | Canada | N/A | N/A | N/A | N/A | 3,000 | N/A | text | English (Canada) Part of Speech Dictionary | ||
Dataset Text | English (Canada) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 50,000 words | Add Dataset to Quote | eng_CAN_PHON | Appen Global | Pronunciation Dictionary | English | Canada | N/A | N/A | N/A | N/A | 50,000 | N/A | text | English (Canada) Pronunciation Dictionary | ||
Dataset Audio | English (Canada) scripted telephony | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone and landline | Unit: 144 hours | Add Dataset to Quote | ENC_ASR001 | Appen Global | Scripted Speech | English | Canada | Mixed | 1,000 | 1 | 99,000 | 12,483 | 8 | alaw or wav | Fully transcribed to SALA II/SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 99 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words |
English (Canada) scripted telephony | |
Dataset Text | English (Hong Kong) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 18,000 words | Add Dataset to Quote | eng_HKG_PHON | Appen Global | Pronunciation Dictionary | English | Hong Kong | N/A | N/A | N/A | N/A | 18,000 | N/A | text | English (Hong Kong) Pronunciation Dictionary | ||
Dataset Audio | English (India) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 143 hours | Add Dataset to Quote | ENI_ASR003 | Appen Global | Conversational Speech | English | India | Mixed (home, car, public place, outdoor) | 272 | 1 | 145559 | 20746 | 48 | wav | Dataset is fully transcribed and time stamped Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
English (India) conversational smartphone | |
Dataset Audio | English (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 67 hours | Add Dataset to Quote | ENI_ASR002 | Appen Global | Conversational Speech | English | India | Low background noise | 540 | 2 | 77,565 | 11,646 | 8 | alaw or wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words 271 telephony conversations are recorded for this project |
English (India) conversational telephony | |
Dataset Text | English (India) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 13,000 words | Add Dataset to Quote | eng_IND_POS | Appen Global | Part of Speech Dictionary | English | India | N/A | N/A | N/A | N/A | 13,000 | N/A | text | English (India) Part of Speech Dictionary | ||
Dataset Text | English (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | eng_IND_PHON | Appen Global | Pronunciation Dictionary | English | India | N/A | N/A | N/A | N/A | 60,000 | N/A | text | English (India) Pronunciation Dictionary | ||
Dataset Audio | English (India) scripted telephony | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone and landline | Unit: 217 hours | Add Dataset to Quote | ENI_ASR001 | Appen Global | Scripted Speech | English | India | Mixed | 2,358 | 1 | 115,541 | 9,190 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words 49 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words |
English (India) scripted telephony | |
Dataset Text | English (Ireland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | eng_IRL_PHON | Appen Global | Pronunciation Dictionary | English | Ireland | N/A | N/A | N/A | N/A | 12,000 | N/A | text | English (Ireland) Pronunciation Dictionary | ||
Dataset Text | English (NZ) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 28,000 words | Add Dataset to Quote | eng_NZL_PHON | Appen Global | Pronunciation Dictionary | English | NZ | N/A | N/A | N/A | N/A | 28,000 | N/A | text | English (NZ) Pronunciation Dictionary | ||
Dataset Audio | English (Philippines) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 53 hours | Add Dataset to Quote | ENF_ASR001 | Appen Global | Conversational Speech | English | Philippines | Low background noise | 450 | 2 | 41,602 | 7,272 | 8 | alaw or wav | Dataset is fully transcribed and time stamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Average length of calls: 10-15 mins |
English (Philippines) conversational telephony | |
Dataset Text | English (Philippines) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 7,000 words | Add Dataset to Quote | eng_PHL_PHON | Appen Global | Pronunciation Dictionary | English | Philippines | N/A | N/A | N/A | N/A | 7,000 | N/A | text | English (Philippines) Pronunciation Dictionary | ||
Dataset Text | English (United Arab Emirates (UAE)) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 5,000 words | Add Dataset to Quote | eng_ARE_PHON | Appen Global | Pronunciation Dictionary | English | United Arab Emirates (UAE) | N/A | N/A | N/A | N/A | 5,000 | N/A | text | English (United Arab Emirates (UAE)) Pronunciation Dictionary | ||
Dataset Audio | English (United Arab Emirates (UAE)) scripted telephony | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone and landline | Unit: 33 hours | Add Dataset to Quote | OrienTel English as spoken in the United Arab Emirates | Nuance | Scripted Speech | English | United Arab Emirates (UAE) | Low background noise | 500 | 1 | 25,500 | 3990 | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control |
English (United Arab Emirates (UAE)) scripted telephony | |
Dataset Audio | English (United Kingdom) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 150 hours | Add Dataset to Quote | UKE_ASR001 | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,175 | 2 | 298,562 | 24,193 | 8 | wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words This version contains full 15-minute calls – there is a reduced version with 5 min calls named UKE_ASR001B. |
English (United Kingdom) conversational telephony | |
Dataset Audio | English (United Kingdom) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 50 hours | Add Dataset to Quote | UKE_ASR001B | Appen Global | Conversational Speech | English | United Kingdom | Low background noise | 1,150 | 2 | Available on request | 13,192 | 8 | wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words This version contains full 5-minute calls – there is an expanded version with 15 min calls named UKE_ASR001. |
English (United Kingdom) conversational telephony | |
Dataset Text | English (United Kingdom) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 155,000 words | Add Dataset to Quote | eng_GBR_POS | Appen Global | Part of Speech Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 155,000 | N/A | text | English (United Kingdom) Part of Speech Dictionary | ||
Dataset Text | English (United Kingdom) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 195,000 words | Add Dataset to Quote | eng_GBR_PHON | Appen Global | Pronunciation Dictionary | English | United Kingdom | N/A | N/A | N/A | N/A | 195,000 | N/A | text | English (United Kingdom) Pronunciation Dictionary | ||
Dataset Audio | English (United Kingdom) TTS female scripted microphone | Common Use Cases: TTS | Recording Device: Headset microphone | Unit: 11 hours | Add Dataset to Quote | TC-STAR female baseline voice Laura | Nuance | TTS Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked) Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription |
English (United Kingdom) TTS female scripted microphone | |
Dataset Audio | English (United Kingdom) TTS male scripted microphone | Common Use Cases: TTS | Recording Device: Headset microphone | Unit: 7 hours | Add Dataset to Quote | TC-STAR male baseline voice Ian | Nuance | TTS Scripted Speech | English | United Kingdom | Low background noise (studio) | 1 | 1 | Available on request | Available on request | 96 | Available on request | Dataset includes manual orthographic transcription, automatic segmentation into phonemes, automatic generation of pitch marks (where a certain percentage of phonetic segments and pitch marks has been manually checked) Dataset is accompanied by a pronunciation lexicon with POS, lemma and phonetic transcription |
English (United Kingdom) TTS male scripted microphone | |
Dataset Audio | English (United States – African American) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 50 hours | Add Dataset to Quote | USE_ASR004 | Appen Global | Conversational Speech | English | United States | Mixed (home, car, public place, outdoor) | 94 | 1 | 58316 | 13468 | 48 | wav | Dataset is fully transcribed and time stamped Two person conversations recorded on a smartphone covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
English (United States – African American) conversational smartphone | |
Dataset Text | English (United States) Adversarial prompts for LLM red teaming **in development** | Common Use Cases: LLM training, LLM Red teaming | Recording Device: N/A | Unit: 500 prompts | Add Dataset to Quote | eng_USA_LLM002 | Appen Global | LLM training | English | United States | N/A | Available upon request | N/A | 500 | Available upon request | N/A | csv | Adversarial prompts in English for LLM red teaming. 500 already collected with QA underway; total of 1000 prompts planned for development. Can be prioritized upon request. Please enquire about our optional benchmarking service to rate harm levels in model responses. |
English (United States) Adversarial prompts for LLM red teaming **in development** | |
Dataset Audio | English (United States) answers to questions **in development** | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 65 hours | Add Dataset to Quote | USE_ASR007 | Appen Global | Scripted Speech | English | United States | Low background noise | 100 | 1 | 40000 | Available on request | 16 | wav | Participants recorded themselves on a smartphone answering prompted questions, e.g. “What’s your favourite food?”. There were 100 prompts per session, and 1000 unique prompts across the whole collection. Audio is collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) answers to questions **in development** | |
Dataset Text | English (United States) Chatbot conversations **in development** | Common Use Cases: LLM training, Chatbot, Virtual Assistant | Recording Device: N/A | Unit: 1800 prompts | Add Dataset to Quote | eng_USA_LLM003 | Appen Global | LLM training | English | United States | N/A | Available upon request | N/A | Available upon request | Available upon request | N/A | csv | Real-world conversations between a user and a chatbot. Conversations were collected by asking questions of customer service chatbots on websites from a variety of industries to trigger 3 different conversation formats: chatbot input, chatbot solutions and follow up, solution instructions. Domains include: financial, retail, entertainment, IT. Data collected; QA is underway, expected to be ready by EOY 2024. Can be prioritized upon request. |
English (United States) Chatbot conversations **in development** | |
Dataset Audio | English (United States) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 1000 hours | Add Dataset to Quote | USE_ASR003 | Appen Global | Conversational Speech | English | United States | Low background noise | 1,856 | 1 | 500,000 | 52,586 | 16 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. Conversations cover a wide variety of topics including: study/major/work, hometown, living arrangements, weather and seasons, punctuality, TV programs/film |
English (United States) conversational smartphone | |
Dataset Audio | English (United States) conversational smartphone **in development** | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 2 hours | Add Dataset to Quote | USE_ASR008 | Appen Global | Conversational Speech | English | United States | Low background noise | 6 | 1 | Available on request | Available on request | 16 | wav | Two participants recorded themselves on a smartphone having a 10-15 minute natural conversation on a selected topic, e.g. Hobbies, History. Includes some AAVE participants and some toxic speech. Audio is collected; QA, transcription and labelling are underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) conversational smartphone **in development** | |
Dataset Audio | English (United States) device commands **in development** | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 40 hours | Add Dataset to Quote | USE_ASR006 | Appen Global | Scripted Speech | English | United States | Low background noise | 100 | 1 | 23000 | Available on request | 16 | wav | Participants recorded themselves on a smartphone saying device commands in response to a prompt, e.g. “Tell the device to disable shuffle mode”. There were 94 prompts per session, and 280 unique prompts across the whole collection. Audio is collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) device commands **in development** | |
Dataset Text | English (United States) Harmful and harmless prompts and responses **in development** | Common Use Cases: LLM training, LLM Red teaming, Chatbot | Recording Device: N/A | Unit: 300 prompts | Add Dataset to Quote | eng_USA_LLM001 | Appen Global | LLM training | English | United States | N/A | Available upon request | N/A | 300 | Available upon request | N/A | csv | Prompts and responses annotated for Harm category, Intensity, Voice, and Phrasing. Data collected; QA is underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) Harmful and harmless prompts and responses **in development** | |
Dataset Text | English (United States) Medical Terms Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 8,000 words | Add Dataset to Quote | eng_USA_Med_PHON | Appen Global | Pronunciation Dictionary | English | United States | N/A | N/A | N/A | N/A | 8,000 | N/A | text | Pronunciation dictionary of medical terms with their associated transcriptions and domain tagging. Data is comprised of medical words extracted from PubMed abstracts, as well as pharmaceutical drug names collected by Appen through web-spidering. Pronunciations were processed by native speakers of US English and domain tagging done by a team of US English native speakers with medical transcription or other medical qualifications and experience. Domains include: Anatomy, Biochem/biological, Condition, General, Organisation, Person, Pharmaceutical, Procedure. |
English (United States) Medical Terms Pronunciation Dictionary | |
Dataset Text | English (United States) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 263,000 words | Add Dataset to Quote | eng_USA_POS | Appen Global | Part of Speech Dictionary | English | United States | N/A | N/A | N/A | N/A | 263,000 | N/A | text | English (United States) Part of Speech Dictionary | ||
Dataset Image | English (United States) product labels **in development** | Common Use Cases: Image recognition, Object recognition, Retail | Recording Device: Camera | Unit: 60000 images | Add Dataset to Quote | IMG_OCR_USE_ProductLabels | Appen Global | Image recognition | English | United States | Mixed lighting conditions | Available upon request | N/A | N/A | N/A | N/A | jpg | Photos of various products including the label. Annotated for category, e.g. Food, health & beauty, pet supplies. Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request. No bounding box or text transcription annotation planned so far, but can be developed upon request. |
English (United States) product labels **in development** | |
Dataset Text | English (United States) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 358,000 words | Add Dataset to Quote | eng_USA_PHON | Appen Global | Pronunciation Dictionary | English | United States | N/A | N/A | N/A | N/A | 358,000 | N/A | text | English (United States) Pronunciation Dictionary | ||
Dataset Image | English (United States) receipts **in development** | Common Use Cases: Image recognition, Object recognition, OCR, Text detection | Recording Device: Camera | Unit: 4500 images | Add Dataset to Quote | IMG_OCR_USE_RECEIPTS | Appen Global | OCR | English | United States | Mixed lighting conditions | Available upon request | N/A | N/A | N/A | N/A | jpg | Photos of receipts, bills or invoices, annotated with bounding boxes and transcribed text. PII redacted. Data collected, QA is underway, expected to be ready end of Q1 2025. Can be prioritized upon request. |
English (United States) receipts **in development** | |
Dataset Audio | English (United States) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 53 hours | Add Dataset to Quote | Speecon English (USA) database | Nuance | Scripted Speech | English | United States | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
English (United States) scripted microphone | |
Dataset Audio | English (United States) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 62 hours | Add Dataset to Quote | USE_ASR001 | Appen Global | Scripted Speech | English | United States | Low background noise (studio) | 200 | 2 | 80,000 | 18,318 | 48 | raw PCM or wav PCM | Dataset is fully transcribed and timestamped. Dataset is formatted according to SALA II/SpeechDAT style conventions. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. Each speaker read 400 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words |
English (United States) scripted microphone | |
Dataset Audio | English (United States) scripted sentences **in development** | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 500 hours | Add Dataset to Quote | USE_ASR005 | Appen Global | Scripted Speech | English | United States | Low background noise | 250 | 1 | 300000 | Available on request | 16 | wav | Participants recorded themselves on a smartphone reading out prompted sentences. There were 96 prompts per session, and 9000 unique sentences across the whole collection. Audio is already collected; QA and transcription are underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) scripted sentences **in development** | |
Dataset Image | English (United States) street signs | Common Use Cases: Image recognition, Object recognition, OCR, Text detection | Recording Device: Camera | Unit: 669 images | Add Dataset to Quote | IMG_OCR_USE_STREET001 | Appen Global | OCR | English | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | png | Photographs of street signs, 51% traffic signs and 49% other. English from 18 locales. | English (United States) street signs | |
Dataset Image | English (United States) street signs **in development** | Common Use Cases: Image recognition, Object recognition, OCR, Text detection | Recording Device: Camera | Unit: 3500 images | Add Dataset to Quote | IMG_OCR_USE_STREET002 | Appen Global | OCR | English | United States | Mixed lighting conditions | Available upon request | N/A | N/A | Available upon request | N/A | jpg | Photos of US street signs, annotated with bounding boxes, transcribed text, and text description of the sign. Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) street signs **in development** | |
Dataset Image | English (United States) symbols **in development** | Common Use Cases: Image recognition, Object recognition, OCR | Recording Device: Camera | Unit: 1500 images | Add Dataset to Quote | IMG_SYMBOLS_US | Appen Global | OCR | English | United States | Mixed lighting conditions | Available upon request | N/A | N/A | N/A | N/A | jpg | Photos of symbols – small pictures that communicate an action or warning (e.g. recycling or laundering instructions) encountered in everyday life, with text descriptions. Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request. |
English (United States) symbols **in development** | |
Dataset Text | English (United States) Text message conversations | Common Use Cases: Chatbot , Virtual Assistant , Conversational AI | Recording Device: N/A | Unit: 100 conversations | Add Dataset to Quote | eng_USA_SMS003 | Appen Global | Text messages | English | United States | N/A | Available upon request | N/A | Available upon request | Available upon request | N/A | tsv | Short WhatsApp and SMS text message conversations (20-400 words), labelled for topic | English (United States) Text message conversations | |
Dataset Audio | English (United States) Ultra High-Volume labeled speech | Common Use Cases: ASR, Conversational AI, Speech Analytics, Automatic Captioning, In Car HMI & Entertainment, Virtual Assistant | Recording Device: N/A | Unit: 1196 hours | Add Dataset to Quote | USE_UHV001 | Appen Global | Broadcast Speech | English | United States | Low background noise | 20472 | 1 | 423371 | 110265 | 16 | wav | Customised packaging available. High quality labelled speech datasets of web-sourced licensable broadcast audio data, curated to ensure representative speaker demographic distributions, and filtered through human quality checks. 12.6M total words. Utterance-level labelling includes: speech transcription, accent identification, speaker identification, verification, gender and age-group detection, domain classification. Domains include: Agriculture & plants, Animals & Pets, Art & Culture, Beauty & Fashion, Career, Clothing, Education, Entertainment, Family & Relationships, Finance & Insurance, Food, Health, History, Hospitality, Legal, Leisure, News & Politics, Religion & Spirituality, Retail, Science & Technology, Social Networks, Sports, Telecom, Travel, Weather, Others |
English (United States) Ultra High-Volume labeled speech | |
Dataset Text | English Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 4454 test cases | Add Dataset to Quote | ENG_ITN001 | Appen Global | Inverse text normalisation | English | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | English Inverse text normalisation | |
Dataset Text | English NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 22,768 sentences | Add Dataset to Quote | ENG_NER001 | Appen Global | News NER | English | N/A | N/A | N/A | N/A | 22,768 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | English NER news text | |
Dataset Image Annotation | European License Plate Detection Annotations | Common Use Cases: License plate detection for vehicles on the road | Recording Device: N/A | Unit: 100,000 bounding boxes | Add Dataset to Quote | LICENSE_ANNO | Appen Global | Image and Video Bounding Box Annotations | N/A | Germany, France, Switzerland | N/A | N/A | N/A | N/A | N/A | N/A | json | This dataset contains 100,000 license plate bounding box annotations of 38,000 images and video frames from the KITTI and Cityscapes datasets, from Germany, France and Switzerland. Metadata associated with the bounding boxes, of the box size and position distributions, is included. The source images and video frames cover real world complex scenes in good/median weather conditions, captured over several months (spring, summer, fall), varying scene layouts, backgrounds, and occlusion. The annotations were carried out by Appen’s in-house workforce. |
European License Plate Detection Annotations | |
Dataset Text | Farsi/Persian NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 19,584 sentences | Add Dataset to Quote | FAR_NER001 | Appen Global | News NER | Iranian Persian | Iran | N/A | N/A | N/A | 19,584 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Farsi/Persian NER news text | |
Dataset Text | Finnish (Finland) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | fin_FIN_POS | Appen Global | Part of Speech Dictionary | Finnish | Finland | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Finnish (Finland) Part of Speech Dictionary | ||
Dataset Image | Finnish (Finland) printed text OCR | Common Use Cases: Document Processing, Document Search, Text detection | Recording Device: Camera | Unit: 7293 images | Add Dataset to Quote | IMG_OCR_FIN_CN | Appen China | Document OCR | Finnish | Finland | Mixed lighting conditions | 4 | N/A | N/A | N/A | N/A | jpg | Images containing text, such as billboards / outer packaging / signage / magazines / menus, etc. | Finnish (Finland) printed text OCR | |
Dataset Text | Finnish (Finland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 86,000 words | Add Dataset to Quote | fin_FIN_PHON | Appen Global | Pronunciation Dictionary | Finnish | Finland | N/A | N/A | N/A | N/A | 86,000 | N/A | text | Finnish (Finland) Pronunciation Dictionary | ||
Dataset Text | French (Algeria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 4,000 words | Add Dataset to Quote | fra_DZA_PHON | Appen Global | Pronunciation Dictionary | French | Algeria | N/A | N/A | N/A | N/A | 4,000 | N/A | text | Arabic script | French (Algeria) Pronunciation Dictionary | |
Dataset Audio | French (Belgium) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 76 hours | Add Dataset to Quote | Belgian French SpeechDat(II) FDB-1000 (FIXED1BF) | Nuance | Scripted Speech | French | Belgium | Low background noise | 1,000 | 1 | 53,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words and spontaneous items for control |
French (Belgium) scripted telephony | |
Dataset Audio | French (Canada) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 9 hours | Add Dataset to Quote | FRC_ASR003 | Appen Global | Conversational Speech | French | Canada | Mixed | 68 | 2 | Available on request | 6,022 | 8 | alaw | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. Average length of calls: 10-15 mins. For the majority of calls, only one half of the conversation was collected and transcribed; however, for a smaller number of calls, both speakers (in-line/out-line) were collected and transcribed |
French (Canada) conversational telephony | |
Dataset Text | French (Canada) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 67,000 words | Add Dataset to Quote | fra_CAN_PHON | Appen Global | Pronunciation Dictionary | French | Canada | N/A | N/A | N/A | N/A | 67,000 | N/A | text | French (Canada) Pronunciation Dictionary | ||
Dataset Audio | French (Canada) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 46 hours | Add Dataset to Quote | FRC_ASR002 | Appen Global | Scripted Speech | French | Canada | Low background noise (home/office) | 150 | 1 | 22,500 | 10,755 | 16 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 150 prompts per speaker including digits, digit strings (randomly generated), addresses and phonetically rich sentences and words |
French (Canada) scripted microphone | |
Dataset Audio | French (Canada) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone | Unit: 131 hours | Add Dataset to Quote | FRC_ASR001 | Appen Global | Scripted Speech | French | Canada | Mixed | 1,000 | 1 | 100,000 | 11,697 | 8 | mulaw | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words. 100 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words |
French (Canada) scripted telephony | |
Dataset Audio | French (France) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 159 hours | Add Dataset to Quote | FRF_ASR004 | Appen Global | Conversational Speech | French | France | Mixed (home, car, public place, outdoor) | 298 | 1 | Available on request | Available on request | 48 | wav | Dataset is fully transcribed and time stamped. Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
French (France) conversational smartphone | |
Dataset Audio | French (France) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 25 hours | Add Dataset to Quote | FRF_ASR001 | Appen Global | Conversational Speech | French | France | Low background noise | 563 | 2 | Available on request | 11,922 | 8 | alaw or wav | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed |
French (France) conversational telephony | |
Dataset Audio | French (France) In-Car | Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment | Recording Device: Microphone and mobile phone | Unit: 113 hours | Add Dataset to Quote | French SpeechDat-Car | Nuance | Scripted Speech | French | France | Mixed (in-car) | 300 | 5 | 37,500 | Available on request | 16 and 8 | Available on request | Dataset is fully transcribed and is accompanied by a pronunciation lexicon and validation report. Approximately 125 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names (some spontaneous), generic command and control items, phonetically rich words and sentences and prompts for spontaneous speech. 113.7 hours |
French (France) In-Car | |
Dataset Text | French (France) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 95,000 words | Add Dataset to Quote | fra_FRA_POS | Appen Global | Part of Speech Dictionary | French | France | N/A | N/A | N/A | N/A | 95,000 | N/A | text | French (France) Part of Speech Dictionary | ||
Dataset Text | French (France) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 112,000 words | Add Dataset to Quote | fra_FRA_PHON | Appen Global | Pronunciation Dictionary | French | France | N/A | N/A | N/A | N/A | 112,000 | N/A | text | French (France) Pronunciation Dictionary | ||
Dataset Audio | French (France) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 26 hours | Add Dataset to Quote | FRF_ASR003 | GlobalPhone | Scripted Speech | French | France | Mixed (quiet home/office, public, outdoor) | 98 | 1 | 10,273 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
French (France) scripted microphone | |
Dataset Audio | French (France) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 41 hours | Add Dataset to Quote | French SpeechDat(II) FDB-1000 | Nuance | Scripted Speech | French | France | Low background noise (home/office) | 1,017 | 1 | 48,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
French (France) scripted telephony | |
Dataset Audio | French (France) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 305 hours | Add Dataset to Quote | French SpeechDat(II) FDB-5000 | Nuance | Scripted Speech | French | France | Low background noise | 5,040 | 1 | 237,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 47 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
French (France) scripted telephony | |
Dataset Audio | French (Luxembourg) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 45 hours | Add Dataset to Quote | Luxembourgish French SpeechDat(II) FDB-500 (FIXED1LF) | Nuance | Scripted Speech | French | Luxembourg | Low background noise | 614 | 1 | 32,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
French (Luxembourg) telephony | |
Dataset Text | French Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 3274 test cases | Add Dataset to Quote | FRA_ITN001 | Appen Global | Inverse text normalisation | French | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | French Inverse text normalisation | |
Dataset Video | Garments image and video collection **in development** | Common Use Cases: Image recognition, Object recognition, Retail , e-commerce | Recording Device: Camera | Unit: 300 sessions | Add Dataset to Quote | IMG_VID_GARMENTS_US | Appen Global | Image recognition | N/A | United States | Mixed lighting conditions | Available upon request | 1 | N/A | N/A | N/A | jpg, mp4, mov | Participants took 2 pictures (front and back) of an item of clothing, and a 60-second video of themselves wearing the garment and moving to various angles. Metadata includes demographics, body measurements, labels for garment category (e.g. t-shirt, trousers) and description. Data collected, QA is underway, expected to be ready Q1 2025. Can be prioritized upon request. |
Garments image and video collection **in development** | |
Dataset Text | Georgian (Georgia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 67,000 words | Add Dataset to Quote | kat_GEO_PHON | Appen Global | Pronunciation Dictionary | Georgian | Georgia | N/A | N/A | N/A | N/A | 67,000 | N/A | text | Georgian (Georgia) Pronunciation Dictionary | ||
Dataset Audio | German (Germany) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 104 hours | Add Dataset to Quote | DEU_ASR004 | Appen Global | Conversational Speech | German | Germany | Mixed (home, car, public place, outdoor) | 198 | 1 | Available on request | Available on request | 48 | wav | Dataset is fully transcribed and time stamped. Two person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request |
German (Germany) conversational smartphone | |
Dataset Text | German (Germany) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 146,000 words | Add Dataset to Quote | deu_DEU_PHON | Appen Global | Pronunciation Dictionary | German | Germany | N/A | N/A | N/A | N/A | 146,000 | N/A | text | German (Germany) Pronunciation Dictionary | ||
Dataset Audio | German (Germany) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 16 hours | Add Dataset to Quote | DEU_ASR001 | Appen Global | Scripted Speech | German | Germany | Low background noise (studio) | 127 | 2 | 12,700 | 6,826 | 48 | raw PCM | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. Each speaker read 100 prompts including digits, natural numbers, personal and city names, telephone numbers, generic command and control items, phonetically rich sentences and words |
German (Germany) scripted microphone | |
Dataset Audio | German (Germany) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 25 hours | Add Dataset to Quote | DEU_ASR003 | GlobalPhone | Scripted Speech | German | Germany | Mixed (quiet home/office, public, outdoor) | 77 | 1 | 10,085 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
German (Germany) scripted microphone | |
Dataset Audio | German (Germany) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 31 hours | Add Dataset to Quote | German SpeechDat (II) FDB-1000 | Nuance | Scripted Speech | German | Germany | Low background noise (home/office) | 988 | 1 | 43,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
German (Germany) telephony | |
Dataset Audio | German (Germany) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 268 hours | Add Dataset to Quote | German SpeechDat(II) FDB-4000 | Nuance | Scripted Speech | German | Germany | Low background noise (home/office) | 4,000 | 1 | 160,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
German (Germany) telephony | |
Dataset Audio | German (Luxembourg) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 33 hours | Add Dataset to Quote | Luxembourgish German SpeechDat(II) FDB-500 (FIXED1LG) | Nuance | Scripted Speech | German | Luxembourg | Low background noise | 500 | 1 | 26,500 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 53 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
German (Luxembourg) telephony | |
Dataset Text | German (Switzerland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 27,000 words | Add Dataset to Quote | deu_CHE_PHON | Appen Global | Pronunciation Dictionary | German | Switzerland | N/A | N/A | N/A | N/A | 27,000 | N/A | text | German (Switzerland) Pronunciation Dictionary | ||
Dataset Audio | German (Switzerland) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 53 hours | Add Dataset to Quote | Speecon German (Switzerland) database | Nuance | Scripted Speech | German | Switzerland | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
German (Switzerland) scripted microphone | |
Dataset Audio | German (Turkey) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 31 hours | Add Dataset to Quote | OrienTel German Spoken by Turkish | Nuance | Scripted Speech | German | Turkey | Low background noise | 300 | 1 | 15,600 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 52 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
German (Turkey) telephony | |
Dataset Text | German Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 8001 test cases | Add Dataset to Quote | DEU_ITN001 | Appen Global | Inverse text normalisation | German | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | German Inverse text normalisation | |
Dataset Audio | GlobalPhone Multilingual Text & Speech Database | Common Use Cases: ASR, Language Identification, Multilingual Speech Synthesis, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 450 hours | Add Dataset to Quote | GLOBALPHONE | GlobalPhone | Scripted Speech | N/A | Global coverage | Mixed (quiet home/office, public, outdoor) | 1942 | 1 | 169,755 | Available on request | 16 | wav | Global Phone multilingual corpus; languages can be sold separately or in multi-language packages. Tiered package pricing available. GLOBALPHONE provides multilingual speech and text data in 20 languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Thai, Turkish, and Vietnamese. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. In each language, news article sentences were read by about 100 native speakers. The articles cover national and international political news, as well as economic news from 1995-2011. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone; the same recording equipment was used for all languages. Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
GlobalPhone Multilingual Text & Speech Database | |
Dataset Text | Greek (Greece) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 5,000 words | Add Dataset to Quote | ell_GRC_PHON | Appen Global | Pronunciation Dictionary | Greek | Greece | N/A | N/A | N/A | N/A | 5,000 | N/A | text | Greek (Greece) Pronunciation Dictionary | ||
Dataset Audio | Greek (Greece) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 191 hours | Add Dataset to Quote | GRE_ASR001_CN | Appen China | Scripted Speech | Greek | Greece | Low background noise (home/office) | 287 | 1 | 54,113 | 68,271 | 16 | wav | Dataset contains audio with corresponding text prompts | Greek (Greece) scripted smartphone | |
Dataset Text | Guarani (Paraguay) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 36,000 words | Add Dataset to Quote | grn_PRY_PHON | Appen Global | Pronunciation Dictionary | Guarani | Paraguay | N/A | N/A | N/A | N/A | 36,000 | N/A | text | Guarani (Paraguay) Pronunciation Dictionary | ||
Dataset Text | Haitian Creole (Haiti) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 18,000 words | Add Dataset to Quote | hat_HTI_PHON | Appen Global | Pronunciation Dictionary | Haitian Creole | Haiti | N/A | N/A | N/A | N/A | 18,000 | N/A | text | Haitian Creole (Haiti) Pronunciation Dictionary | ||
Dataset Video | Hand gesture videos **in development** | Common Use Cases: Movement detection, Human Body Movement, Action Classification | Recording Device: Camera | Unit: 5000 videos | Add Dataset to Quote | HUMAN_BODY_VID004 | Appen Global | Human Body Movement | N/A | United States | Mixed lighting conditions | Available upon request | 1 | N/A | N/A | N/A | jpg, mp4, mov | Approximately 11 hours of video of participants making hand gestures, e.g. thumbs up, wave. Videos may include the participant's face or only their hand. Metadata is included describing the type of hand gesture in the video. Data has been collected and QA is underway; expected to be ready Q1 2025. Can be prioritized upon request. |
Hand gesture videos **in development** | |
Dataset Image | Handwritten text document OCR | Common Use Cases: Document Processing, Document Search, Text detection | Recording Device: Camera, scan | Unit: 663 images | Add Dataset to Quote | IMG_OCR_Handwritten | Appen Global | Document OCR | N/A | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | png | Scans and photographs of handwritten forms and handwritten documents. 3 Languages: 11% Arabic, 60% English, 29% Russian | Handwritten text document OCR | |
Dataset Audio | Hausa (Nigeria) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 33 hours | Add Dataset to Quote | HAU_ASR002 | Appen Global | Conversational Speech | Hausa | Nigeria | Low background noise | 200 | 2 | Available on request | 7,949 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. |
Hausa (Nigeria) conversational telephony | |
Dataset Text | Hausa (Nigeria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 11,000 words | Add Dataset to Quote | hau_NGA_PHON | Appen Global | Pronunciation Dictionary | Hausa | Nigeria | N/A | N/A | N/A | N/A | 11,000 | N/A | text | Hausa (Nigeria) Pronunciation Dictionary | ||
Dataset Audio | Hausa scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 20 hours | Add Dataset to Quote | HAU_ASR001 | GlobalPhone | Scripted Speech | Hausa | Cameroon | Mixed (quiet home/office, public, outdoor) | 103 | 1 | 7,895 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). |
Hausa scripted microphone | |
Dataset Audio | Hebrew (Israel) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 34 hours | Add Dataset to Quote | HEB_ASR001 | Appen Global | Conversational Speech | Hebrew | Israel | Low background noise | 200 | 2 | Available on request | 19,250 | 8 | alaw or wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 50% landline, 50% mobile. Conversations cover a range of topics including: Friends, Family and Studies. |
Hebrew (Israel) conversational telephony | |
Dataset Text | Hebrew (Israel) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 31,000 words | Add Dataset to Quote | heb_ISR_PHON | Appen Global | Pronunciation Dictionary | Hebrew | Israel | N/A | N/A | N/A | N/A | 31,000 | N/A | text | Hebrew (Israel) Pronunciation Dictionary | ||
Dataset Audio | Hindi (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics, TTS | Recording Device: Mobile phone and landline | Unit: 32 hours | Add Dataset to Quote | HIN_ASR002 | Appen Global | Conversational Speech | Hindi | India | Mixed | 996 | 2 | Available on request | 12,266 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed. 29% landline, 71% mobile. |
Hindi (India) conversational telephony | |
Dataset Text | Hindi (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 35,000 words | Add Dataset to Quote | hin_IND_PHON | Appen Global | Pronunciation Dictionary | Hindi | India | N/A | N/A | N/A | N/A | 35,000 | N/A | text | Hindi (India) Pronunciation Dictionary | ||
Dataset Audio | Hindi (India) scripted telephony | Common Use Cases: ASR, Virtual Assistant, TTS | Recording Device: Mobile phone | Unit: 224 hours | Add Dataset to Quote | HIN_ASR001 | Appen Global | Scripted Speech | Hindi | India | Low background noise | 1,920 | 1 | 96,000 | 9,853 | 8 | alaw or wav | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words. 50 prompts per speaker including digits, natural numbers, personal, business and place names, web addresses, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words. |
Hindi (India) scripted telephony | |
Dataset Text | Hindi Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 6924 test cases | Add Dataset to Quote | HIN_ITN001 | Appen Global | Inverse text normalisation | Hindi | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | Hindi Inverse text normalisation | |
Dataset Image | Home environment pictures | Common Use Cases: Image recognition | Recording Device: N/A | Unit: 10000 images | Add Dataset to Quote | IMG_HOME_CN | Appen China | Image recognition | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | jpg | (Data source: website) 4000 images in the study room; 6000 images in the living room. No annotation. | Home environment pictures | |
Dataset Text | Hungarian (Hungary) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 500 words | Add Dataset to Quote | hun_HUN_PHON | Appen Global | Pronunciation Dictionary | Hungarian | Hungary | N/A | N/A | N/A | N/A | 500 | N/A | text | Hungarian (Hungary) Pronunciation Dictionary | ||
Dataset Audio | Hungarian (Hungary) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 286 hours | Add Dataset to Quote | HUN_ASR001_CN | Appen China | Scripted Speech | Hungarian | Hungary | Low background noise (home/office) | 254 | 1 | 94,031 | 201,921 | 16 | wav | Dataset contains audio with corresponding text prompts | Hungarian (Hungary) scripted smartphone | |
Dataset Audio | Hungarian (Hungary) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 65 hours | Add Dataset to Quote | Hungarian SpeechDat(E) | Nuance | Scripted Speech | Hungarian | Hungary | Low background noise | 1,000 | 1 | 48,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. |
Hungarian (Hungary) scripted telephony | |
Dataset Text | Icelandic (Iceland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 3,000 words | Add Dataset to Quote | isl_ISL_PHON | Appen Global | Pronunciation Dictionary | Icelandic | Iceland | N/A | N/A | N/A | N/A | 3,000 | N/A | text | Icelandic (Iceland) Pronunciation Dictionary | ||
Dataset Text | Igbo (Nigeria) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 32,000 words | Add Dataset to Quote | ibo_NGA_PHON | Appen Global | Pronunciation Dictionary | Igbo | Nigeria | N/A | N/A | N/A | N/A | 32,000 | N/A | text | Igbo (Nigeria) Pronunciation Dictionary | ||
Dataset Audio | Indonesian (Indonesia) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 150 hours | Add Dataset to Quote | IND_DH_ASR001_CN | Appen China | Conversational Speech | Indonesian | Indonesia | Low background noise | 1000 | 2 | N/A | N/A | 16 | wav | Audio with transcription and timestamping. | Indonesian (Indonesia) conversational telephony | |
Dataset Text | Indonesian (Indonesia) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 10,000 words | Add Dataset to Quote | ind_IDN_POS | Appen Global | Part of Speech Dictionary | Indonesian | Indonesia | N/A | N/A | N/A | N/A | 10,000 | N/A | text | Indonesian (Indonesia) Part of Speech Dictionary | ||
Dataset Text | Indonesian (Indonesia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 95,000 words | Add Dataset to Quote | ind_IDN_PHON | Appen Global | Pronunciation Dictionary | Indonesian | Indonesia | N/A | N/A | N/A | N/A | 95,000 | N/A | text | Indonesian (Indonesia) Pronunciation Dictionary | ||
Dataset Audio | Iranian Persian (Farsi) (Iran) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 30 hours | Add Dataset to Quote | FAR_ASR002 | Appen Global | Conversational Speech | Iranian Persian (Farsi) | Iran | Mixed | 1,000 | 2 | Available on request | 12,358 | 8 | wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. |
Iranian Persian (Farsi) (Iran) conversational telephony | |
Dataset Audio | Iranian Persian (Farsi) (Iran) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 85 hours | Add Dataset to Quote | FAR_ASR001 | Appen Global | Scripted Speech | Iranian Persian (Farsi) | Iran | Mixed | 789 | 1 | 38,400 | 8,716 | 8 | alaw or wav | Fully transcribed to OrienTel type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words. 48 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items, phonetically rich sentences and words. |
Iranian Persian (Farsi) (Iran) scripted telephony | |
Dataset Text | Iranian Persian (Iran) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 1,400,000 words | Add Dataset to Quote | pes_IRN_POS | Appen Global | Part of Speech Dictionary | Iranian Persian | Iran | N/A | N/A | N/A | N/A | 1,400,000 | N/A | text | Iranian Persian (Iran) Part of Speech Dictionary | ||
Dataset Text | Iranian Persian (Iran) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 85,000 words | Add Dataset to Quote | pes_IRN_PHON | Appen Global | Pronunciation Dictionary | Iranian Persian | Iran | N/A | N/A | N/A | N/A | 85,000 | N/A | text | Iranian Persian (Iran) Pronunciation Dictionary | ||
Dataset Audio | Italian (Italy) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 256 hours | Add Dataset to Quote | ITA_ASR005 | Appen Global | Conversational Speech | Italian | Italy | Mixed (home, car, public place, outdoor) | 482 | 1 | Available on request | Available on request | 48 | wav | Dataset is fully transcribed and timestamped. Two-person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request. |
Italian (Italy) conversational smartphone | |
Dataset Audio | Italian (Italy) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 36 hours | Add Dataset to Quote | ITA_ASR003 | Appen Global | Conversational Speech | Italian | Italy | Low background noise | 200 | 2 | Available on request | 18,974 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 50% landline, 50% mobile. Conversations cover a range of topics including: Travel, Family and Holidays. |
Italian (Italy) conversational telephony | |
Dataset Text | Italian (Italy) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 171,000 words | Add Dataset to Quote | ita_ITA_POS | Appen Global | Part of Speech Dictionary | Italian | Italy | N/A | N/A | N/A | N/A | 171,000 | N/A | text | Italian (Italy) Part of Speech Dictionary | ||
Dataset Text | Italian (Italy) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 197,000 words | Add Dataset to Quote | ita_ITA_PHON | Appen Global | Pronunciation Dictionary | Italian | Italy | N/A | N/A | N/A | N/A | 197,000 | N/A | text | Italian (Italy) Pronunciation Dictionary | ||
Dataset Audio | Italian (Italy) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 44 hours | Add Dataset to Quote | ITA_ASR001 | Appen Global | Scripted Speech | Italian | Italy | Mixed | 200 | 4 | 40,000 | 7,316 | 22 | raw PCM | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences. |
Italian (Italy) scripted microphone | |
Dataset Audio | Italian (Italy) scripted microphone in-car | Common Use Cases: ASR, Virtual Assistant, In Car HMI & Entertainment | Recording Device: Microphone | Unit: 47 hours | Add Dataset to Quote | ITA_ASR002 | Appen Global | Scripted Speech | Italian | Italy | Mixed (in-car) | 205 | 4 | 35,875 | 10,366 | 48 | raw PCM | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 350 prompts per speaker including digits, street names, generic command and control items, phonetically rich sentences and words. Each speaker recorded 1 or 2 sessions, including Session 1 in a parked vehicle with the engine running and Session 2 in a vehicle travelling at 60 mph (100 km/h). |
Italian (Italy) scripted microphone in-car | |
Dataset Audio | Italian (Italy) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 38 hours | Add Dataset to Quote | Italian Fixed Network Speech SpeechDat(M) Corpus | Nuance | Scripted Speech | Italian | Italy | Low background noise (home/office) | 1,000 | 1 | 39,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 39 prompts per speaker including isolated and connected digits, natural numbers, money amounts, spelled words, time and date phrases, yes/no questions, city names, common application words, application words in phrases and phonetically rich sentences. |
Italian (Italy) telephony | |
Dataset Audio | Italian (Italy) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 228 hours | Add Dataset to Quote | Italian SpeechDat(II) FDB-3000 | Nuance | Scripted Speech | Italian | Italy | Low background noise (home/office) | 3,040 | 1 | 134,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 44 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. |
Italian (Italy) telephony | |
Dataset Audio | Italian (Italy) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone | Unit: 103 hours | Add Dataset to Quote | Italian SpeechDat(II) MDB-250 | Nuance | Scripted Speech | Italian | Italy | Low background noise (home/office) | 375 | 1 | 19,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 51 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. |
Italian (Italy) telephony | |
Dataset Audio | Italian (Italy) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone | Unit: 13 hours | Add Dataset to Quote | SpeechDat(M) Italian Mobile Network Speech Database | Nuance | Scripted Speech | Italian | Italy | Low background noise (home/office) | 342 | 1 | 13,500 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. |
Italian (Italy) telephony | |
Dataset Audio | Italian (Italy) TTS male scripted microphone | Common Use Cases: TTS | Recording Device: Microphone | Unit: 3 hours | Add Dataset to Quote | ITA_TTS001 | Appen Global | TTS Scripted Speech | Italian | Italy | Low background noise (studio) | 1 | 1 | 3,300 | Available on request | 22 | raw PCM | Dataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset. 3,300 prompts per speaker including phonetically rich sentences. |
Italian (Italy) TTS male scripted microphone | |
Dataset Text | Japanese (Japan) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 269,000 words | Add Dataset to Quote | jpn_JPN_POS | Appen Global | Part of Speech Dictionary | Japanese | Japan | N/A | N/A | N/A | N/A | 269,000 | N/A | text | Japanese (Japan) Part of Speech Dictionary | ||
Dataset Text | Japanese (Japan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 262,000 words | Add Dataset to Quote | jpn_JPN_PHON | Appen Global | Pronunciation Dictionary | Japanese | Japan | N/A | N/A | N/A | N/A | 262,000 | N/A | text | Japanese (Japan) Pronunciation Dictionary | ||
Dataset Audio | Japanese (Japan) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 33 hours | Add Dataset to Quote | JPN_ASR001 | GlobalPhone | Scripted Speech | Japanese | Japan | Mixed (quiet home/office, public, outdoor) | 144 | 1 | 13,067 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). |
Japanese (Japan) scripted microphone | |
Dataset Audio | Japanese (Japan) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 57 hours | Add Dataset to Quote | Speecon Japanese | Nuance | Scripted Speech | Japanese | Japan | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers. |
Japanese (Japan) scripted microphone | |
Dataset Text | Japanese Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 5363 test cases | Add Dataset to Quote | JPN_ITN001 | Appen Global | Inverse text normalisation | Japanese | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | Japanese Inverse text normalisation | |
Dataset Text | Japanese NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 20,629 sentences | Add Dataset to Quote | JPY_NER001 | Appen Global | News NER | Japanese | Japan | N/A | N/A | N/A | 20,629 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Japanese NER news text | |
Dataset Text | Javanese (Indonesia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 22,000 words | Add Dataset to Quote | jav_IDN_PHON | Appen Global | Pronunciation Dictionary | Javanese | Indonesia | N/A | N/A | N/A | N/A | 22,000 | N/A | text | Javanese (Indonesia) Pronunciation Dictionary | ||
Dataset Audio | Kannada (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 15 hours | Add Dataset to Quote | KAN_ASR001 | Appen Global | Conversational Speech | Kannada | India | Mixed | 178 | 2 | Available on request | 15,660 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 15% landline, 85% mobile. |
Kannada (India) conversational telephony | |
Dataset Audio | Kannada (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 57 hours | Add Dataset to Quote | KAN_ASR001A | Appen Global | Conversational Speech | Kannada | India | Mixed | 1,000 | 2 | Available on request | 15,660 | 8 | alaw | Approx. 25% of the dataset sessions are transcribed and timestamped – full transcripts can be made available. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 16% Hands-Free car, 16% Landline quiet, 15% Mobile quiet, 17% Moving vehicle, 19% Public place, 17% Roadside. |
Kannada (India) conversational telephony | |
Dataset Text | Kannada (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 49,000 words | Add Dataset to Quote | kan_IND_PHON | Appen Global | Pronunciation Dictionary | Kannada | India | N/A | N/A | N/A | N/A | 49,000 | N/A | text | Kannada (India) Pronunciation Dictionary | ||
Dataset Text | Kazakh (Kazakhstan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 31,000 words | Add Dataset to Quote | kaz_KAZ_PHON | Appen Global | Pronunciation Dictionary | Kazakh | Kazakhstan | N/A | N/A | N/A | N/A | 31,000 | N/A | text | Kazakh (Kazakhstan) Pronunciation Dictionary | ||
Dataset Text | Korean (South Korea) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 100,000 words | Add Dataset to Quote | kor_KOR_POS | Appen Global | Part of Speech Dictionary | Korean | South Korea | N/A | N/A | N/A | N/A | 100,000 | N/A | text | Korean (South Korea) Part of Speech Dictionary | ||
Dataset Text | Korean (South Korea) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 105,000 words | Add Dataset to Quote | kor_KOR_PHON | Appen Global | Pronunciation Dictionary | Korean | South Korea | N/A | N/A | N/A | N/A | 105,000 | N/A | text | Korean (South Korea) Pronunciation Dictionary | ||
Dataset Audio | Korean (South Korea) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 20 hours | Add Dataset to Quote | KOR_ASR001 | GlobalPhone | Scripted Speech | Korean | South Korea | Mixed (quiet home/office, public, outdoor) | 100 | 1 | 8,107 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). |
Korean (South Korea) scripted microphone | |
Dataset Text | Korean NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 25,830 sentences | Add Dataset to Quote | KOR_NER001 | Appen Global | News NER | Korean | South Korea | N/A | N/A | N/A | 25,830 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Korean NER news text | |
Dataset Text | Kurmanji (Turkey) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | kur_TUR_PHON | Appen Global | Pronunciation Dictionary | Kurmanji | Turkey | N/A | N/A | N/A | N/A | 60,000 | N/A | text | Kurmanji (Turkey) Pronunciation Dictionary | ||
Dataset Text | Lao (Laos) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 9,000 words | Add Dataset to Quote | lao_LAO_PHON | Appen Global | Pronunciation Dictionary | Lao | Laos | N/A | N/A | N/A | N/A | 9,000 | N/A | text | Lao (Laos) Pronunciation Dictionary | ||
Dataset Text | Latin American Spanish Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 3795 test cases | Add Dataset to Quote | SPA_ITN001 | Appen Global | Inverse text normalisation | Spanish | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | Latin American Spanish Inverse text normalisation | |
Dataset Text | Lithuanian (Lithuania) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 71,000 words | Add Dataset to Quote | lit_LTU_PHON | Appen Global | Pronunciation Dictionary | Lithuanian | Lithuania | N/A | N/A | N/A | N/A | 71,000 | N/A | text | Lithuanian (Lithuania) Pronunciation Dictionary | ||
Dataset Video | Location entrance human body movement videos | Common Use Cases: Security, Movement detection, Human body movement recognition | Recording Device: Camera | Unit: 130 videos | Add Dataset to Quote | HUMAN_BODY_VID002 | Appen Global | Human Body Movement | N/A | United Kingdom, Philippines | Mixed background and lighting conditions | 100 | 3 | N/A | N/A | N/A | mp4 | This dataset contains 130 sessions of approximately 1-minute videos of groups of 3-10 people (52% 3-5 people, 41% 6-8 people, 37% 9-10 people) walking towards and through entrances in one location in Exeter, UK (2 camera views: front, top – 20 sessions) and 2 locations in Cavite, Philippines (3 camera views: front, top, side for location 1 – 85 sessions – and front, top, top2 for location 2 – 25 sessions). 2.85 hours of video footage. Varying scenes (e.g. weather conditions, time of day) and participants’ appearance (e.g. wearing masks, hat, glasses, clothes) and actions (e.g. looking at phone, talking, bowing head). No annotation. 2048p resolution, 30 fps, synchronized camera streams. |
Location entrance human body movement videos | |
Dataset Text | Malayalam (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | mal_IND_PHON | Appen Global | Pronunciation Dictionary | Malayalam | India | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Malayalam (India) Pronunciation Dictionary | ||
Dataset Text | Malaysian (Malaysia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 26,000 words | Add Dataset to Quote | msa_MYS_PHON | Appen Global | Pronunciation Dictionary | Malaysian | Malaysia | N/A | N/A | N/A | N/A | 26,000 | N/A | text | Malaysian (Malaysia) Pronunciation Dictionary | ||
Dataset Text | Mandarin (Simplified) (China) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 35,000 words | Add Dataset to Quote | zho_CHN_PHON | Appen Global | Pronunciation Dictionary | Mandarin (Simplified) | China | N/A | N/A | N/A | N/A | 35,000 | N/A | text | Mandarin (Simplified) (China) Pronunciation Dictionary | ||
Dataset Text | Mandarin (Traditional) (Taiwan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 50,000 words | Add Dataset to Quote | zho_TWN_PHON | Appen Global | Pronunciation Dictionary | Mandarin (Traditional) | Taiwan | N/A | N/A | N/A | N/A | 50,000 | N/A | text | Mandarin (Traditional) (Taiwan) Pronunciation Dictionary | ||
Dataset Audio | Mandarin Chinese (China) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 26 hours | Add Dataset to Quote | MAC_ASR002 | GlobalPhone | Scripted Speech | Mandarin Chinese | China | Mixed (quiet home/office, public, outdoor) | 132 | 1 | 10,225 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Mandarin Chinese (China) scripted microphone | |
Dataset Audio | Mandarin Chinese (China) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 323 hours | Add Dataset to Quote | MAC_ASR001 | Appen Global | Scripted Speech | Mandarin Chinese | China | Mixed | 2,000 | 1 | 200,000 | 7,145 | 8 | alaw | Fully transcribed to SpeechDAT type conventions. Dataset is accompanied by a pronunciation lexicon [SAMPA] containing all transcribed words. 98 prompts per speaker including digits, natural numbers, letter strings, personal, place, and business names, confirmation items (yes, no + fuzzy), generic command and control items (from a set of 215), phonetically rich sentences and words. 100% mobile. | Mandarin Chinese (China) scripted telephony | |
Dataset Text | Mandarin Chinese Inverse text normalisation | Common Use Cases: ASR, Language Modelling, Closed Captioning | Recording Device: N/A | Unit: 4230 test cases | Add Dataset to Quote | CMN_ITN001 | Appen Global | Inverse text normalisation | Mandarin Chinese | N/A | N/A | N/A | N/A | N/A | N/A | N/A | text | Inverse text normalisation input-output pairs, in 14 semiotic classes: cardinal, ordinal, decimal, fraction, measure, time, date, currency, letter, digit, electronic, address, postal codes, identifiers | Mandarin Chinese Inverse text normalisation | |
Dataset Text | Mandarin NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 17,313 sentences | Add Dataset to Quote | MAC_NER001 | Appen Global | News NER | Mandarin Chinese | China | N/A | N/A | N/A | 17,313 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Mandarin NER news text | |
Dataset Audio | Marathi (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 15 hours | Add Dataset to Quote | MAR_ASR001 | Appen Global | Conversational Speech | Marathi | India | Mixed | 180 | 2 | Available on request | 11,908 | 8 | alaw | Approx. 29% of the dataset sessions are transcribed and time stamped – full transcripts can be made available. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 17% Hands-Free car, 16% Landline quiet, 19% Mobile quiet, 16% Moving vehicle, 16% Public place, 17% Roadside. | Marathi (India) conversational telephony | |
Dataset Audio | Marathi (India) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 52 hours | Add Dataset to Quote | MAR_ASR001A | Appen Global | Conversational Speech | Marathi | India | Mixed | 1,000 | 2 | Available on request | 11,908 | 8 | alaw | A portion of the dataset sessions are transcribed and time stamped – full transcripts can be made available. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 16% landline, 84% mobile. | Marathi (India) conversational telephony | |
Dataset Text | Marathi (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 30,000 words | Add Dataset to Quote | mar_IND_PHON | Appen Global | Pronunciation Dictionary | Marathi | India | N/A | N/A | N/A | N/A | 30,000 | N/A | text | Marathi (India) Pronunciation Dictionary | ||
Dataset Location Data | Mobile Location Data | Common Use Cases: AI Platforms, Advertising and Marketing, Business Intelligence, Financial Modeling, FMCG, Footfall and Attribution, Healthcare, Human Mobility Insights, Location Analytics, OOH and DOOH, Retail Planning and Site Selection, Retail, Research and Academia, Smart Cities and Urban Planning, Supply Chain, Travel and Tourism, Transportation Planning and Logistics | Recording Device: Mobile device | Unit: 20 billion+ location events per day | Add Dataset to Quote | LOCATION_MOBILE_GLOBAL | Quadrant | Mobile GPS Location Data | N/A | Global coverage | N/A | N/A | N/A | N/A | N/A | N/A | CSV, XLS, JSON, Parquet | For enquiries relating to this data, please contact Tim Solt, Quadrant VP Sales (tim@quadrant.io), or book a meeting: https://meetings.hubspot.com/tim321. Quadrant (an Appen Company) is a global leader in the compilation and delivery of compliant, mobile (GPS) location data. Our global location data panel is of the highest authenticity and quality, allowing you to easily integrate and perform location-based activities to support your business initiatives and solve a myriad of real-world problems. The Quadrant location data panel contains 16 core metadata attributes, including all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and non-standard attributes such as Geohash and H3. Our historical data spans as far back as 2021, and data can be selected specific to your requirements (e.g., geography, timeframe, delivery cadence). Country or Region specific requests can be accommodated. Please book a meeting to discuss your requirements and obtain a sample dataset to evaluate for your unique use case. | Mobile Location Data | |
Dataset Text | Mongolian (Mongolia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 32,000 words | Add Dataset to Quote | mon_MNG_PHON | Appen Global | Pronunciation Dictionary | Mongolian | Mongolia | N/A | N/A | N/A | N/A | 32,000 | N/A | text | Mongolian (Mongolia) Pronunciation Dictionary | ||
Dataset Text | Norwegian (Norway) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 3,000 words | Add Dataset to Quote | nor_NOR_POS | Appen Global | Part of Speech Dictionary | Norwegian | Norway | N/A | N/A | N/A | N/A | 3,000 | N/A | text | Norwegian (Norway) Part of Speech Dictionary | ||
Dataset Text | Norwegian (Norway) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 117,000 words | Add Dataset to Quote | nor_NOR_PHON | Appen Global | Pronunciation Dictionary | Norwegian | Norway | N/A | N/A | N/A | N/A | 117,000 | N/A | text | Norwegian (Norway) Pronunciation Dictionary | ||
Dataset Image | Object Image Collection **text descriptions in development** | Common Use Cases: Image label recognition training, Accessibility, LLM image generation | Recording Device: Mobile phone and camera | Unit: 2000 images | Add Dataset to Quote | IMG_TAG_CN | Appen China | Image recognition | N/A | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | jpg | Multi-scene picture sample library of approximately 2000 images. English text descriptions in development. Categories: Airport: 65; Beach: 95; Car: 50; Clothing store: 53; Crowd: 67; Department store: 56; Desert: 73; Electrical equipment: 55; Gym: 47; Handbag: 35; KTV: 50; Market: 55; Mountain area: 54; Museum: 63; Night view: 132; Office: 100; Pet: 82; Playground: 94; Restaurant: 54; Sandbeach: 68; Scenic spot: 77; Sea: 191; Ship: 50; Sky: 102; Snow Mountain: 53; Snow scene: 71; Sports equipment: 54; Store: 34; Tree: 85; Window scenery: 62; Zoo: 70 | Object Image Collection **text descriptions in development** | |
Dataset Video | Object videos **in development** | Common Use Cases: Movement detection, Action Classification | Recording Device: Camera | Unit: 5500 videos | Add Dataset to Quote | VID_OBJECT_US | Appen Global | Movement recognition | N/A | United States | Mixed lighting conditions | Available upon request | 1 | N/A | N/A | N/A | mp4, mov | Approximately 6 hours of videos of various objects under different angles, distances and lighting conditions. Contributors selected from a list of ~150 everyday objects (e.g. dog, kettle, desk). Data has been collected and QA is underway; expected to be ready Q1 2025. Can be prioritized upon request. | Object videos **in development** | |
Dataset Text | Oriya (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 19,000 words | Add Dataset to Quote | ori_IND_PHON | Appen Global | Pronunciation Dictionary | Oriya | India | N/A | N/A | N/A | N/A | 19,000 | N/A | text | Oriya (India) Pronunciation Dictionary | ||
Dataset Audio | Panjabi (Pakistan) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 20 hours | Add Dataset to Quote | PAP_ASR001 | Appen Global | Conversational Speech | Panjabi | Pakistan | Low background noise | 205 | 2 | Available on request | 7,298 | 8 | alaw | Dataset is fully transcribed and time-stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For 71% of calls, both speakers (in-line/out-line) were collected and transcribed; for 29% of calls, only one half of the conversation was collected and transcribed. 20% landline, 80% mobile. | Panjabi (Pakistan) conversational telephony | |
Dataset Audio | Pashto (Afghanistan) broadcast | Common Use Cases: ASR, Automatic Captioning, Keyword Spotting | Recording Device: N/A | Unit: 51 hours | Add Dataset to Quote | PAS_BRC001 | Appen Global | Broadcast Speech | Northern Pashto – Southern Pashto | Afghanistan | Low background noise (studio) | N/A | 1 | Available on request | Available on request | 32 – 44 | wav | Dataset is fully transcribed and timestamped. Pronunciation lexicon not currently available but can be developed upon request. Dataset is largely speech only and does not include music or advertisements. Data types include: talk shows, interviews, news broadcasts (excluding news reading by anchors). | Pashto (Afghanistan) broadcast | |
Dataset Audio | Pashto (Afghanistan) conversational microphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Microphone | Unit: 39 hours | Add Dataset to Quote | PAS_ASR002 | Appen Global | Conversational Speech | Northern Pashto – Southern Pashto | Afghanistan | Low background noise | 40 | 2 | 34,860 | 9,480 | 16 | wav | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. A full translation of the transcripts into French is also available as an optional additional purchase. Average length of calls: 120 mins, where one speaker acts as an interviewer and the other as the interviewee for scenarios similar to TransTAC style (e.g. civil affairs, checkpoints etc.). The interviewer appears in more than one set of dialogues but the interviewee is unique for each set. | Pashto (Afghanistan) conversational microphone | |
Dataset Audio | Pashto (Afghanistan) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 55 hours | Add Dataset to Quote | PAS_ASR001 | Appen Global | Conversational Speech | Northern Pashto – Southern Pashto | Afghanistan | Low background noise | 967 | 2 | Available on request | 13,633 | 8 | wav | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For the majority of calls, both speakers (in-line/out-line) were collected and transcribed; however, for a smaller number of calls, only one half of the conversation was collected and transcribed. 25% landline, 75% mobile. | Pashto (Afghanistan) conversational telephony | |
Dataset Text | Pashto (Afghanistan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 64,000 words | Add Dataset to Quote | pus_AFG_PHON | Appen Global | Pronunciation Dictionary | Pashto | Afghanistan | N/A | N/A | N/A | N/A | 64,000 | N/A | text | Pashto (Afghanistan) Pronunciation Dictionary | ||
Dataset Text | Polish (Poland) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 4,000 words | Add Dataset to Quote | pol_POL_POS | Appen Global | Part of Speech Dictionary | Polish | Poland | N/A | N/A | N/A | N/A | 4,000 | N/A | text | Polish (Poland) Part of Speech Dictionary | ||
Dataset Text | Polish (Poland) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 42,000 words | Add Dataset to Quote | pol_POL_PHON | Appen Global | Pronunciation Dictionary | Polish | Poland | N/A | N/A | N/A | N/A | 42,000 | N/A | text | Polish (Poland) Pronunciation Dictionary | ||
Dataset Audio | Polish (Poland) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 25 hours | Add Dataset to Quote | POL_ASR001 | GlobalPhone | Scripted Speech | Polish | Poland | Mixed (quiet home/office, public, outdoor) | 99 | 1 | 10,130 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Polish (Poland) scripted microphone | |
Dataset Audio | Polish (Poland) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Mobile phone | Unit: 293 hours | Add Dataset to Quote | POL_ASR002_CN | Appen China | Scripted Speech | Polish | Poland | Low background noise (home/office) | 353 | 1 | 106,674 | 168,544 | 16 | wav | Dataset contains audio with corresponding text prompts | Polish (Poland) scripted smartphone | |
Dataset Audio | Polish (Poland) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 78 hours | Add Dataset to Quote | Polish SpeechDat(E) Database | Nuance | Scripted Speech | Polish | Poland | Low background noise | 1,000 | 1 | 48,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. | Polish (Poland) scripted telephony | |
Dataset Audio | Portuguese (Brazil) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 33 hours | Add Dataset to Quote | PTB_ASR002 | Appen Global | Conversational Speech | Portuguese | Brazil | Low background noise | 200 | 2 | 33,837 | 11,287 | 8 | alaw or wav | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 63% landline, 38% mobile. | Portuguese (Brazil) conversational telephony | |
Dataset Audio | Portuguese (Brazil) microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 26 hours | Add Dataset to Quote | PTB_ASR001 | GlobalPhone | Scripted Speech | Portuguese | Brazil | Mixed (quiet home/office, public, outdoor) | 102 | 1 | 10,417 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Portuguese (Brazil) microphone | |
Dataset Text | Portuguese (Brazil) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 98,000 words | Add Dataset to Quote | por_BRA_POS | Appen Global | Part of Speech Dictionary | Portuguese | Brazil | N/A | N/A | N/A | N/A | 98,000 | N/A | text | Portuguese (Brazil) Part of Speech Dictionary | ||
Dataset Text | Portuguese (Brazil) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 102,000 words | Add Dataset to Quote | por_BRA_PHON | Appen Global | Pronunciation Dictionary | Portuguese | Brazil | N/A | N/A | N/A | N/A | 102,000 | N/A | text | Portuguese (Brazil) Pronunciation Dictionary | ||
Dataset Audio | Portuguese (Portugal) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 36 hours | Add Dataset to Quote | PTP_ASR001 | Appen Global | Conversational Speech | Portuguese | Portugal | Low background noise | 200 | 2 | 36,586 | 16,339 | 8 | alaw or wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. | Portuguese (Portugal) conversational telephony | |
Dataset Text | Portuguese (Portugal) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 60,000 words | Add Dataset to Quote | por_PRT_POS | Appen Global | Part of Speech Dictionary | Portuguese | Portugal | N/A | N/A | N/A | N/A | 60,000 | N/A | text | Portuguese (Portugal) Part of Speech Dictionary | ||
Dataset Text | Portuguese (Portugal) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 112,000 words | Add Dataset to Quote | por_PRT_PHON | Appen Global | Pronunciation Dictionary | Portuguese | Portugal | N/A | N/A | N/A | N/A | 112,000 | N/A | text | Portuguese (Portugal) Pronunciation Dictionary | ||
Dataset Audio | Romanian (Romania) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 37 hours | Add Dataset to Quote | ROM_ASR001 | Appen Global | Conversational Speech | Romanian | Romania | Low background noise | 200 | 2 | Available on request | 16,658 | 8 | alaw | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 50% landline, 50% mobile. Conversations cover a range of topics including: Leisure, Work and Sport. | Romanian (Romania) conversational telephony | |
Dataset Text | Romanian (Romania) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 16,000 words | Add Dataset to Quote | ron_ROU_PHON | Appen Global | Pronunciation Dictionary | Romanian | Romania | N/A | N/A | N/A | N/A | 16,000 | N/A | text | Romanian (Romania) Pronunciation Dictionary | ||
Dataset Audio | Russian (Russia) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 37 hours | Add Dataset to Quote | RUS_ASR001 | Appen Global | Conversational Speech | Russian | Russia | Low background noise | 200 | 2 | Available on request | 28,284 | 8 | alaw or wav | Dataset is fully transcribed and timestamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers. 50% landline, 50% mobile. | Russian (Russia) conversational telephony | |
Dataset Text | Russian (Russia) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 100,000 words | Add Dataset to Quote | rus_RUS_POS | Appen Global | Part of Speech Dictionary | Russian | Russia | N/A | N/A | N/A | N/A | 100,000 | N/A | text | Russian (Russia) Part of Speech Dictionary | ||
Dataset Text | Russian (Russia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 120,000 words | Add Dataset to Quote | rus_RUS_PHON | Appen Global | Pronunciation Dictionary | Russian | Russia | N/A | N/A | N/A | N/A | 120,000 | N/A | text | Russian (Russia) Pronunciation Dictionary | ||
Dataset Audio | Russian (Russia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 31 hours | Add Dataset to Quote | RUS_ASR002 | GlobalPhone | Scripted Speech | Russian | Russia | Mixed (quiet home/office, public, outdoor) | 115 | 1 | 12,205 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple GlobalPhone languages or the full corpus. Dataset is fully transcribed and the transcription is available both in original script and in Romanized form. Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary. Developed in collaboration with the Karlsruhe Institute of Technology (KIT). | Russian (Russia) scripted microphone | |
Dataset Audio | Russian (Russia) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 46 hours | Add Dataset to Quote | Speecon Russian Database | Nuance | Scripted Speech | Russian | Russia | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers. | Russian (Russia) scripted microphone | |
Dataset Audio | Russian (Russia) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 180 hours | Add Dataset to Quote | Russian SpeechDat(E) Database | Nuance | Scripted Speech | Russian | Russia | Low background noise | 2,500 | 1 | 112,000 | Available on request | 8 | alaw | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. | Russian (Russia) scripted telephony | |
Dataset Audio | Russian + German Female TTS | Common Use Cases: TTS | Recording Device: microphone | Unit: 2.32 hours | Add Dataset to Quote | ED_TTS001_CN | Appen China | TTS Scripted Speech | Russian/German | Russia/Germany | Low background noise (studio) | 1 | 1 | Available upon request | Available upon request | 48 | wav | Audio with transcription. Female voice talent recorded in a professional studio on a Neumann U87 microphone; SNR of 40-50dB | Russian + German Female TTS | |
Dataset Text | Russian NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 29,888 sentences | Add Dataset to Quote | RUS_NER001 | Appen Global | News NER | Russian | Russia | N/A | N/A | N/A | 29,888 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Russian NER news text | |
Dataset Image, Video | Selfie image and video collection | Common Use Cases: Facial recognition, Human Body Movement recognition | Recording Device: Camera | Unit: 1400 sessions | Add Dataset to Quote | IMG_VID_SELFIE_US | Appen Global | Human Face | N/A | United States | Mixed lighting conditions | Available upon request | 1 | N/A | N/A | N/A | jpg, mp4, mov | Participants took a short video and picture of themselves following a prompt making various facial expressions under different conditions, e.g. “while blinking”, “while wearing a scarf” | Selfie image and video collection | |
Dataset Text | Serbian (Serbia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 29,000 words | Add Dataset to Quote | srp_SRB_PHON | Appen Global | Pronunciation Dictionary | Serbian | Serbia | N/A | N/A | N/A | N/A | 29,000 | N/A | text | Serbian (Serbia) Pronunciation Dictionary | ||
Dataset Audio | Shanghai dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Recording pen/microphone | Unit: 21 hours | Add Dataset to Quote | SHANGHAI_ASR001_CN | Appen China | Conversational Speech | Shanghai dialect | China | Low background noise | 51 | 1 | 16 | wav | Audio only, transcription in development for Q1 2025. Audio recordings cover the following districts: Shanghai Huangpu District, Xuhui District, Changning District, Jing'an District, Putuo District, Hongkou District, Yangpu District, Pudong New Area. Shanghai suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. | Shanghai dialect (China) Conversational Speech | |||
Dataset Audio | Shanghai dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 4.5 hours | Add Dataset to Quote | SHANGHAI_ASR002_CN | Appen China | Conversational Speech | Shanghai dialect | China | Low background noise | 14 | 1 | 8 | wav | Audio only, transcription in development for Q1 2025. Audio recordings cover the following districts: Shanghai Huangpu District, Xuhui District, Changning District, Jing'an District, Putuo District, Hongkou District, Yangpu District, Pudong New Area. Shanghai suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. | Shanghai dialect (China) Conversational Speech | |||
Dataset Audio | Slovak (Slovakia) scripted telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 65 hours | Add Dataset to Quote | Slovak SpeechDat(E) Database | Nuance | Scripted Speech | Slovak | Slovakia | Low background noise | 1,000 | 1 | 48,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. 48 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. | Slovak (Slovakia) scripted telephony | |
Dataset Text | Slovenian (Slovenia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 28,000 words | Add Dataset to Quote | slv_SVN_PHON | Appen Global | Pronunciation Dictionary | Slovenian | Slovenia | N/A | N/A | N/A | N/A | 28,000 | N/A | text | Slovenian (Slovenia) Pronunciation Dictionary | ||
Dataset Audio | Slovenian (Slovenia) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Landline only | Unit: 76 hours | Add Dataset to Quote | Slovenian SpeechDat(II) FDB-1000 | Nuance | Scripted Speech | Slovenian | Slovenia | Low background noise (home/office) | 1,000 | 1 | 40,000 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report. Approximately 40 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words. | Slovenian (Slovenia) telephony | |
Dataset Audio | Somali (Somalia) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 50 hours | Add Dataset to Quote | SOM_ASR001 | Appen Global | Conversational Speech | Somali | Somalia | Low background noise | 1,000 | 2 | Available on request | 23,217 | 8 | alaw | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. 1% landline, 99% mobile. | Somali (Somalia) conversational telephony | |
Dataset Text | Somali (Somalia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 76,000 words | Add Dataset to Quote | som_SOM_PHON | Appen Global | Pronunciation Dictionary | Somali | Somalia | N/A | N/A | N/A | N/A | 76,000 | N/A | text | Somali (Somalia) Pronunciation Dictionary | ||
Dataset Text | Sorani (Iraq) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 26,000 words | Add Dataset to Quote | kur_IRQ_PHON | Appen Global | Pronunciation Dictionary | Sorani | Iraq | N/A | N/A | N/A | N/A | 26,000 | N/A | text | Sorani (Iraq) Pronunciation Dictionary | ||
Dataset Audio | Sorani (Kurdish) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 5 hours | Add Dataset to Quote | SOR_ASR001 | Appen Global | Conversational Speech | Central Kurdish (Iran) | Iran | Low background noise | 170 | 2 | Available on request | 7,924 | 8 | alaw or wav | Dataset is fully transcribed and time stamped. Dataset is accompanied by a pronunciation lexicon containing all transcribed words. For a large proportion of calls, only one half of the conversation was collected and transcribed. | Sorani (Kurdish) conversational telephony | |
Dataset Text | Spanish (Argentina) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 15,000 words | Add Dataset to Quote | spa_ARG_PHON | Appen Global | Pronunciation Dictionary | Spanish | Argentina | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Spanish (Argentina) Pronunciation Dictionary | ||
Dataset Text | Spanish (Chile) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 15,000 words | Add Dataset to Quote | spa_CHL_PHON | Appen Global | Pronunciation Dictionary | Spanish | Chile | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Spanish (Chile) Pronunciation Dictionary | ||
Dataset Text | Spanish (Colombia) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 15,000 words | Add Dataset to Quote | spa_COL_PHON | Appen Global | Pronunciation Dictionary | Spanish | Colombia | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Spanish (Colombia) Pronunciation Dictionary | ||
Dataset Audio | Spanish (Latin America – Chile and Colombia) conversational telephony | Common Use Cases: ASR, Call Centre, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 22 hours | Add Dataset to Quote | ESL_ASR002 | Appen Global | Conversational Speech | Spanish | Chile – Colombia | Mixed | 84 | 2 | 22,098 | Available on request | 8 | wav | Dataset is fully transcribed and time-stamped. Pronunciation lexicon not currently available but can be developed upon request. Call centre style conversations (by 64 customers, 14 agents) in banking and telco domains, primarily using mobile phone. |
Spanish (Latin America – Chile and Colombia) conversational telephony | |
Dataset Audio | Spanish (Latin America) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 17 hours | Add Dataset to Quote | ESL_ASR001 | GlobalPhone | Scripted Speech | Spanish | Costa Rica | Mixed (quiet home/office, public, outdoor) | 100 | 1 | 6,898 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus Dataset is fully transcribed and the transcription is available both in original script and in Romanized form Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
Spanish (Latin America) scripted microphone | |
Dataset Text | Spanish (Peru) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 15,000 words | Add Dataset to Quote | spa_PER_PHON | Appen Global | Pronunciation Dictionary | Spanish | Peru | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Spanish (Peru) Pronunciation Dictionary | ||
Dataset Audio | Spanish (Spain) conversational smartphone | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 223 hours | Add Dataset to Quote | ESP_ASR003 | Appen Global | Conversational Speech | Spanish | Spain | Mixed (home, car, public place, outdoor) | 414 | 1 | Available on request | Available on request | 48 | wav | Dataset is fully transcribed and time stamped. Two-person conversations covering a broad range of generic topics including clothing, culture, education, finance, food, health, history, hospitality, insurance, media/entertainment, sports, travel/holiday, weather and work. Each speaker participates in up to 12 conversations that are 5-15 minutes long. Pronunciation lexicon not currently available but can be developed upon request. |
Spanish (Spain) conversational smartphone | |
Dataset Text | Spanish (Spain) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 100,000 words | Add Dataset to Quote | spa_ESP_PHON | Appen Global | Pronunciation Dictionary | Spanish | Spain | N/A | N/A | N/A | N/A | 100,000 | N/A | text | Spanish (Spain) Pronunciation Dictionary | ||
Dataset Audio | Spanish (Spain) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 39 hours | Add Dataset to Quote | ESP_ASR001 | Appen Global | Scripted Speech | Spanish | Spain | Mixed | 200 | 4 | 40,000 | 6,367 | 22 | raw PCM | Fully transcribed to SpeechDAT type conventions Dataset is accompanied by a pronunciation lexicon containing all transcribed words 200 prompts per speaker including 100 command and control type items and 100 phonetically rich sentences |
Spanish (Spain) scripted microphone | |
Dataset Audio | Spanish (Spain) scripted microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 46 hours | Add Dataset to Quote | Speecon Spanish Database | Nuance | Scripted Speech | Spanish | Spain | Mixed (office, entertainment, car, public place) | 600 (550 adult speakers and 50 child speakers) | 4 | 170,000 | Available on request | 16 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 290 prompts per adult speaker and 210 prompts per child speaker including digits, natural numbers, letter strings, personal, place and business names, application words for adult speakers, command (toy, phone and general) for child speakers, phonetically rich words and sentences and free and elicited spontaneous responses for adult speakers |
Spanish (Spain) scripted microphone | |
Dataset Audio | Spanish (Spain) TTS male scripted microphone | Common Use Cases: TTS | Recording Device: Microphone | Unit: 1 hour | Add Dataset to Quote | ESP_TTS001 | Appen Global | TTS Scripted Speech | Spanish | Spain | Low background noise (studio) | 1 | 1 | 1,787 | 3,614 | 22 | wav | Dataset is accompanied by a pronunciation lexicon containing all words spoken in the Dataset 1,787 prompts per speaker including phonetically rich sentences |
Spanish (Spain) TTS male scripted microphone | |
Dataset Text | Spanish (United States) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 90,000 words | Add Dataset to Quote | spa_USA_PHON | Appen Global | Pronunciation Dictionary | Spanish | United States | N/A | N/A | N/A | N/A | 90,000 | N/A | text | Spanish (United States) Pronunciation Dictionary | ||
Dataset Text | Spanish (Venezuela) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 15,000 words | Add Dataset to Quote | spa_VEN_PHON | Appen Global | Pronunciation Dictionary | Spanish | Venezuela | N/A | N/A | N/A | N/A | 15,000 | N/A | text | Spanish (Venezuela) Pronunciation Dictionary | ||
Dataset Text | Swahili (Kenya) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 66,000 words | Add Dataset to Quote | swa_KEN_PHON | Appen Global | Pronunciation Dictionary | Swahili | Kenya | N/A | N/A | N/A | N/A | 66,000 | N/A | text | Swahili (Kenya) Pronunciation Dictionary | ||
Dataset Text | Swedish (Sweden) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 105,000 words | Add Dataset to Quote | swe_SWE_POS | Appen Global | Part of Speech Dictionary | Swedish | Sweden | N/A | N/A | N/A | N/A | 105,000 | N/A | text | Swedish (Sweden) Part of Speech Dictionary | ||
Dataset Text | Swedish (Sweden) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 105,000 words | Add Dataset to Quote | swe_SWE_PHON | Appen Global | Pronunciation Dictionary | Swedish | Sweden | N/A | N/A | N/A | N/A | 105,000 | N/A | text | Swedish (Sweden) Pronunciation Dictionary | ||
Dataset Audio | Swedish (Sweden/ Finland) microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 30 hours | Add Dataset to Quote | SWE_ASR001 | GlobalPhone | Scripted Speech | Swedish | Sweden – Finland | Mixed (quiet home/office, public, outdoor) | 98 | 1 | 11,816 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus Dataset is fully transcribed and the transcription is available both in original script and in Romanized form Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
Swedish (Sweden/ Finland) microphone | |
Dataset Text | Sylheti (Bangladesh – India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 22,000 words | Add Dataset to Quote | syl_BGD_PHON | Appen Global | Pronunciation Dictionary | Sylheti | Bangladesh – India | N/A | N/A | N/A | N/A | 22,000 | N/A | text | Sylheti (Bangladesh – India) Pronunciation Dictionary | ||
Dataset Text | Tagalog (Philippines) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 34,000 words | Add Dataset to Quote | tgl_PHL_PHON | Appen Global | Pronunciation Dictionary | Tagalog | Philippines | N/A | N/A | N/A | N/A | 34,000 | N/A | text | Tagalog (Philippines) Pronunciation Dictionary | ||
Dataset Text | Tamil (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 106,000 words | Add Dataset to Quote | tam_IND_PHON | Appen Global | Pronunciation Dictionary | Tamil | India | N/A | N/A | N/A | N/A | 106,000 | N/A | text | Tamil (India) Pronunciation Dictionary | ||
Dataset Text | Telugu (India) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 51,000 words | Add Dataset to Quote | tel_IND_PHON | Appen Global | Pronunciation Dictionary | Telugu | India | N/A | N/A | N/A | N/A | 51,000 | N/A | text | Telugu (India) Pronunciation Dictionary | ||
Dataset Audio | Thai (Thailand) microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 28 hours | Add Dataset to Quote | THA_ASR001 | GlobalPhone | Scripted Speech | Thai | Thailand | Mixed (quiet home/office, public, outdoor) | 98 | 1 | 14,039 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus Dataset is fully transcribed and the transcription is available both in original script and in Romanized form Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
Thai (Thailand) microphone | |
Dataset Image | Thai (Thailand) printed text OCR | Common Use Cases: Document Processing, Document Search, Text detection | Recording Device: Camera | Unit: 1219 images | Add Dataset to Quote | IMG_OCR_THA_CN | Appen China | Document OCR | Thai | Thailand | Mixed lighting conditions | 10 | N/A | N/A | N/A | N/A | jpg | Images containing text, Shopping receipts / tickets / invoices / taxi slips, etc. | Thai (Thailand) printed text OCR | |
Dataset Text | Thai (Thailand) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 30,000 words | Add Dataset to Quote | tha_THA_PHON | Appen Global | Pronunciation Dictionary | Thai | Thailand | N/A | N/A | N/A | N/A | 30,000 | N/A | text | Thai (Thailand) Pronunciation Dictionary | ||
Dataset Text | Tok Pisin (Papua New Guinea) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 8,000 words | Add Dataset to Quote | tpi_PNG_PHON | Appen Global | Pronunciation Dictionary | Tok Pisin | Papua New Guinea | N/A | N/A | N/A | N/A | 8,000 | N/A | text | Tok Pisin (Papua New Guinea) Pronunciation Dictionary | ||
Dataset Audio | Turkish (Turkey) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 41 hours | Add Dataset to Quote | TUR_ASR001 | Appen Global | Conversational Speech | Turkish | Turkey | Low background noise | 200 | 2 | Available on request | 32,386 | 8 | alaw or wav | Dataset is fully transcribed and timestamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words 200 telephony conversations are recorded for this project – 100 speakers make 2 calls each (1 from landline, 1 from mobile) to a pool of 100 call receivers 48% landline, 52% mobile |
Turkish (Turkey) conversational telephony | |
Dataset Audio | Turkish (Turkey) microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 17 hours | Add Dataset to Quote | TUR_ASR002 | GlobalPhone | Scripted Speech | Turkish | Turkey | Mixed (quiet home/office, public, outdoor) | 100 | 1 | 6,950 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus Dataset is fully transcribed and the transcription is available both in original script and in Romanized form Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
Turkish (Turkey) microphone | |
Dataset Text | Turkish (Turkey) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 257,000 words | Add Dataset to Quote | tur_TUR_POS | Appen Global | Part of Speech Dictionary | Turkish | Turkey | N/A | N/A | N/A | N/A | 257,000 | N/A | text | Turkish (Turkey) Part of Speech Dictionary | ||
Dataset Text | Turkish (Turkey) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 255,000 words | Add Dataset to Quote | tur_TUR_PHON | Appen Global | Pronunciation Dictionary | Turkish | Turkey | N/A | N/A | N/A | N/A | 255,000 | N/A | text | Turkish (Turkey) Pronunciation Dictionary | ||
Dataset Audio | Turkish (Turkey) scripted smartphone | Common Use Cases: ASR, Virtual Assistant, Speech Analytics | Recording Device: Mobile phone | Unit: 738 hours | Add Dataset to Quote | TUR_ASR003_CN | Appen China | Scripted Speech | Turkish | Turkey | Low background noise (home/office) | 664 | 1 | N/A | N/A | 16 | wav | Audio with corresponding text prompts. Participants recorded on mobile phone reading aloud about 40 sentence prompts each. | Turkish (Turkey) scripted smartphone | |
Dataset Audio | Turkish (Turkey) telephony | Common Use Cases: ASR, Virtual Assistant | Recording Device: Mobile phone and landline | Unit: 118 hours | Add Dataset to Quote | OrienTel Turkish Database | Nuance | Scripted Speech | Turkish | Turkey | Low background noise | 1,700 | 1 | 76,500 | Available on request | 8 | Available on request | Dataset is fully transcribed to SpeechDAT type conventions and is accompanied by a pronunciation lexicon and validation report 45 prompts per speaker including digits, natural numbers, letter strings, personal, place and business names, confirmation items (yes, no + fuzzy), generic command and control items and phonetically rich sentences and words |
Turkish (Turkey) telephony | |
Dataset Text | Ukrainian (Ukraine) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 6,000 words | Add Dataset to Quote | ukr_UKR_PHON | Appen Global | Pronunciation Dictionary | Ukrainian | Ukraine | N/A | N/A | N/A | N/A | 6,000 | N/A | text | Ukrainian (Ukraine) Pronunciation Dictionary | ||
Dataset Location Data | United States Mobile Location Data | Common Use Cases: AI Platforms, Advertising and Marketing, Business Intelligence, Financial Modeling, FMCG, Footfall and Attribution, Healthcare, Human Mobility Insights, Location Analytics, OOH and DOOH, Retail Planning and Site Selection, Retail, Research and Academia, Smart Cities and Urban Planning, Supply Chain, Travel and Tourism, Transportation Planning and Logistics | Recording Device: Mobile device | Unit: 5 billion+ location events per day | Add Dataset to Quote | LOCATION_MOBILE_US | Quadrant | Mobile GPS Location Data | N/A | United States | N/A | N/A | N/A | N/A | N/A | N/A | CSV, XLS, JSON, Parquet | For enquiries relating to this data, please contact: Tim Solt, Quadrant VP Sales tim@quadrant.io Book a meeting: https://meetings.hubspot.com/tim321 Quadrant (an Appen Company) is a global leader in the compilation and delivery of compliant, mobile (GPS) location data. Our location data panel is of the highest authenticity and quality, allowing you to easily integrate and perform location-based activities to support your business initiatives and solve a myriad of real-world problems. The Quadrant location data panel contains 16 core metadata attributes, including all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and non-standard attributes such as Geohash and H3. Our historical data spans as far back as 2021, and data can be selected specific to your requirements (e.g., geography, timeframe, delivery cadence). State or Market specific requests can be accommodated. Please book a meeting to discuss your requirements and obtain a sample dataset to evaluate for your unique use case. |
United States Mobile Location Data | |
Dataset Audio | Urdu (India/ Pakistan) conversational telephony | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone and landline | Unit: 47 hours | Add Dataset to Quote | URD_ASR001 | Appen Global | Conversational Speech | Urdu | India – Pakistan | Mixed | 1,000 | 2 | 174,666 | 10,871 | 8 | wav | Dataset is fully transcribed and time stamped Dataset is accompanied by a pronunciation lexicon containing all transcribed words Environments: 9% Hands-free car, 7% Landline quiet, 34% mobile quiet, 29% public place, 16% roadside |
Urdu (India/ Pakistan) conversational telephony | |
Dataset Text | Urdu (Pakistan) Part of Speech Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | urd_PAK_POS | Appen Global | Part of Speech Dictionary | Urdu | Pakistan | N/A | N/A | N/A | N/A | 12,000 | N/A | text | Urdu (Pakistan) Part of Speech Dictionary | ||
Dataset Text | Urdu (Pakistan) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 21,000 words | Add Dataset to Quote | urd_PAK_PHON | Appen Global | Pronunciation Dictionary | Urdu | Pakistan | N/A | N/A | N/A | N/A | 21,000 | N/A | text | Urdu (Pakistan) Pronunciation Dictionary | ||
Dataset Text | Urdu NER news text | Common Use Cases: NER, Content Classification, Search Engines | Recording Device: N/A | Unit: 20,634 sentences | Add Dataset to Quote | URD_NER001 | Appen Global | News NER | Urdu | Pakistan | N/A | N/A | N/A | 20,634 | Available on request | N/A | text | News text corpora with entities tagged in XML format: Person, Title, Organization, Location, Geo-political entity, Facility, Religion, Nationality, Quantity | Urdu NER news text | |
Dataset Image | Vehicle tail light images | Common Use Cases: Image label recognition training | Recording Device: Mobile phone | Unit: 30793 images | Add Dataset to Quote | IMG_WD_CN | Appen China | Image recognition | N/A | N/A | Mixed lighting conditions | N/A | N/A | N/A | N/A | N/A | jpg | Images of vehicle tail lights, with right turn signal on (55%), left turn signal on (22%), both lights on (23%). License plates have been redacted. |
Vehicle tail light images | |
Dataset Audio | Vietnamese (Vietnam) microphone | Common Use Cases: ASR, Virtual Assistant, Chatbot | Recording Device: Microphone | Unit: 19 hours | Add Dataset to Quote | VIE_ASR001 | GlobalPhone | Scripted Speech | Vietnamese | Vietnam | Mixed (quiet home/office, public, outdoor) | 129 | 1 | 18,842 | Available on request | 16 | wav | Part of a multilingual corpus; tiered package prices available with purchase of multiple Global Phone languages or the full corpus Dataset is fully transcribed and the transcription is available both in original script and in Romanized form Each speaker reads a number of phonetically rich sentences selected from national newspaper articles available from the web to cover a wide domain with large vocabulary Developed in collaboration with the Karlsruhe Institute of Technology (KIT) |
Vietnamese (Vietnam) microphone | |
Dataset Text | Vietnamese (Vietnam) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 8,000 words | Add Dataset to Quote | vie_VNM_PHON | Appen Global | Pronunciation Dictionary | Vietnamese | Vietnam | N/A | N/A | N/A | N/A | 8,000 | N/A | text | Vietnamese (Vietnam) Pronunciation Dictionary | ||
Dataset Text | Wu (China) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 11,000 words | Add Dataset to Quote | wuu_CHN_PHON | Appen Global | Pronunciation Dictionary | Wu | China | N/A | N/A | N/A | N/A | 11,000 | N/A | text | Wu (China) Pronunciation Dictionary | ||
Dataset Audio | Wuhan dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Recording pen/microphone | Unit: 44.71 hours | Add Dataset to Quote | WUHAN_ASR001_CN | Appen China | Conversational Speech | Wuhan dialect | China | Low background noise | 135 | 1 | 16 | wav | Audio only; transcription in development for Q1 2025. Audio recordings cover 5 districts of Wuhan: Jiang'an, Jianghan, Qiaokou, Hanyang and Wuchang. Northeast suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Wuhan dialect (China) Conversational Speech | |||
Dataset Audio | Wuhan dialect (China) Conversational Speech | Common Use Cases: ASR, Conversational AI, Speech Analytics | Recording Device: Mobile phone | Unit: 58.6 hours | Add Dataset to Quote | WUHAN_ASR002_CN | Appen China | Conversational Speech | Wuhan dialect | China | Low background noise | 180 | 1 | 8 | wav | Audio only; transcription in development for Q1 2025. Audio recordings cover 5 districts of Wuhan: Jiang'an, Jianghan, Qiaokou, Hanyang and Wuchang. Northeast suburb accents not included, and no minors were recorded. Each recording session contains 20-30 minutes of free dialogue between 2-5 people. Sensitive data and personal information have been scrubbed. |
Wuhan dialect (China) Conversational Speech | |||
Dataset Text | Xiang (China) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 12,000 words | Add Dataset to Quote | hsn_CHN_PHON | Appen Global | Pronunciation Dictionary | Xiang | China | N/A | N/A | N/A | N/A | 12,000 | N/A | text | Xiang (China) Pronunciation Dictionary | ||
Dataset Text | Zulu (South Africa) Pronunciation Dictionary | Common Use Cases: ASR, TTS, Language Modelling | Recording Device: N/A | Unit: 77,000 words | Add Dataset to Quote | zul_ZAF_PHON | Appen Global | Pronunciation Dictionary | Zulu | South Africa | N/A | N/A | N/A | N/A | 77,000 | N/A | text | Zulu (South Africa) Pronunciation Dictionary |