Public Datasets for Machine Learning

Datasets are an integral part of developing, testing, and running machine learning models. When domain-specific data is required, creating it from scratch is a time-consuming process. Public datasets can improve productivity by reducing the need to build that data yourself.

In recent years, numerous organizations have published thousands of public datasets to help the industry move forward. Two of the most popular are ImageNet and MNIST. Public datasets are now available for areas such as image classification, facial recognition, weather, object detection, and many more.

These datasets can help in developing machine learning models that address problems such as heart disease, droughts, diabetes, and poverty. However, it is necessary to understand the challenges they raise, including ethical ones. Take facial recognition, for instance: cataloging the faces of individuals in public spaces is an invasion of privacy.

The table below lists twenty-five public datasets.

| Name | Creator | Description |
|---|---|---|
| AWS | Many | Publicly hosted |
| Google | Many | Publicly hosted |
| Kaggle | Kaggle | Publicly hosted |
| Microsoft | Many | Publicly hosted |
| Notre Dame | Univ. of Notre Dame | 3D face |
| VisualData.io | VisualData.io | Computer vision |
| ACS | US Census | Detailed US demographics data |
| ApolloScape | Baidu | Autonomous driving |
| Berkeley DeepDrive | UC Berkeley | Video dataset |
| Data USA | Deloitte & others | Visualizations of US issues like jobs, skills... |
| Diabetes | UCI | Diabetes patient data |
| El Nino Dataset | UCI | Oceanographic and meteorological readings |
| FERET | DoD/NIST | Security and law enforcement |
| HAR Dataset | UCI | Human activity recognition: sitting, biking, standing... |
| Heart Disease | UCI | Individual data: age, sex, ... |
| ImageNet | Stanford University | Image database |
| MovieLens | GroupLens | Movie ratings |
| Million Song | Kaggle | Music |
| Netflix Prize | Netflix | Movie ratings |
| Open Images | Google | Images |
| Overhead Imagery Research Data Set | ORID | Overhead imagery |
| SAT-4 Airborne Dataset | ASU | Landscape pictures |
| Serre Lab | Brown University | Human actions like smile, laugh, talk, smoke... |
| SIFT10M Data Set | UCI | Nearest-neighbor search |
| SpaceNet | SpaceNet | Precision-labeled, high-resolution satellite images |
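
Many tabular datasets of the kind listed above are distributed as plain comma-separated text, often without a header row and with a placeholder for missing values. The sketch below shows one way such a file might be read using only the Python standard library. The column names, the "?" missing-value marker, and the sample rows are assumptions made for illustration, not the schema of any particular dataset; check the documentation that ships with the dataset you download.

```python
import csv
import io

# Hypothetical column names for illustration only; consult the dataset's
# own documentation for its real schema.
COLUMNS = ["age", "sex", "resting_bp", "cholesterol"]


def parse_rows(text, columns=COLUMNS, missing="?"):
    """Parse headerless comma-separated text into a list of dicts.

    Fields equal to `missing` become None; all others are converted to float.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # csv.reader yields [] for blank lines; skip them
            continue
        if len(record) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(record)}"
            )
        values = [
            None if field.strip() == missing else float(field)
            for field in record
        ]
        rows.append(dict(zip(columns, values)))
    return rows


def column_mean(rows, name):
    """Mean of one column, ignoring missing (None) values."""
    present = [row[name] for row in rows if row[name] is not None]
    return sum(present) / len(present) if present else None


if __name__ == "__main__":
    # Made-up sample rows, not taken from any real dataset.
    sample = "63,1,145,233\n67,1,160,?\n41,0,130,204\n"
    rows = parse_rows(sample)
    print(len(rows), column_mean(rows, "cholesterol"))
```

Keeping missing values as None, rather than silently dropping the row or substituting zero, leaves the decision about how to handle them to the modeling step.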
