Public Datasets for Machine Learning

Datasets are an integral part of developing, testing, and running machine learning models. When domain-specific data is required, creating it is a time-consuming process. Public datasets can improve productivity by reducing the need to build that data from scratch.

In recent years, numerous organizations have created thousands of public datasets to help the industry move forward. Two of the most popular are ImageNet and MNIST. Public datasets are now available for areas such as image classification, facial recognition, weather, object detection, and many more.

These datasets can help in developing machine learning models that address problems like heart disease, droughts, diabetes, and poverty. However, it is necessary to understand the challenges they pose, including ethical ones. Take facial recognition, for instance: cataloging the faces of individuals in a public dataset can be an invasion of their privacy.

In the section below, you can find a list of twenty-five public datasets.

Name	Creator	Description
AWS	Many	Publicly Hosted
Google	Many	Publicly Hosted
Kaggle	Kaggle	Publicly Hosted
Microsoft	Many	Publicly Hosted
Notre Dame	Univ. of Notre Dame	3D Face
VisualData.io	VisualData.io	Computer vision

ACS	US Census	Detailed US demographics data
ApolloScape	Baidu	Autonomous driving
Berkeley DeepDrive	UC Berkeley	Video dataset
Data USA	Deloitte & others	Visualizations of US issues like jobs, skills…
Diabetes	UCI	Diabetes patient data
El Nino Dataset	UCI	Oceanographic and meteorological readings
FERET	DoD/NIST	Facial images for security and law enforcement
HAR Dataset	UCI	Human activity recognition – sitting, biking, standing…
Heart Disease	UCI	Individual data – age, sex, …
ImageNet	Stanford University	Image database
MovieLens	GroupLens	Movie ratings
Million Song	Kaggle	Music
Netflix Prize	Netflix	Movie ratings
Open Images	Google	Images
Overhead Imagery Research Data Set	ORID	Overhead imagery
SAT-4 Airborne Dataset	ASU	Landscape pictures
Serre Lab	Brown University	Human actions like smile, laugh, talk, smoke…
SIFT10M Data Set	UCI	SIFT features for evaluating nearest neighbor search methods
SpaceNet	SpaceNet	Precision-labeled, high-resolution satellite images