Machine Learning Datasets to build your own projects
The best machine learning data sets and their corresponding repositories on a single page!
One of the hardest problems in Machine Learning is finding data that suits the project or application that we want to build. If this field has one weakness, it is that without data we can't do anything.
This category is dedicated to providing various resources where data sets can be obtained, so that you can start building your own projects or using pre-made data sets to learn!
UCI Machine Learning Repository (University of California, Irvine)
The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, has an amazing repository of data sets divided into different categories. This repository, known as the UCI Machine Learning Repository, allows you to search for data sets by Machine Learning problem, such as classification, regression, clustering, or time series analysis. Each data set is also categorised according to the nature of its data: from financial to medical, business, or game data sets.
It is a great resource for finding data sets to play around with and for building projects that improve your data science and Machine Learning knowledge, as each data set is described and its provenance is clarified. Enjoy the UCI Dataset Repo!
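Many small classification data sets are distributed as plain comma-separated text with the class label in the last column. As a minimal sketch of getting started with such a file, the snippet below parses a few inline rows using only the Python standard library. The sample rows, the column layout, and the helper names `load_rows` and `count_labels` are assumptions made for illustration; check the description page of the data set you download for its real format.

```python
import csv
import io

# Hypothetical sample: a few header-less rows in the style of a small
# classification data set (four numeric features, then a class label).
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def load_rows(text):
    """Parse header-less CSV text into (features, label) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        *values, label = record  # last column is assumed to be the label
        rows.append(([float(v) for v in values], label))
    return rows


def count_labels(rows):
    """Count how many examples each class label has."""
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return counts


rows = load_rows(SAMPLE)
label_counts = count_labels(rows)
```

For a real file, you would pass the contents of the downloaded file instead of `SAMPLE`. Counting the labels first is a quick way to spot class imbalance before training anything.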
Google Dataset Search
Google recently put at our disposal a huge search engine dedicated to finding data sets. Its results, much like those of the normal Google search, can be filtered to match our needs: you can look only for open data sets, pick the format you want, choose the topic, and much more.
Each result includes a description of the data set, along with references to previous works where it has been used, and various links to the possible download options. This search engine is supposed to grant access to the widest collection of data in the world: 25M data sets. Google does this by using an indexing scheme similar to the one its normal search uses, so it does not curate the data sets in any way. You can find more information on this tool on Google's blog.
Video game, Anime, and Manga Data Sets
LionBridge has put together a very cool repository/post of manga, anime, and video game data sets for Machine Learning. There are 25 data sets in this repository that are fun, briefly described in the article, and mostly in English.
There are image data sets, review data sets, and game genre classification data sets with descriptions of the games, titles, and other cool information. As a lot of these data sets contain unstructured data, they might be harder to use than a normal table-like data set. However, we strongly encourage you to take a look if you are a bit nerdy like us!
Kaggle Data Sets
Kaggle is one of the best-known resources for fetching all kinds of data sets. The famous website, known for organising Machine Learning challenges and competitions, has an extensive catalogue of data sets for all kinds of uses. Although most of these data sets were initially offered as the data for some challenge, they have since become freely available to everybody.
This resource has more than 30K public data sets that are fully described. Despite sometimes overlapping with other repositories like the UCI repo described above, it is one of the best resources nowadays to look for fresh, useful data.
Visualdata Image Datasets
Visualdata.io is a website that has collected about 500 fantastic data sets for computer vision and image recognition. It is a very good place if you are looking for high-quality classified images to play around with.
Again, it is not a resource for beginners, as these computer vision tasks are probably more challenging than standard Machine Learning problems. But if you are already experienced in the field and are looking for data to build your own project using images, this is a very good resource.
If you want to really exploit these Data Sets, go take a look at our Computer Vision Tutorials.
data.world
data.world aims to be a broad, collaborative platform for sharing, discovering, managing, and understanding data sets. It works a bit like a social network, and it is still growing, but the idea is promising and useful. It can be used by both individuals and organisations, and it is oriented towards research and business!
Anyone can join forces to collaborate and solve problems together! You can find the documentation and getting started guide here.
Google Open Images Datasets
The Open Images Dataset is an image dataset repository by Google Open Source with images and labels for all kinds of problems: image classification, object detection (problems with bounding boxes), and object segmentation (problems with bounding boxes and masks).
These annotated images are perfect for trying state of the art computer vision algorithms like YOLO, implementing your own applications, and even using these images for data augmentation. Go take a look!
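Bounding box annotations are often stored as coordinates normalised to the range 0 to 1, so that they do not depend on the image resolution. As a rough sketch of how such an annotation could be turned into pixel coordinates before drawing it or cropping the image, consider the snippet below. The `NormalizedBox` class, its field names, and the example values are hypothetical and are not the repository's actual file format, so check the dataset documentation for the real column names and conventions.

```python
from dataclasses import dataclass


@dataclass
class NormalizedBox:
    """Hypothetical annotation: a label plus box edges as fractions of the image size."""
    label: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def to_pixels(box, width, height):
    """Return (left, top, right, bottom) in whole pixels for an image of the given size."""
    return (
        round(box.x_min * width),
        round(box.y_min * height),
        round(box.x_max * width),
        round(box.y_max * height),
    )


# Made-up example values, assuming coordinates are normalised to [0, 1].
box = NormalizedBox(label="Cat", x_min=0.25, x_max=0.75, y_min=0.10, y_max=0.50)
pixels = to_pixels(box, width=640, height=480)
```

Keeping the normalised values in the annotation and converting only when needed means the same box stays valid if the image is resized.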
Awesome GitHub Dataset Repository
We've already spoken about the Awesome lists in our Tutorials section. They are great.
This GitHub repository contains datasets divided by category: from agriculture and biology to Natural Language Processing, Social Networks, or Computer Vision/Image Processing.
If you are looking to start building your own applications using public data, this is definitely the place to go!
Papers with Code Datasets
Our lovely Papers with Code, which we have in our Other Resources section, has created a whole new category with tons of datasets (3208 at the time this page was last updated) that are classified into many different categories.
They can be filtered by modality (images, text, video, audio, and so on), by task (Object Detection, Semantic Segmentation, Sentiment Analysis, etc.), and by the language of the dataset. Check it out: there is a lot of material here for getting cool Machine Learning models up and running with this free public data.
That is it! We hope you enjoyed our category of data sets for Machine Learning projects. These resources will allow you to play around with real data and start building cool projects and applications. The best known of them all is the UC Irvine Machine Learning Repository, but the rest are also great resources, so check them out!
Thank you for reading How to Learn Machine Learning. We hope you enjoyed this repository of Machine Learning datasets, and have a fantastic day!