Machine Learning Datasets to build your own projects.
The best machine learning datasets in one single page!
One of the hardest problems in Machine Learning is finding data that suits the project or application that we want to build. If this field has one weakness, it is that without data we can’t do anything.
This category is dedicated to providing various resources where data sets can be obtained, so that you can start building your own projects or using pre-made data sets to learn!
University of California, Irvine (UCI) Machine Learning Repository
The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, has an amazing repository of data sets divided into different categories. You can search for specific Machine Learning problems like classification, regression, clustering, or time series analysis. Also, each of the data sets is categorised according to the nature of its data: from financial to medical, business, or game data sets.
It is a great resource to find data sets to play around with and build projects that improve your data science and Machine Learning knowledge, as each data set is described and its provenance clarified. Enjoy it!
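Many tabular data sets of this kind are distributed as plain comma-separated files without a header row. Below is a minimal sketch of reading such a file with Python's standard library. It assumes a layout of four numeric features followed by a class label, in the style of the classic Iris data; the column names and the `load_rows` helper are our own illustrative choices, and the sample rows are inlined rather than downloaded from the repository.

```python
import csv
import io

# A few rows in the headerless, comma-separated style of the Iris data.
# Inlined here so the sketch runs without any download.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumed column names: headerless files do not carry them,
# so the reader has to supply them from the data set description.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]


def load_rows(text, columns):
    """Parse headerless CSV text into (features, label) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        if len(record) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(record)}")
        *features, label = record  # last field is the class label
        rows.append(([float(value) for value in features], label))
    return rows


rows = load_rows(SAMPLE, COLUMNS)
labels = sorted({label for _, label in rows})
print(len(rows), "rows,", len(labels), "classes")
```

For a real file you would pass the file's contents instead of `SAMPLE`, and take the column names from the description page that accompanies each data set.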
Google Dataset Search
Google recently put at our disposal a huge search engine dedicated to finding data sets. Like the normal Google search, it can be filtered to match our needs: you can restrict results to open data sets, choose the format you want, the topic, and much more.
Each result includes a description of the data set, along with references to previous works where it has been used, and various links to the possible download options. This search engine is meant to grant access to the widest collection of data in the world: 25M data sets. Google does this by using an indexing scheme similar to the one behind its normal search engine, so it does not curate the data sets in any way. You can find more information on this tool on Google’s blog.
Video game, Anime, and Manga Data Sets
LionBridge has put together a very cool repository/post of manga, anime, and video game data sets for Machine Learning. There are 25 data sets in this repository that are fun, briefly described in the article, and mostly in English.
There are image data sets, review data sets, and data sets for game genre classification with descriptions of the games, titles, and other cool information. As a lot of these data sets contain unstructured data, they might be harder to use than a normal table-like data set. However, we strongly encourage you to take a look if you are a bit nerdy like us!
Kaggle Data Sets
Kaggle is one of the best-known resources for fetching all kinds of data sets. The famous website, known for organising Machine Learning challenges and competitions, has an extensive catalogue of data sets for all kinds of uses. Although most of these data sets were initially offered as the data for some challenge, they have since become freely available to everybody.
This resource has more than 30K public data sets that are fully described. Although it sometimes overlaps with other repositories, like the UCI repo described above, it is one of the best places nowadays to look for fresh, useful data.
VisualData Image Datasets
Visualdata.io is a website that has collected about 500 fantastic data sets for computer vision and image recognition. It is a very good place if you are looking for high-quality classified images to play around with.
Again, it is not a resource for beginners, as these computer vision tasks are probably more challenging than standard Machine Learning problems. But if you are already experienced in the field and are looking for data to build your own project using images, this is a very good resource.
If you want to really exploit these Data Sets, go take a look at our Computer Vision Tutorials.
data.world
data.world aims to be a collaborative, abundant, and wide platform for sharing, discovering, managing, and understanding data sets. It works a bit like a social network, and it is still growing, but the idea is promising and useful. It can be used by both individuals and organisations, and it is oriented towards research and business!
Anyone can join forces to collaborate and solve problems together! You can find the documentation and getting started guide on their site.
Google Open Images Dataset
The Open Images Dataset is an image dataset repository by Google Open Source with images and labels for all kinds of problems: image classification, object detection (problems with bounding boxes), and object segmentation (problems with bounding boxes and masks).
These annotated images are perfect for trying state-of-the-art computer vision algorithms like YOLO, implementing your own applications, and even serving as source material for data augmentation. Go take a look!
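Before feeding bounding box annotations to a detector, you usually need them in pixel coordinates. Here is a minimal sketch, assuming the annotations store box corners as fractions of the image width and height (to our knowledge the Open Images box files use XMin, XMax, YMin, YMax columns in this style, but check the documentation of the version you download). The `NormalizedBox` class and `to_pixels` function are hypothetical names of our own, not part of any Open Images tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedBox:
    """Box corners as fractions of image size, each assumed to be in [0, 1]."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def to_pixels(box, width, height):
    """Return (left, top, right, bottom) in integer pixel coordinates."""
    for value in (box.x_min, box.x_max, box.y_min, box.y_max):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"normalized coordinate out of range: {value}")
    if box.x_min > box.x_max or box.y_min > box.y_max:
        raise ValueError("box minimum exceeds its maximum")
    # Scale x by the image width and y by the image height, then round.
    left = round(box.x_min * width)
    right = round(box.x_max * width)
    top = round(box.y_min * height)
    bottom = round(box.y_max * height)
    return left, top, right, bottom


# Example: a box covering the middle of a 640x480 image.
example = NormalizedBox(x_min=0.25, x_max=0.75, y_min=0.1, y_max=0.5)
pixels = to_pixels(example, width=640, height=480)
print(pixels)
```

Keeping the normalized form around is handy for data augmentation: if you resize an image, the same fractions still apply and only the call to `to_pixels` needs the new width and height.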