DataDreamer: Creating Custom Datasets Made Easy!

NikitaSokovnin · Jan 18, 2024

Data: The Fuel of Modern AI Development

In the realm of modern AI, data stands as a pivotal element. Typically, when training a model for a specific purpose, data collection emerges as the most time-consuming phase. Imagine, however, the possibility of bypassing this step entirely, enabling the creation of a Computer Vision model without the need for real-world data. The only requirement? Simply the names of the objects you wish to detect or classify.

Consider a scenario where you need an application that detects robots in videos and images. DataDreamer streamlines this process, allowing you to generate thousands of annotated images with just a single command. This innovative approach not only saves time but also opens up new avenues for AI development, unbound by the constraints of traditional data collection methods.

datadreamer --class_names robot

Prompt: A photo of robot interacting with nature in a serene field. The bot seems to meditate & appreciate the beauty of the environment as it soaks up the suns rays.

Prompt: A photo of robot assisting a human in the kitchen, as they cook a meal together, showing the collaboration between man and machine.

Utilizing this dataset, you can efficiently train a compact model designed for Luxonis OAK cameras or other devices. This model is capable of detecting real robots in various real-world scenarios. In the following video, we demonstrate the performance of the model trained on a dataset of 2,000 images generated by DataDreamer.

DataDreamer: Creating Custom Datasets from scratch

DataDreamer is a library that empowers you to create custom datasets with virtually any class you can imagine, right from scratch. This process is streamlined into three key steps:

Prompt Generation: At this stage, we utilize the powerful Mistral-7B-Instruct-v0.1 to generate semantically rich prompts, which are crucial in accurately depicting your target objects in a generated image. For a more straightforward approach, we also offer the option of simply concatenating target objects.
Image Generation: The user has the choice between two image generators. First is Stable Diffusion XL, known for its adherence to prompts and superior image quality, albeit with a slower generation speed. The second option is SDXL-Turbo, which offers a quicker generation time but with a slight compromise in image fidelity.
Image Annotation: In the final step, we employ foundation models such as OWLv2 to annotate the generated images. The annotator is prompted with the class names provided at the outset, so each image is labelled with the classes you specified.

By integrating these advanced models, DataDreamer not only simplifies but also enhances the process of creating tailored datasets for diverse applications in the realm of computer vision.
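To make the simpler prompt option more concrete, here is a minimal Python sketch of prompt generation by concatenating target objects. The function name simple_prompts, the "A photo of ..." template and the sampling logic are assumptions made for illustration; they are not DataDreamer's actual implementation.

```python
import random


def simple_prompts(class_names, prompts_number, num_objects_range=(1, 3), seed=0):
    """Build prompts by concatenating randomly chosen target objects.

    Hypothetical helper for illustration; not DataDreamer's actual code.
    """
    rng = random.Random(seed)
    n = len(class_names)
    low, high = num_objects_range
    prompts = []
    for _ in range(prompts_number):
        # Pick between `low` and `high` distinct classes, capped by how many exist.
        count = rng.randint(min(low, n), min(high, n))
        objects = rng.sample(class_names, count)
        # Keep the chosen objects next to the prompt: the annotation step
        # needs the class names to label the generated image.
        prompts.append((objects, "A photo of " + ", ".join(objects)))
    return prompts


for objects, prompt in simple_prompts(["robot", "dog", "bicycle"], prompts_number=3):
    print(objects, "->", prompt)
```

Each prompt would then be passed to the image generator, and the stored object list tells the annotator which classes to look for in the resulting image.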

Real vs. DataDreamer dataset - performance comparison

To compare the performance of a model trained on DataDreamer-generated data with one trained on real-world data, we conducted an interesting experiment. We used the PASCAL VOC dataset, a well-known benchmark in object detection, as our basis for real data. From this, we created a comparable dataset using DataDreamer, targeting the same 20 classes present in the PASCAL VOC dataset.

The command we used for DataDreamer was as follows:

datadreamer --save_dir generated_dataset_voc_2k --class_names aeroplane bicycle bird boat bottle bus car cat chair cow dining\ table dog horse motorbike person potted\ plant sheep sofa train tv --prompts_number 2000 --prompt_generator lm --num_objects_range 1 3 --image_generator sdxl

This command generated a dataset from 2,000 prompts, with 1 to 3 objects per image, using the SDXL image generator for high-quality results. For the larger 17k dataset, we changed --prompts_number to 17000 (matching the number of images in the real dataset) and --image_generator to sdxl-turbo.
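The two configurations differ only in the number of prompts and the image generator, so they can be built from one template. The Python sketch below uses only the flags that appear in the command above. The helper name datadreamer_args and the save directory for the 17k run are assumptions, since the article does not give them.

```python
import shlex

VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "dining table", "dog", "horse", "motorbike", "person",
    "potted plant", "sheep", "sofa", "train", "tv",
]


def datadreamer_args(save_dir, prompts_number, image_generator):
    """Return the argument list for one dataset-generation run."""
    return [
        "datadreamer",
        "--save_dir", save_dir,
        "--class_names", *VOC_CLASSES,
        "--prompts_number", str(prompts_number),
        "--prompt_generator", "lm",
        "--num_objects_range", "1", "3",
        "--image_generator", image_generator,
    ]


small = datadreamer_args("generated_dataset_voc_2k", 2000, "sdxl")
# The 17k directory name below is a placeholder, not taken from the article.
large = datadreamer_args("generated_dataset_voc_17k", 17000, "sdxl-turbo")

# shlex.join quotes multi-word class names such as "dining table".
print(shlex.join(small))
print(shlex.join(large))
```

Keeping the arguments as a list avoids shell-quoting mistakes with multi-word class names; the list can be passed directly to subprocess.run(small, check=True) to launch a run.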

Below, we present some examples from the resulting DataDreamer dataset:

For comparison, here are annotated images from the original PASCAL VOC dataset:

To assess the effectiveness of using synthetic data in training computer vision models, we embarked on an experiment with two distinct training scenarios using the YOLOv8n model:

Training on Synthetic Data, then Finetuning on Real Data: In this approach, we first train the YOLOv8n model entirely on the synthetic dataset generated by DataDreamer. Once the model has learned from these generated images, we proceed to finetune it on real-world data from the PASCAL VOC dataset. This two-step process aims to see how well a model trained on synthetic data adapts to real-world images through finetuning.
Training Exclusively on Real Data: Here, the YOLOv8n model is trained solely using real-world images from the PASCAL VOC dataset. This traditional approach serves as a benchmark to compare the effectiveness of incorporating synthetic data in the training process.

By comparing these two scenarios, we aim to understand the impact of synthetic data on the model's learning capabilities and its performance in real-world scenarios. This comparison will shed light on the potential advantages of using synthetic datasets for initial training, especially in cases where collecting extensive real-world data is challenging or impractical.

Results

Real dataset size: 17k images (1% = 170 images, 10% = 1.7k images, 25% = 4.25k images). Performance is measured on real validation data.
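The subset sizes follow directly from the fractions of the real dataset. As a rough Python sketch, assuming the real training set holds exactly 17,000 images and that subsets are drawn uniformly at random (the article does not say how they were selected), the fine-tuning subsets could be sampled like this. The helper sample_fraction and the image IDs are placeholders.

```python
import random


def sample_fraction(items, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of `items`."""
    count = round(len(items) * fraction)
    return random.Random(seed).sample(items, count)


# Placeholder image IDs standing in for the 17k real training images.
real_images = [f"img_{i:05d}" for i in range(17_000)]

for fraction in (0.01, 0.10, 0.25):
    subset = sample_fraction(real_images, fraction)
    print(f"{fraction:.0%} of real data -> {len(subset)} images")
```

Fixing the seed keeps each subset identical across runs, so the synthetic-pretrained model and the real-only baseline would be compared on the same real images.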

Based on the results, it's evident that synthetic datasets generated by DataDreamer are particularly beneficial in scenarios where there's a scarcity of real annotated images, or none at all. This is a crucial insight for situations where gathering a large volume of real-world data is impractical or impossible.

The results also show that this advantage shrinks as the quantity of real data increases. Synthetic pretraining is most valuable in the initial training phases, when real data is scarce, and its impact diminishes as more real-world data becomes available for training.

Summary

By streamlining the dataset creation process, DataDreamer makes it not only accessible but also efficient for everyone – from seasoned data scientists to beginners in the field. It's a game-changer in data preparation, enabling users to quickly generate synthetic data, train initial models, and subsequently enhance these models with real-world data as it becomes available.

Future work

The team behind DataDreamer is committed to evolving and enhancing its capabilities to meet the growing demands and complexities of AI model training. The future roadmap for DataDreamer includes several exciting enhancements and additions:

Expanding Task Variety: We plan to integrate additional tasks like instance segmentation and keypoints detection. These advanced capabilities will allow for more nuanced and detailed data generation, catering to a wider range of AI applications.
Speeding Up Dataset Generation: A key focus will be on improving the efficiency of dataset generation. This enhancement will significantly reduce the time required to create large, diverse datasets, enabling faster model development cycles.
Model Updates and Additions: Continuous updates and additions to the models used in each step of the dataset generation process are planned. This will ensure that DataDreamer remains at the forefront of technology, utilizing the latest advancements in AI for superior dataset creation.
Feature Enhancements: We aim to add more sophisticated features to DataDreamer. These features will be designed to further reduce the reliance on real data, allowing for the training of robust models with minimal real-world datasets.

Through these improvements, DataDreamer will not only simplify the initial stages of model training but also push the boundaries of what can be achieved with synthetic data, making it an even more powerful tool in the field of AI development.

In conclusion, we invite the wider community to join us in this exciting journey. Your contributions, whether in the form of feedback, ideas, or direct involvement in development, are invaluable in shaping the future of DataDreamer. Together, we can redefine the landscape of AI model training. Let's collaborate to make DataDreamer not just a tool, but a community-driven catalyst for innovation in AI!

GitHub repository link: luxonis/datadreamer
Colab notebook link: https://colab.research.google.com/github/luxonis/datadreamer/blob/main/examples/generate_dataset_and_train_yolo.ipynb
