  • The new DataDreamer v0.2.0 update expands its capabilities for generating synthetic data and pre-annotating real datasets. This version introduces SlimSAM for instance segmentation. Additionally, the Qwen 2.5 language model is now integrated as a prompt generator, boosting text data generation. With these updates, DataDreamer becomes even better suited for data preparation and annotation.

    Data Prepping (Annotation)

    Step 1 of training and deploying a custom ML model (to e.g. OAK cameras) is data preparation. This includes collecting, cleaning, and annotating data. DataDreamer simplifies this process by providing tools for auto-annotating images and generating synthetic data.

    In the video above you can see how to use DataDreamer to auto-annotate a video (a sequence of images). If you took every frame of the video and included them all in the dataset, you would end up with a lot of near-identical images, which isn't good for training a model, as it would encourage overfitting.

    That's why you'd usually keep only e.g. 1 frame per second, which we can accomplish easily with OpenCV (Gist here). Once we have the images in the frames/ folder, we can use DataDreamer to auto-annotate them.
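
    Here's a minimal sketch of that frame-extraction step (my own illustration rather than the Gist above; the input video filename is an assumption):

    import os
    import cv2

    VIDEO_PATH = "lemons.mp4"   # hypothetical input video
    OUT_DIR = "frames"          # folder DataDreamer will read from
    os.makedirs(OUT_DIR, exist_ok=True)

    cap = cv2.VideoCapture(VIDEO_PATH)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back to 30 if FPS is unknown
    step = int(round(fps))                 # keep roughly 1 frame per second

    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            cv2.imwrite(os.path.join(OUT_DIR, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    cap.release()
    print(f"Saved {saved} frames to {OUT_DIR}/")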

    I also cropped the video (yellow rectangle) so only the lemons are in the FOV (field of view), which helps with detection accuracy, as in my tests the annotator doesn't work well with smaller objects.
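
    Cropping itself is just array slicing in OpenCV/NumPy; for example (the rectangle coordinates below are made up and depend on where the lemons are in your footage, and in practice you'd apply the same crop to every extracted frame, or do it inside the extraction loop above):

    import cv2

    frame = cv2.imread("frames/frame_0000.jpg")
    # Hypothetical region of interest (x1, y1)-(x2, y2) around the lemons
    x1, y1, x2, y2 = 400, 200, 1200, 900
    cv2.imwrite("frames/frame_0000.jpg", frame[y1:y2, x1:x2])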

    DataDreamer

    datadreamer --save_dir frames --class_names "lemons" --task instance-segmentation \
    --annotator_size large --use_tta --image_annotator owlv2-slimsam \
    --conf_threshold 0.25  --annotate_only --annotation_iou_threshold 0.1 --device cuda

    Let's break down the command (all parameters are explained here):

    • --save_dir frames - the folder with the images to annotate
    • --class_names "lemons" - class names - I only want to detect lemons
    • --task instance-segmentation - we want instance segmentation, which includes bounding boxes (detection) and masks. You could change it to detection if you only want bounding boxes, or classification if you only want to classify images.
    • --annotator_size large - size of the annotator model. You can choose between base and large. large is more accurate, but slower.
    • --use_tta - use Test Time Augmentation. This applies rotations/flips/crops/scaling to the image and aggregates the predictions to make the outputs more robust. It gives better results, but is slower.
    • --image_annotator owlv2-slimsam - currently the only annotator that supports segmentation. owlv2 can be used for detection, and clip for image classification.
    • --conf_threshold 0.25 - confidence threshold. If the model is less than 25% sure that an object is a lemon, it won't annotate it.
    • --annotate_only - just annotate the images, don't generate new/synthetic data.
    • --annotation_iou_threshold 0.1 - IoU threshold used to suppress overlapping annotations.
    • --device cuda - use an NVIDIA GPU for inference. If you don't have a GPU, you can use cpu, but it will be much slower.

    After letting it run for a couple of minutes (or hours, depending on the number of images), you'll find the annotations in the frames/annotations.json file.

    erik@eriks-MacBook-Pro frames % ls -l
    total 140424
    -rw-r--r--  1 erik  staff  28571010 Nov 26 15:02 annotations.json
    -rw-r--r--  1 erik  staff    135045 Nov 26 14:56 frame_0000.jpg
    -rw-r--r--  1 erik  staff    132306 Nov 26 14:56 frame_0001.jpg
    -rw-r--r--  1 erik  staff    128611 Nov 26 14:56 frame_0002.jpg
    -rw-r--r--  1 erik  staff    130135 Nov 26 14:56 frame_0003.jpg
    -rw-r--r--  1 erik  staff    129604 Nov 26 14:56 frame_0004.jpg
    ...

    Annotations

    These are in a raw format, with detections (bounding boxes), masks, and labels for each image. One could also pass the --dataset_format yolo or --dataset_format coco argument to get the annotations in YOLO or COCO format.
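
    To get a feel for the raw format, you can peek at a single entry. The field names below are inferred from the visualization script in the next section, so treat this as a sketch rather than a format spec:

    import json

    with open("frames/annotations.json") as f:
        annotations = json.load(f)

    for image_name, data in sorted(annotations.items()):
        # Skip any non-image metadata entries, should they exist
        if not (isinstance(data, dict) and "boxes" in data):
            continue
        print(image_name)             # e.g. frame_0000.jpg
        print(data["boxes"][0])       # one [x_min, y_min, x_max, y_max] box
        print(data["labels"][0])      # class index into --class_names (0 == lemons)
        print(len(data["masks"][0]))  # number of [x, y] points in the mask polygon
        break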

    Visualizing Annotations

    To visualize the annotations (draw the mask/detection/label on the image), we'll be using the Supervision library, an open-source computer vision tool for easily visualizing and debugging computer vision models.

    import cv2
    import numpy as np
    import json
    import os
    import supervision as sv
    from supervision.detection.utils import polygon_to_mask
    
    # Path to your folder containing images and annotation JSON file
    image_folder = "frames"
    annotation_file = "frames/annotations.json"
    
    # Load annotations
    with open(annotation_file, 'r') as f:
        annotations = json.load(f)
    
    # Sort annotations by image_name
    annotations = {k: v for k, v in sorted(annotations.items(), key=lambda item: item[0])}
    
    yellow = sv.Color(255,240,0)
    label_annotator = sv.LabelAnnotator(color=yellow, text_color=sv.Color(0,0,0))
    polygon_annotator = sv.PolygonAnnotator(color=yellow, thickness=3)
    
    # Iterate through each image in the annotations
    for image_name, data in annotations.items():
        image_path = os.path.join(image_folder, image_name)
        image = cv2.imread(image_path)
    
        if image is None:
            print(f"Could not read image: {image_name}")
            continue
    
        # Extract bounding boxes and labels from image annotation
        boxes = np.array(data["boxes"], dtype=np.float32)  # Bounding boxes
        class_ids = np.array(data["labels"], dtype=np.int32)  # Class IDs
    
        if "masks" in data:
            masks = []
            for polygon in data["masks"]:
                new_mask = polygon_to_mask(
                    polygon=np.array(polygon, dtype=np.int32),
                    resolution_wh=(image.shape[1], image.shape[0])  # (width, height)
                )
                masks.append(new_mask.astype(bool))
            masks = np.array(masks, dtype=bool)
        else:
            masks = None
    
        # Create sv.Detections object
        if len(boxes) != 0:
            detections = sv.Detections(
                xyxy=boxes,
                class_id=class_ids,
                mask=masks
            )
    
            # Draw labels and polygons on the image
        labels = ["Lemon"] * len(data["labels"]) # Only 1 class anyway
        image = label_annotator.annotate(scene=image, detections=detections, labels=labels)
            image = polygon_annotator.annotate(scene=image, detections=detections)
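        # (Optional, not in the original post) you could also save the annotated
        # frame to disk, which is handy when running on a headless machine;
        # the output folder name is just an example:
        # os.makedirs("frames_annotated", exist_ok=True)
        # cv2.imwrite(os.path.join("frames_annotated", image_name), image)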
    
        # Display the annotated image
        cv2.imshow("Annotated Image", image)
        # Show each frame for ~33 ms; press q to quit early
        if cv2.waitKey(33) & 0xFF == ord('q'):
            break
    
    cv2.destroyAllWindows()

    OpenCV will open a window with the image and annotations, like in the image below. You can press q to close the window.


    In the next blog post, we'll cover how to train the model and then deploy it to OAK cameras. Stay tuned!

    Let us know if you have any questions/thoughts in the comments🙂
    - Erik

  • Evolution gave humans, and most other mammals, two eyes. Human eyes are typically spaced 6 to 7 centimeters (2.4 to 2.8 inches) apart. Remarkably, a substantial portion of the human brain is dedicated to processing visual information, with some studies suggesting a significant investment of the brain's resources in this area (source: University of Rochester). Our brain is able to perceive both how far away an object is and where that object is heading. When Luxonis set out to create an edge artificial intelligence (AI) depth camera, we used learnings from human eyesight to design our products. This is why the stereo baselines of our cameras are typically 7.5 centimeters (3 inches) apart and a sizable portion of our onboard processing is dedicated to fusing the left and right cameras together to create a three-dimensional map.

    At Luxonis, we are on a long-term mission of empowering companies to surpass human perception in every task. Achieving that mission requires the accuracy of our depth maps to be superior to human depth perception. Fortunately, we have long since exceeded this, because human beings are actually quite poor at estimating distances. If you look at your hands, do you really know whether they are currently 10, 12, or 14 inches away from your head? There is a reason we all have a tape measure (or laser, or smartphone) at home to help us determine distances.

    We have recently been working diligently to improve the accuracy of our depth maps, and we are now proud to announce that starting in March 2024, stereo cameras shipped from Luxonis will come with a next-generation calibration procedure that delivers much better accuracy than the old one. Fortunately, this improvement is not hardware-limited, so existing devices can be recalibrated as well.

    Here are the real-world (not theoretical) depth accuracy measurements that you'll be experiencing with our products come March (a quick worked example of what these percentages mean in absolute terms follows the list):

    • 0.7m - 4m (2’4” to 13’1”): below 2% absolute depth error

    • 4m - 7m (13’1” to 23’): below 4% absolute depth error

    • 7m - 12m (23’ to 39’4”): below 6% absolute depth error
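
    To make those percentages more tangible, here is a small illustrative calculation of the worst-case absolute error at the far edge of each range (my own arithmetic based on the figures above, not an official Luxonis specification):

    # Convert the relative depth-error bounds above into worst-case absolute error;
    # within each band the worst case is at its far edge.
    error_bands = [
        (0.7, 4.0, 0.02),   # 0.7 m - 4 m: below 2% error
        (4.0, 7.0, 0.04),   # 4 m - 7 m: below 4% error
        (7.0, 12.0, 0.06),  # 7 m - 12 m: below 6% error
    ]

    for near, far, rel_err in error_bands:
        print(f"{near} m - {far} m: up to about ±{far * rel_err * 100:.0f} cm at {far} m")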

    You can learn more about our improved performance here:

    https://docs.luxonis.com/projects/hardware/en/latest/pages/guides/depth_accuracy/#p-75mm-baseline-distance-oaks

    If you’d like to get your hands on an OAK with improved depth sooner, please reach out to us at support@luxonis.com.

    We'll be releasing competitive benchmark data on depth performance by the end of March.