Dataset Analysis

Tools for reviewing your dataset structure and composition

You can access the dataset analysis tools as follows.

from lexset.review import analysis

dir_path = "D:/<PATH TO DATASET>/"

# Create an instance of the 'analysis' class
sample_data = analysis(dir_path)

These functions analyze the synthetic datasets you generate and download through our Seahaven platform. Specifically, they process the coco_annotations.json file in your dataset directory. They can also be used to analyze any other COCO JSON file.

Datasets derived outside Seahaven:

If you use this utility to analyze datasets generated outside the Seahaven platform, format your annotations as a single COCO JSON file, similar to the one provided with datasets downloaded from Seahaven. Additionally, all of your images must be in one directory, and that directory should contain only the images you wish to include in the analysis.

Sample dataset directory:

DATASET_NAME/
├── coco_annotations.json
├── images001.png
├── images002.png
├── images003.png
└── etc...

// SAMPLE COCO FILE

{
  "info": {
    "description": "Example Dataset",
    "version": "1.0",
    "year": 2023,
    "contributor": "Your Name",
    "date_created": "2023-08-31"
  },
  "licenses": [
    {
      "id": 1,
      "name": "License Type",
      "url": "http://www.___.com/"
    }
  ],
  "images": [
    {
      "id": 1,
      "width": 640,
      "height": 480,
      "file_name": "image1.jpg",
      "license": 1,
      "date_captured": "2023-08-31"
    }
    // Additional image entries...
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "segmentation": [[ /* polygon */]],
      "area": /* area */,
      "bbox": [ /* bounding box */],
      "iscrowd": 0
    }
    // Additional annotation entries...
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "human"
    }
    // Additional category entries...
  ]
}
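
If you are preparing a COCO file outside Seahaven, a quick structural check can catch problems before you run the analysis. The sketch below is not part of lexset.review; the function name check_coco and the choice of required sections (images, annotations, categories) are assumptions made for illustration. Note that the // and /* */ comments in the sample above are there for the reader only; strict JSON parsers reject them.

```python
import json

# Assumption: these are the sections an analysis of bounding boxes needs.
REQUIRED_KEYS = ("images", "annotations", "categories")


def check_coco(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in coco]
    if problems:
        return problems

    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        # COCO bounding boxes are [x, y, width, height].
        if len(ann.get("bbox", [])) != 4:
            problems.append(f"annotation {ann['id']}: bbox must be [x, y, width, height]")
    return problems


# Tiny in-memory example; for a real file, read it with json.load() instead.
example = json.loads("""
{
  "images": [{"id": 1, "width": 640, "height": 480, "file_name": "image1.jpg"}],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50], "area": 5000, "iscrowd": 0},
    {"id": 2, "image_id": 7, "category_id": 1, "bbox": [0, 0, 10, 10], "area": 100, "iscrowd": 0}
  ],
  "categories": [{"id": 1, "name": "person", "supercategory": "human"}]
}
""")
print(check_coco(example))
```

In this example the second annotation points at an image id that does not exist, so the check reports it.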

Spatial Analysis:

Performs spatial analysis on a dataset containing bounding box annotations of objects. The goal is to create heatmaps representing the distribution of object centers across different categories within the dataset.

sample_data.spatial_analysis()

Sample output: The plot contains heatmaps for each category, representing the density distribution of object centers within bounding boxes for that category. The heatmap is created by calculating a 2D histogram of the object centers. This histogram counts the number of object centers that fall into each bin on the heatmap.

Arguments:

Bins: int, optional (default=50): This determines the number of equally spaced intervals over the range of the data. The more bins you use, the finer the granularity of the distribution representation. However, too many bins leave few samples in each bin, which exaggerates minor fluctuations. Choose this parameter according to the granularity of analysis required.
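
To make the idea of a 2D histogram of object centers concrete, here is a minimal plain-Python sketch. It is not the library's implementation: the function name center_histogram is hypothetical, and it assumes every image has the same size and that boxes use the COCO [x, y, width, height] layout.

```python
def center_histogram(boxes, image_width, image_height, bins=50):
    """Count bounding-box centers per cell of a bins x bins grid.

    boxes: iterable of COCO-style [x, y, width, height] boxes.
    Returns a list of rows (grid[row][col]); row 0 is the top of the image.
    """
    grid = [[0] * bins for _ in range(bins)]
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        # Map the center to a cell index, clamping points on the far edge.
        col = min(int(cx / image_width * bins), bins - 1)
        row = min(int(cy / image_height * bins), bins - 1)
        grid[row][col] += 1
    return grid


boxes = [[0, 0, 100, 100], [10, 10, 80, 80], [540, 380, 100, 100]]
grid = center_histogram(boxes, image_width=640, image_height=480, bins=4)
for row in grid:
    print(row)
```

With only 4 bins per axis, the two boxes centered near the top-left corner land in the same cell, and the third lands in the bottom-right cell. A larger bin count would separate nearby centers into different cells.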

Class Distribution:

Analyzes a dataset containing object annotations and generates a bar plot representing the distribution of different classes (categories) present in the dataset.

sample_data.class_distribution()
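
The numbers behind such a bar plot are simply annotation counts per category. A minimal sketch of that counting step, assuming a COCO dictionary already loaded in memory (the helper name class_counts is hypothetical and not part of lexset.review):

```python
from collections import Counter


def class_counts(coco):
    """Map each category name to its number of annotations."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "car"}],
    "annotations": [
        {"id": 1, "category_id": 1},
        {"id": 2, "category_id": 2},
        {"id": 3, "category_id": 1},
    ],
}
counts = class_counts(coco)
print(counts.most_common())
```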

Relative Scale:

Performs relative size analysis on a dataset containing object annotations with bounding box information. The goal is to analyze the relative size of objects within each category and visualize the distribution of relative sizes using histograms.

Arguments:

Bins: int, optional (default=50): This determines the number of equally spaced intervals over the range of the data. The more bins you use, the finer the granularity of the distribution representation. However, too many bins leave few samples in each bin, which exaggerates minor fluctuations. Choose this parameter according to the granularity of analysis required.

sample_data.relative_scale()
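
This page does not define "relative size" precisely. One common reading, used here purely as an assumption, is the bounding box area divided by the area of the image it belongs to. The sketch below groups that ratio by category; the helper name relative_sizes is hypothetical.

```python
from collections import defaultdict


def relative_sizes(coco):
    """Group bbox_area / image_area by category name (assumed definition)."""
    image_area = {img["id"]: img["width"] * img["height"] for img in coco["images"]}
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    sizes = defaultdict(list)
    for ann in coco["annotations"]:
        _, _, w, h = ann["bbox"]  # COCO bbox: [x, y, width, height]
        sizes[names[ann["category_id"]]].append(w * h / image_area[ann["image_id"]])
    return dict(sizes)


coco = {
    "images": [{"id": 1, "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 0, 320, 240]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [0, 0, 64, 48]},
    ],
}
sizes = relative_sizes(coco)
print(sizes)
```

Under this definition a box that spans half the width and half the height of its image has a relative size of 0.25.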

Bounding Box Areas:

Analyzes a dataset containing object annotations with bounding box information. The goal is to compute the bounding box areas for each object category and visualize the distribution of these areas using histograms.

Arguments:

Bins: int, optional (default=50): This determines the number of equally spaced intervals over the range of the data. The more bins you use, the finer the granularity of the distribution representation. However, too many bins leave few samples in each bin, which exaggerates minor fluctuations. Choose this parameter according to the granularity of analysis required.

sample_data.bounding_box_areas()
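
To illustrate what the bin count does to a histogram of areas, the following sketch computes bounding box areas and sorts them into equally spaced intervals. It is a plain-Python illustration under stated assumptions, not the library's code; the helper name equal_width_histogram is hypothetical.

```python
def equal_width_histogram(values, bins=50):
    """Count values in `bins` equally spaced intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Areas (width * height) from COCO-style [x, y, width, height] boxes.
boxes = [[0, 0, 10, 10], [5, 5, 20, 10], [0, 0, 30, 30], [2, 2, 10, 12]]
areas = [w * h for _, _, w, h in boxes]
coarse = equal_width_histogram(areas, bins=2)
fine = equal_width_histogram(areas, bins=4)
print(areas)
print(coarse)
print(fine)
```

The same four areas produce a different picture depending on the bin count: more bins show where the gap between small and large boxes lies, but each bin holds fewer samples.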

Aspect Ratio Distribution:

Performs aspect ratio analysis on a dataset containing object annotations with bounding box information. The goal is to calculate the aspect ratio of bounding boxes for each object category and visualize the distribution of these aspect ratios using histograms.

Arguments:

Bins: int, optional (default=50): This determines the number of equally spaced intervals over the range of the data. The more bins you use, the finer the granularity of the distribution representation. However, too many bins leave few samples in each bin, which exaggerates minor fluctuations. Choose this parameter according to the granularity of analysis required.

sample_data.aspect_ratio_distribution()
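
As a rough illustration of the quantity being plotted, the sketch below computes width divided by height for each bounding box and groups the results by category. Treating aspect ratio as width / height is an assumption (the page does not state which way the ratio is taken), and the helper name aspect_ratios is hypothetical.

```python
from collections import defaultdict


def aspect_ratios(coco):
    """Group bounding-box width / height by category name."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    ratios = defaultdict(list)
    for ann in coco["annotations"]:
        _, _, w, h = ann["bbox"]  # COCO bbox: [x, y, width, height]
        if h > 0:  # skip degenerate boxes instead of dividing by zero
            ratios[names[ann["category_id"]]].append(w / h)
    return dict(ratios)


coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "car"}],
    "annotations": [
        {"id": 1, "category_id": 1, "bbox": [0, 0, 30, 60]},
        {"id": 2, "category_id": 2, "bbox": [0, 0, 100, 50]},
        {"id": 3, "category_id": 2, "bbox": [0, 0, 10, 0]},
    ],
}
ratios = aspect_ratios(coco)
print(ratios)
```

Values above 1 indicate boxes that are wider than they are tall, and values below 1 indicate the opposite.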

Pixel Intensity Distribution:

Performs pixel intensity distribution analysis for each color channel and plots the distribution.

Arguments:

Type: String, optional (default=Lexset): With the default value of Lexset, only the RGB images that follow the standard Lexset naming convention are analyzed. If "Type" is set to "other", every image in the directory is analyzed.

sample_data.plot_pixel_intensity_distribution()

# or

sample_data.plot_pixel_intensity_distribution("other")
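
A per-channel intensity distribution is a count of how often each value from 0 to 255 occurs in the red, green, and blue channels. The sketch below shows that counting step on a handful of pixels. It assumes 8-bit RGB values and uses a hypothetical helper name, channel_histograms; decoding image files into pixels is left out.

```python
def channel_histograms(pixels):
    """Return 256-bin intensity counts for the R, G and B channels.

    pixels: iterable of (r, g, b) tuples with 8-bit values (0-255).
    """
    hist = {"r": [0] * 256, "g": [0] * 256, "b": [0] * 256}
    for r, g, b in pixels:
        hist["r"][r] += 1
        hist["g"][g] += 1
        hist["b"][b] += 1
    return hist


pixels = [(255, 0, 0), (255, 128, 0), (0, 128, 0), (0, 0, 0)]
hist = channel_histograms(pixels)
print(hist["r"][255], hist["g"][128], hist["b"][0])
```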

Power Spectral Density:

Performs a comparative analysis between two directories containing image files by computing their average Power Spectral Density (PSD).

The function then generates three plots:

  • Average PSD Comparison: This plot contains two subplots, one showing the log-transformed average PSD of the images in your dataset directory (the path passed to analysis) and the other showing the log-transformed average PSD of the images in compare_dir.

  • Difference Map: This plot shows the absolute difference between the two average PSDs. The "hot" colormap is used to highlight the areas where the PSDs differ most.

  • Ratio Map: This plot shows the ratio of the average PSDs. The "coolwarm" colormap is used to highlight the areas of ratio discrepancies. Division by zero is avoided by adding a small constant (1e-8).

Arguments:

compare_dir: String - Path to the directory containing the real images you want to compare your dataset against.

dir1 = "D:/5869/"
dir2 = "D:/real_img/"

# Create an instance of the 'analysis' class
sample_data = analysis(dir1)

sample_data.plot_comparative_psd(compare_dir=dir2)
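
The difference and ratio maps described above can be sketched on tiny inputs. The example below is an illustration only, not the library's implementation: it uses a naive 2D discrete Fourier transform that is practical only for very small images, it skips the averaging over a directory and the log transform, and it assumes the 1e-8 constant is added to the denominator of the ratio. The function names power_spectrum and compare_psd are hypothetical.

```python
import cmath


def power_spectrum(image):
    """Naive 2D DFT power spectrum |F(u, v)|^2 of a small grayscale image."""
    rows, cols = len(image), len(image[0])
    psd = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            psd[u][v] = abs(total) ** 2
    return psd


def compare_psd(psd_a, psd_b, eps=1e-8):
    """Return (absolute difference map, ratio map) for two equal-sized PSDs."""
    diff = [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(psd_a, psd_b)]
    # Assumption: eps is added to the denominator to avoid division by zero.
    ratio = [[a / (b + eps) for a, b in zip(ra, rb)] for ra, rb in zip(psd_a, psd_b)]
    return diff, ratio


flat = [[1, 1], [1, 1]]      # constant image: all energy at frequency (0, 0)
checker = [[1, 0], [0, 1]]   # checkerboard: energy at the highest frequency too
psd_flat = power_spectrum(flat)
psd_checker = power_spectrum(checker)
diff, ratio = compare_psd(psd_flat, psd_checker)
print(diff)
print(ratio)
```

The constant image concentrates its power in the zero-frequency cell, while the checkerboard also has power in the highest-frequency cell, so the two maps differ most in those two positions.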

Structural Similarity Index (SSIM):

The SSIM index ranges from -1 to 1, with a value of 1 indicating that the test image is identical to the reference image. Higher SSIM values generally indicate greater structural similarity and less distortion or difference between the images. This operation will write a JSON file with the results.

Arguments:

compare_dir (str): The directory containing real images.

target_size (tuple, optional): The target size to resize images to for SSIM comparison. Defaults to (256, 256).

dir1 = "D:/5869/"
dir2 = "D:/real_img/"

# Create an instance of the 'analysis' class
sample_data = analysis(dir1)
sample_data.compare_ssim_distributions(compare_dir=dir2, target_size=(256, 256))
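
To show why SSIM falls between -1 and 1, here is a simplified sketch that evaluates the SSIM formula once over whole images. Standard SSIM implementations evaluate it over many local windows and average the results, so this is an illustration of the formula only, not of what compare_ssim_distributions does internally. The helper name global_ssim is hypothetical; the constants 0.01 and 0.03 are the usual defaults from the original SSIM formulation.

```python
def global_ssim(x, y, data_range=255):
    """SSIM over two equal-length pixel sequences, using one global window."""
    n = len(x)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


stripes = [0, 255, 0, 255]
inverted = [255, 0, 255, 0]
same = global_ssim(stripes, stripes)
opposite = global_ssim(stripes, inverted)
print(same, opposite)
```

Comparing an image with itself gives a score of 1 (up to floating-point rounding), while comparing it with its inverse gives a negative score because the two are anti-correlated.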

Fréchet Inception Distance (FID):

Fréchet Inception Distance (FID) is a metric used to evaluate the quality of images generated by generative models, relative to real images. It was proposed as an improvement over older metrics like Inception Score. The FID measures the similarity between two sets of images by examining how a pre-trained model (InceptionV3 in this case) interprets them. The idea is to see if the generated or synthetic images are similar to real images in the eyes of the InceptionV3 model.

Arguments:

compare_dir (str): The directory containing real images.

dir1 = "D:/5869/"
dir2 = "D:/real_img/"

# Create an instance of the 'analysis' class
sample_data = analysis(dir1)
sample_data.calculate_FID(compare_dir=dir2)
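
For reference, the standard definition of FID compares the mean and covariance of the feature vectors that the pre-trained network produces for each image set. Here the subscript r denotes the real images and s the synthetic ones; which InceptionV3 layer calculate_FID reads its features from is not stated on this page.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\left( \Sigma_r \Sigma_s \right)^{1/2} \right)
```

Lower values mean the two sets look more alike to the network, and the score is 0 when both feature distributions have the same mean and covariance.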
