    Manufacturing AI Use Cases: Building a Factory Competitors Can't Replicate

    Explore 5 manufacturing AI data quality challenges and how synthetic data solves them, backed by real cases saving $370,000 annually.
    Pebbly
    Apr 27, 2026
    Contents
    • Why is manufacturing AI important?
    • Strengths of manufacturing AI
    • Changes in synthetic data generation strategies that support manufacturing AI
    • Directions for future manufacturing AI to develop
    • Case Study 1: Three Data Quality Problems and Solutions from a Major Korean Company's Recycling AI Deployment
    • Problem 1: The Challenge of Classifying Varied Materials (Even within the Same Plastic Category)
    • Problem 2: Transparent or Reflective Materials Defeat Standard AI Recognition
    • Problem 3: Even Minor Shape Deformation Causes AI to Lose Recognition
    • Pebblous' other ideas for customers
    • Manufacturing AI with healthy data: what was the effect of deployment?
    • Case Study 2: What did DataClinic analysis of 1 million national industrial waste images reveal?
    • Problem 1: Class Imbalance Leads to Biased Recognition
    • Problem 2: Materials Difficult for Humans to Distinguish Challenge AI Classification

    As a manufacturer, what is your competitive edge?

    The answers vary: plant capacity, core technology, experienced talent. But if you assess honestly, most competitors can match you on at least one of those fronts. The differentiator that is harder to replicate is manufacturing AI.

    A truly intelligent factory captures your operational know-how as structured data. Given the right equipment and sensor infrastructure, AI can optimize both utilization rates and defect rates in real time.

    🤖

    Manufacturing AI makes your competitive advantages harder to replicate. It builds a system that amplifies your existing strengths: faster throughput, more consistent quality, and sustained performance over time.


    Why is manufacturing AI important?

    A measurable performance gap is already forming between companies that have embedded AI deeply into their processes and those that have not. It may appear marginal today, but the compounding effect over time is significant. Two manufacturers in the same industry, with comparable facilities, can end up with very different productivity and margin profiles depending on whether AI is in the loop.

    Tesla is the clearest example of this shift. With its Gigafactory, Tesla is simultaneously driving down cost and improving margin in its electric vehicle mass production system.

    Tesla Gigafactory
    • Tesla's Gigafactory is not the same as what is commonly called an automated factory. It is not just a factory where robots replace people, but the entire factory is designed to learn and improve on its own based on data. It is also called the "Software-Defined Factory."

    • This goes well beyond integrating an LLM into a manufacturing workflow. The facility operates as a single system driven by manufacturing AI. As design, production, and operations become connected through shared data, and that data accumulates over time, the system compounds in intelligence. The factory gets measurably smarter the longer it runs.

    🏭

    This points to a critical insight: applying AI to individual processes in isolation does not produce structural gains. Real results accumulate only when the entire factory is redesigned around data, with an operating model that lets AI intervene in real time.

    📊

    Underlying all of this is one non-negotiable prerequisite: data quality. AI makes decisions based on data and learns from data. Errors in that data produce bad decisions; biased data produces biased models. No matter how sophisticated the model architecture, poor-quality data turns a promising system into one that compounds mistakes rather than improves outcomes.


    Strengths of manufacturing AI

    A well-implemented manufacturing AI delivers tangible operational improvements across three areas.

    • Predictive maintenance: AI monitors machine sensor data continuously and flags anomalies before failures occur. By learning from historical field data, it identifies the conditions that precede breakdowns, reducing unplanned downtime and maintenance costs. Operational logs provide a clear audit trail for root cause analysis after any incident.

    • Optimizing production: Manufacturing AI detects visual defects that human inspectors miss. It automatically schedules production to eliminate bottlenecks, balancing inventory levels, order timelines, and machine utilization simultaneously to minimize waste. Energy management systems monitor plant-wide power consumption in real time and shift intensive operations to off-peak hours when electricity rates are lower.

    • Safety Management: Manufacturing AI enables cleaner human-machine collaboration by dividing tasks according to each party's strengths. Even well-designed robotic systems can pose safety risks in shared workspaces. When integrated with smart manufacturing AI, these systems include real-time collision detection, motion recognition, and automatic shutoff, allowing robots and workers to operate in close proximity without incident.
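The predictive-maintenance pattern described above can be sketched as a simple anomaly detector over a stream of sensor readings. This is a minimal illustration, not an actual production pipeline; the window size and z-score threshold are arbitrary assumptions.

```python
import statistics

def flag_anomalies(readings, window=20, z_threshold=3.0):
    """Flag readings that deviate sharply from the recent rolling baseline."""
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and abs(readings[i] - mean) / stdev > z_threshold:
            flagged.append(i)  # candidate precursor to a failure
    return flagged

# Stable vibration signal with one sudden spike at index 30
signal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 5 + [9.0]
print(flag_anomalies(signal))  # [30]
```

In practice, the baseline would be learned per machine and per operating mode, but the core idea is the same: compare each new reading against the conditions that historically preceded normal operation.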


    Changes in synthetic data generation strategies that support manufacturing AI

    "Is the data in our factory now of sufficient quality for AI to learn?"

    This is the right question to be asking. It is also the starting point for improving the data infrastructure that underlies manufacturing AI. At the center of that infrastructure is synthetic data. The following covers how synthetic data strategies need to evolve as manufacturing AI matures, and what that means for how AI systems perform in the field.

    Directions for future manufacturing AI to develop

    • Neuro-Symbolic AI enables models to evaluate both object attributes and scene context simultaneously. Earlier approaches assessed object properties in isolation. Modern systems understand the full scene and the relationships between objects as a unified whole.

    • When assessing properties alone, the output might be "Transparent and deformed A part." By considering the full scene context, the system can instead infer "Transparent and deformed A part, overlapped with B part, with specular reflection obscuring the boundary."

    • Manufacturing AI needs to move beyond surface-level visual recognition and assess internal attributes: material composition, optical reflectance, light absorption, contamination levels, and structural deformation.

    • It also requires a multimodal sensing approach, combining NIR, hyperspectral imaging, and X-ray rather than relying on a single RGB camera. Training datasets must reflect the full complexity of real production environments.
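The multimodal sensing idea above can be sketched as channel-level fusion: stacking co-registered RGB, NIR, and hyperspectral bands into a single input tensor before classification. The image size and band counts here are illustrative assumptions, not a real sensor specification.

```python
import numpy as np

def fuse_modalities(rgb, nir, hyperspectral):
    """Stack co-registered sensor channels into one (H, W, C) input tensor.

    Assumes all modalities are already aligned to the same pixel grid;
    in practice each sensor needs calibration and registration first.
    """
    if nir.ndim == 2:               # promote single-band NIR to (H, W, 1)
        nir = nir[..., np.newaxis]
    return np.concatenate([rgb, nir, hyperspectral], axis=-1)

rgb = np.zeros((64, 64, 3))         # standard color camera
nir = np.zeros((64, 64))            # single near-infrared band
hsi = np.zeros((64, 64, 16))        # 16 illustrative hyperspectral bands
fused = fuse_modalities(rgb, nir, hsi)
print(fused.shape)  # (64, 64, 20)
```

A classifier trained on the fused tensor can then exploit spectral signatures that are invisible to an RGB-only camera.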


    Case Study 1: Three Data Quality Problems and Solutions from a Major Korean Company's Recycling AI Deployment

    The following is drawn from a proof-of-concept engagement with a parent company of Pebblous. The project focused on improving data quality for an AI-powered waste sorting system deployed at a public recycling facility, specifically targeting waste plastics classification.

    Pebblous has since continued this work across multiple government-sponsored manufacturing AI programs and its own R&D initiatives.

    • Before AI: Sorting operations were entirely manual. When COVID-19 quarantine mandates prevented workers from reporting to facilities, operations ground to a halt. The urgency for AI automation became undeniable.

    • After initial AI deployment: When AI was introduced on the floor, new limitations surfaced immediately. Existing models had been trained on outdoor or generic waste imagery and failed to handle the complexity of an actual sorting facility environment.

    That gap in performance is what led the company to engage Pebblous DataClinic.

    Synthetic waste plastic data generated by Pebblous

    Problem 1: The Challenge of Classifying Varied Materials (Even within the Same Plastic Category)

    This is particularly relevant for manufacturers producing a broad range of product types within a single facility.

    • PET: Water bottles, beverage bottles
    • HDPE: Shampoo bottles, detergent containers
    • ECAL: Milk cartons, juice cartons (composite: paper + aluminum + plastic)
    • PET Oil: Cooking oil bottles
    • Mixed Soft Plastic: Plastic bags, snack packaging
    • Mixed Rigid Plastic: Plastic caps, toys
    • Cardboard: Shipping boxes, paper cartons
    • Metal: Cans, aluminum

    In this case, materials were classified into eight types. Depending on product characteristics, variation extends beyond material composition to include color and surface condition. A given item may appear transparent, white, black, painted, or contaminated.

    Synthesis data of waste plastics generated by automation of metal application by Pebblous

    ⬆️

    In nearly every manufacturing AI engagement Pebblous has worked on, this is where problems originate. The number of possible condition combinations grows combinatorially, quickly reaching thousands of distinct cases depending on the product range.

    • These combinations are not evenly distributed in real-world data. Some occur so rarely that collecting enough examples would mean waiting months for them to appear on the line.

    • This is exactly where manufacturing AI becomes biased: combinations with abundant data are recognized quickly, while rare combinations are not clearly distinguished and cause errors.

    • A further complication: some items, like juice cartons, are composite materials. The outer layer may be PE while the interior is an entirely different material. This cannot be determined by visual inspection alone.

    Solution: Intentionally mass-produce rare cases, with composite material information built in

    DataClinic can intentionally mass-produce any desired combination by setting materials, colors, and contamination conditions to target values. Thousands of combinations that are difficult to collect in reality can be secured this way, which eliminates the performance imbalance in which the AI recognizes only the well-represented combinations.

    For composite objects, DataClinic builds the layer composition into the data at generation time, so each finished image automatically carries metadata such as "PE on the outside, composite layers on the inside."
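The combinatorial explosion is easy to see with a small enumeration. The attribute lists below are illustrative, not the project's actual taxonomy; the point is that targeted synthetic generation can fill whichever cells of this grid real collection leaves underrepresented.

```python
from itertools import product

materials = ["PET", "HDPE", "ECAL", "PET Oil",
             "Mixed Soft Plastic", "Mixed Rigid Plastic", "Cardboard", "Metal"]
colors = ["transparent", "white", "black", "painted"]
surfaces = ["clean", "contaminated"]
deformations = ["intact", "crushed", "torn"]

combinations = list(product(materials, colors, surfaces, deformations))
print(len(combinations))  # 8 * 4 * 2 * 3 = 192 cases from just four attributes

# Real collection leaves many cells nearly empty; synthetic generation
# targets exactly those cells, e.g. all black/contaminated variants:
rare = [c for c in combinations if c[1] == "black" and c[2] == "contaminated"]
print(len(rare))  # 24
```

Add a few more attributes (print state, label presence, lighting) and the grid reaches the thousands of cases described above.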

    Problem 2: Transparent or Reflective Materials Defeat Standard AI Recognition

    A significant portion of the materials at sorting facilities are transparent or reflective, and standard AI models struggle to accurately identify them. 

    • Transparent PET and Vinyl: Both are among the most common materials at sorting facilities. When transparent vinyl is layered over a transparent PET bottle, AI models frequently fail to register either as a distinct object, treating the combined item as background rather than a sorting target.

    • Metal Cans: The reflective surface of metal cans mirrors the surrounding environment. AI models have difficulty defining accurate object boundaries when specular reflections distort the visual signal.

    • Black Plastic (Carbon Black): Carbon-black plastics absorb nearly all incident light. Standard camera sensors rely on reflected light for object detection, so when reflectance approaches zero, the sensor effectively cannot detect the object.

    Synthetic waste plastic data generated by Pebblous by applying morphological deformation simulation

    Solution: Precise Simulation of Materials and Surface Gloss (PBR)

    "Do I have to sort them manually in this case?" 

    Not necessarily. Augmenting training data with synthetic examples that accurately simulate these optical properties eliminates the need for manual sorting in these cases.

    • Physically Based Rendering (PBR) enables accurate simulation of surface properties, from the transparency of a water bottle to the opacity of a detergent container. PBR models how light physically reflects, refracts, and absorbs across different materials according to actual physical laws. It is the same rendering technology used in high-end film visual effects.

    • A complementary approach combines NIR (near-infrared), hyperspectral imaging, and X-ray sensing to characterize materials by chemical composition rather than appearance. This enables differentiation between visually identical materials like PET and PS that are chemically distinct.

    • Carbon black presents a separate challenge: because it absorbs light completely, standard optical sensors cannot detect it at all. The only viable solution is upgrading to a more sensitive modality, such as MIR or Raman spectroscopy, and generating synthetic training data calibrated to that sensor's characteristics.
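As a taste of what PBR computes, the sketch below implements Schlick's approximation of Fresnel reflectance, one of the standard building blocks for rendering transparent and glossy surfaces. It is a textbook formula, not Pebblous' renderer, and the refractive indices are illustrative.

```python
import math

def schlick_fresnel(cos_theta, n1=1.0, n2=1.5):
    """Schlick's approximation: fraction of light reflected at an interface.

    n1/n2 are the refractive indices of the two media
    (1.0 = air, ~1.5 = typical clear plastic such as PET).
    """
    r0 = ((n1 - n2) / (n1 + n2)) ** 2  # reflectance at normal incidence
    return r0 + (1.0 - r0) * (1.0 - cos_theta) ** 5

# Head-on, a clear plastic surface reflects only ~4% of incoming light
print(round(schlick_fresnel(1.0), 2))  # 0.04

# At grazing angles reflectance climbs steeply, which is why transparent
# bottles show strong highlights along their edges
print(round(schlick_fresnel(math.cos(math.radians(85))), 2))  # 0.65
```

Rendering synthetic training images with physically plausible reflectance like this is what lets a model learn that a bright specular streak is part of a transparent bottle, not a separate object.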

    Problem 3: Even Minor Shape Deformation Causes AI to Lose Recognition

    "What's this?"

    Waste items are routinely crushed, torn, and tangled during collection and transport. When a model encounters a severely flattened PET bottle, it often fails to recognize it. A human immediately understands it as a damaged version of a familiar object. To the AI, it is an entirely unfamiliar shape.

    Synthetic waste plastic data generated by reflecting various shapes of containers
    Pebblous' synthetic data for resource recycling Infinite Shape Deformation Simulation of Plastic Bottles.

    💥

    When a crumpled or contaminated item triggers a model failure, the output may appear to be a recognition error. However, the deformed item is still a valid classification target. The AI must be trained to treat deformation and contamination as expected characteristics of the material class, not anomalies to be rejected.

    ‼️

    This becomes even more critical in the context of physical AI, where robotic systems must do more than classify objects. A standard plastic bottle and a crushed one require entirely different gripping strategies. If the robot can identify the item but cannot determine how to grip it in its deformed state, the classification step is effectively useless. Object recognition and grasp planning must be developed together.

    Solution: infinite shape deformation simulation

    Physically collecting, photographing, and labeling crushed plastics at scale is impractical. Using DataClinic's data quality management tools, Pebblous generated a large, diverse set of deformation states, including crumpled, curved, and torn variations, in a virtual environment. This enables AI models to recognize objects across the full deformation range encountered in real-world operations.

    Several techniques from active research areas were applied to address the shape deformation problem.

    • Gaussian Splatting / Diffusion Models: Restore the original 3D shape of a deformed object.

    • UOIS (Unseen Object Instance Segmentation): Segments object instances the model has never encountered before.

    • Amodal Segmentation: At a recycling site, items are often partially covered by other objects. Amodal segmentation infers the occluded regions to recover each object's complete shape.
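A minimal version of deformation augmentation can be sketched as random smooth displacement of an object's contour points. Real pipelines simulate physics on 3D meshes; this 2D sketch only shows the principle, and the amplitude and vertex count are invented parameters.

```python
import math
import random

def deform_contour(points, amplitude=0.15, seed=0):
    """Radially perturb a closed 2D contour to simulate crushing/denting.

    points: list of (x, y) vertices around the object centroid.
    amplitude: max fractional change in each vertex's distance from center.
    """
    rng = random.Random(seed)  # seeded for reproducible variants
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    deformed = []
    for x, y in points:
        scale = 1.0 + rng.uniform(-amplitude, amplitude)
        deformed.append((cx + (x - cx) * scale, cy + (y - cy) * scale))
    return deformed

# A circular "bottle cross-section" with 12 vertices, then 3 crush variants
circle = [(math.cos(a), math.sin(a))
          for a in [i * 2 * math.pi / 12 for i in range(12)]]
variants = [deform_contour(circle, seed=s) for s in range(3)]
print(len(variants), len(variants[0]))  # 3 12
```

Sweeping the seed and amplitude yields an effectively unlimited supply of distinct deformation states from a single reference shape, which is the core of the "infinite simulation" idea.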


    Pebblous' other ideas for customers

    Pebblous operates as a data infrastructure group. The focus is not on individual data assets in isolation, but on the full infrastructure that governs how data is generated, managed, and operationalized.

    The deployment in this case was not a fully automated environment. Workers and AI shared the floor, each handling the tasks they perform most effectively. Dividing responsibilities between human operators and AI systems reduces individual workload and improves overall throughput.

    But what if AI could go further and assist workers more efficiently than ever before?

    This thinking led to the Projector Collaboration System, a concept that received formal recognition from one of Pebblous' enterprise clients.

    • A computer vision system and a projector are co-mounted above the conveyor belt. The projector overlays a light indicator directly onto the objects the AI has flagged as priority targets.

    • Rather than scanning every item on the belt, the operator's attention is directed precisely to where it is needed. The AI handles the filtering; the human handles the judgment.

    • This collaboration model improves sorting accuracy while reducing operator cognitive load.
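Mapping an AI detection from camera pixels onto the conveyor belt via the projector is, at its core, a homography transform. The sketch below uses a hand-made 3x3 matrix as a stand-in; in a real installation this matrix would come from calibrating the camera-projector pair.

```python
import numpy as np

def camera_to_projector(point, H):
    """Map a camera-pixel coordinate into projector coordinates via homography H."""
    x, y = point
    px, py, w = H @ np.array([x, y, 1.0])
    return (px / w, py / w)  # perspective divide

# Illustrative calibration: projector frame is a scaled, shifted camera frame
H = np.array([[0.5, 0.0, 100.0],
              [0.0, 0.5,  50.0],
              [0.0, 0.0,   1.0]])

# Center of a bounding box the AI flagged as a priority target
bbox_center = (640.0, 360.0)
print(camera_to_projector(bbox_center, H))  # (420.0, 230.0)
```

Each frame, the system would transform every flagged bounding box this way and have the projector draw its light indicator at the resulting coordinates.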


    Manufacturing AI with healthy data: what was the effect of deployment?

    Here is how the outcomes of this deployment broke down across five dimensions.

    1. Improved recognition accuracy

    The contamination rate dropped from a range of 15-25% down to 5%. The system now accurately identifies crushed and contaminated materials that previously slipped through, recovering recyclable volume that would otherwise go to landfill or incineration. Improved handling of optically challenging materials also reduced overall misclassification rates.

    2. Improved gripping success rate

    Previously, robots encountered crushed PET bottles and failed to determine a viable grip point, resulting in missed attempts and line stoppages. With training data covering a wide range of shape and texture variations, the AI now identifies the optimal grip method immediately and keeps the line moving.

    3. Data deployment time and cost reduced by 80%

    There is no longer a need to capture and label physical samples on-site over a period of months. DataClinic generates and automatically labels synthetic data in a virtual environment, producing high-quality, balanced datasets in a fraction of the time.

    While exact project figures remain confidential, the following is a conservative estimate based on comparable industry benchmarks. Combined, the factors below represent approximately $370,000 in annual savings for a facility at this scale.

    🪣

    Using a mid-sized sorting facility processing 500 tons per month as a baseline: reducing the contamination rate from 20% to 5% through synthetic data-augmented AI converts 75 tons per month from waste disposal to recyclable material. At a treatment cost of approximately $220 per ton, this represents roughly $200,000 in annual savings.

    💸

    If AI reallocates 10 of 30 workers across three shifts to higher-value tasks, the direct labor cost reduction represents an additional $180,000 per year.
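The two estimates above can be checked with straightforward arithmetic. All inputs are the article's stated assumptions; the computed total comes out slightly above the conservatively rounded $370,000 headline figure.

```python
# Disposal-cost savings from the contamination-rate reduction
monthly_tons = 500
rate_before_pct, rate_after_pct = 20, 5   # contamination rate, percent
cost_per_ton = 220                        # USD treatment cost per ton

recovered_tons_per_month = monthly_tons * (rate_before_pct - rate_after_pct) // 100
disposal_savings = recovered_tons_per_month * cost_per_ton * 12  # per year
print(recovered_tons_per_month, disposal_savings)  # 75 198000

# Labor reallocation savings (article's stated figure: 10 of 30 workers)
labor_savings = 180_000

total = disposal_savings + labor_savings
print(total)  # 378000, consistent with the ~$370,000 headline estimate
```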

    4. Sustainable operation

    Robotic systems take over the sorting tasks with the highest injury and exposure risk, removing workers from direct contact with odors, pollutants, and mechanical hazards. The facility, previously constrained to a three-shift human schedule, becomes capable of continuous 24/7 operation.

    5. Visualize ESG performance

    For organizations where ESG metrics factor into investor and regulatory reporting, the impact extends beyond operations.

    • Previously: Companies had no concrete data to substantiate environmental performance claims, regardless of how strong their actual recycling operations were.

    • Now: Operating on a data foundation means recycling rates, carbon savings, and waste diversion volumes can be quantified and reported with precision, giving organizations a credible, auditable basis for their circular economy claims.

    View the Data Quality Diagnostic Report for Recycling AI

    Case Study 2: What did DataClinic analysis of 1 million national industrial waste images reveal?

    The second case examines a dataset distributed through AI Hub: Korea's largest industrial waste image dataset, comprising one million high-resolution images across 72 waste categories sourced from factories and industrial facilities.

    Despite its scale, a direct analysis of this dataset using Pebblous DataClinic surfaced several significant quality issues.

    Industrial Waste Image Dataset Collage - 72 different types of wastes, including metals, waste fibers, glass ceramics, synthetic resins, etc.

    Problem 1: Class Imbalance Leads to Biased Recognition

    The most significant issue is severe class imbalance. The average class contains 11,257 images, but the distribution is extremely uneven. The smallest class has just 20 images, while the synthetic resin-vinyl class contains 79,560. That is a ratio of nearly 4,000 to 1 within the same dataset.

    Industrial Waste Image Dataset - Graph comparing maximum, average, and minimum class sizes

    Severe imbalance produces a biased model. Classes with high sample counts are learned well. Classes with only 20 samples receive almost no meaningful training signal. When those underrepresented materials appear on the actual sorting floor, the AI either misclassifies them or fails to register them entirely.

    One million images total. But what drives model performance is not the aggregate count. It is how that data is distributed across classes. Balance matters more than volume.

    Solution: Reinforcing Minority Classes and Refining Majority Classes

    Two interventions address this problem.

    1. Reinforce minority classes with synthetic data generation

    • Each minority class needs to be brought up to a level sufficient for the model to learn meaningful patterns. Based on DataClinic's analysis of this dataset, a minimum of 5,000 samples per class, representing at least 50% of the class average, is the appropriate threshold. At that level, adequate data diversity is achievable.

    • This threshold varies by case. Depending on the complexity of the class and the degree of intra-class variation, the appropriate minimum may be higher or lower than 5,000.

    2. Slim down majority classes by removing redundant data

    • Deduplicate and remove near-identical images from the dominant 79,560-image class. This significantly reduces raw data volume, but the model gains the ability to learn all classes at a comparable depth. The tradeoff clearly favors balance.
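The rebalancing recipe above can be expressed directly in code. The class counts echo the figures in the text; the class names are placeholders, and the 5,000-sample floor is DataClinic's stated threshold for this dataset, applied here as a simple rule.

```python
counts = {
    "synthetic resin-vinyl": 79_560,  # largest class
    "average-sized class": 11_257,
    "smallest class": 20,
}

# Imbalance ratio reported in the text: nearly 4,000 to 1
print(max(counts.values()) // min(counts.values()))  # 3978

MIN_SAMPLES = 5_000  # floor per class, roughly 50% of the class average

# How many synthetic samples each underrepresented class needs
to_generate = {name: MIN_SAMPLES - n
               for name, n in counts.items() if n < MIN_SAMPLES}
print(to_generate)  # {'smallest class': 4980}
```

The majority-class side of the recipe (deduplicating the 79,560-image class) would run as a separate near-duplicate detection pass; no fixed target count is given in the text, so none is assumed here.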

    Problem 2: Materials Difficult for Humans to Distinguish Challenge AI Classification

    Consider the following:

    What material do you think the two objects below are made of?

    Glass Ceramics - Other
    Glass Ceramics - Window glass

    The second image is recognizable as glass. The first is considerably less obvious.

    Both belong to the same major category: glass ceramics, classified as "glass ceramics: other" and "glass ceramics: window glass," respectively.

    🫙

    Both items share the same material class, yet their visual appearance is dramatically different. A human with domain knowledge can bridge that gap with context. An AI model trained without sufficient visual diversity across a class cannot. A model that saw only typical glass samples will classify these as separate materials.

    The purpose of classification AI in recycling and manufacturing is to route identical materials to the same destination. When a model cannot reconcile visual dissimilarity with material identity, that routing logic breaks down.

    Solution: Redesigning Class Labels Based on Visual Characteristics

    This is solvable.

    Counterintuitively, the existence of visually distinct examples within the same material class is a training opportunity. A model improves its class understanding through exposure to diverse images, not through repetition of similar ones.

    The model faces the same challenge a reader did when confronted with that quiz. The solution is to decompose broad class labels into visually distinct sub-classes. Domain experts must define these boundaries and apply granular labels based on their knowledge of material characteristics.

    Go to the Quality Assistance Report for 1 Million Industrial Waste Data


    Companies that are ready for manufacturing AI, and companies that are not: the gap between them will compound over the next five to ten years.

    Many organizations recognize the urgency and are moving to adopt manufacturing AI. Most are stalling at the same point: data quality.

    In Pebblous' work alongside Hyundai Motor, Samsung E&A, and LG Electronics on physical AI and manufacturing AI initiatives, one consistent truth has emerged: good manufacturing AI comes from good data. And good data for manufacturing AI means data that accurately reflects the complexity of real production environments.

    In March 2026, at Korea's Data Insight & Security Summit (DISS), Pebblous CEO Lee Ju-haeng shared the company's operational methodology and field experience in a keynote titled "Agentic Data Clinic Solution to Solve Data Bottlenecks in the Age of Physical AI."

    Among the 14 keynote presenters, Pebblous ranked first in both "most compelling solution to adopt" and "most requested for a consultation," a reflection of the company's track record solving these problems at the operational level.

    Ranked #1 Solution for Adoption & Consultation Agentic Data Clinic

    If you are working to build manufacturing AI capabilities that improve both operational performance and technical sophistication at your facility, Pebblous can develop a tailored solution based on the approach described in this analysis.

    As a manufacturer, what is your competitive edge?

    The answers vary: plant capacity, core technology, experienced talent. But if you assess honestly, most competitors can match you on at least one of those fronts. The differentiator that is harder to replicate is manufacturing AI.

    A truly intelligent factory captures your operational know-how as structured data. Given the right equipment and sensor infrastructure, AI can optimize both utilization rates and defect rates in real time.

    🤖

    Manufacturing AI makes your competitive advantages harder to replicate. It builds a system that amplifies your existing strengths: faster throughput, more consistent quality, and sustained performance over time.


    Why is manufacturing AI important?

    A measurable performance gap is already forming between companies that have embedded AI deeply into their processes and those that have not. It may appear marginal today, but the compounding effect over time is significant. Two manufacturers in the same industry, with comparable facilities, can end up with very different productivity and margin profiles depending on whether AI is in the loop.

    Tesla is the clearest example of this shift. With its Gigafactory, Tesla is simultaneously improving both the cost and margin in its electric vehicle mass production system.

    Tesla Gigafactory
    Tesla Gigafactory
    • Tesla's Gigafactory is not the same as what is commonly called an automated factory. It is not just a factory where robots replace people, but the entire factory is designed to learn and improve on its own based on data. It is also called the "Software-Defined Factory."

    • This goes well beyond integrating an LLM into a manufacturing workflow. The facility operates as a single system driven by manufacturing AI. As design, production, and operations become connected through shared data, and that data accumulates over time, the system compounds in intelligence. The factory gets measurably smarter the longer it runs.

    🏭

    This points to a critical insight: applying AI to individual processes in isolation does not produce structural gains. Real results accumulate only when the entire factory is redesigned around data, with an operating model that lets AI intervene in real time.

    📊

    Underlying all of this is one non-negotiable prerequisite: data quality. AI makes decisions based on data and learns from data. Errors in that data produce bad decisions; biased data produces biased models. No matter how sophisticated the model architecture, poor-quality data turns a promising system into one that compounds mistakes rather than improves outcomes.


    Strengths of manufacturing AI

    A well-implemented manufacturing AI delivers tangible operational improvements across three areas.

    • Preservation: AI monitors machine sensor data continuously and flags anomalies before failures occur. By learning from historical field data, it identifies the conditions that precede breakdowns, reducing unplanned downtime and maintenance costs. Operational logs provide a clear audit trail for root cause analysis after any incident.

    • Optimizing production: Manufacturing AI detects visual defects that human inspectors miss. It automatically schedules production to eliminate bottlenecks, balancing inventory levels, order timelines, and machine utilization simultaneously to minimize waste. Energy management systems monitor plant-wide power consumption in real time and shift intensive operations to off-peak hours when electricity rates are lower.

    • Safety Management: Manufacturing AI enables cleaner human-machine collaboration by dividing tasks according to each party's strengths. Even well-designed robotic systems can pose safety risks in shared workspaces. When integrated with smart manufacturing AI, these systems include real-time collision detection, motion recognition, and automatic shutoff, allowing robots and workers to operate in close proximity without incident.


    Changes in synthetic data generation strategies that support manufacturing AI

    "Is the data in our factory now of sufficient quality for AI to learn?"

    This is the right question to be asking. It is also the starting point for improving the data infrastructure that underlies manufacturing AI. At the center of that infrastructure is synthetic data. The following covers how synthetic data strategies need to evolve as manufacturing AI matures, and what that means for how AI systems perform in the field.

    Directions for future manufacturing AI to develop

    • Neuro-Symbolic AI enables models to evaluate both object attributes and scene context simultaneously. Earlier approaches assessed object properties in isolation. Modern systems understand the full scene and the relationships between objects as a unified whole.

    • When assessing properties, the output might be "Transparent and modified A part." However, by considering the full scene context, you can infer that it is a "Transparent and modified A part, overlapped with B part, where reflective light is blocking the boundaries.”

    • Manufacturing AI needs to move beyond surface-level visual recognition and assess internal attributes: material composition, optical reflectance, light absorption, contamination levels, and structural deformation.

    • It also requires a multimodal sensing approach, combining NIR, hyperspectral imaging, and X-ray rather than relying on a single RGB camera. Training datasets must reflect the full complexity of real production environments.
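    The multimodal sensing point above can be sketched as simple channel fusion: aligned captures from each sensor are stacked into one training tensor. The image sizes and channel layout here are assumptions for illustration only.

```python
# Sketch of multimodal fusion: combine RGB, NIR, and X-ray captures
# of the same scene into one multi-channel training sample.
# Shapes and channel order are illustrative assumptions.
import numpy as np

h, w = 64, 64
rgb  = np.random.rand(h, w, 3)  # standard color camera
nir  = np.random.rand(h, w, 1)  # near-infrared intensity
xray = np.random.rand(h, w, 1)  # transmission image

sample = np.concatenate([rgb, nir, xray], axis=-1)
print(sample.shape)  # (64, 64, 5): five channels per pixel
```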


    Case Study 1: Three Data Quality Problems and Solutions from a Major Korean Company's Recycling AI Deployment

    The following is drawn from a proof-of-concept engagement with a parent company of Pebblous. The project focused on improving data quality for an AI-powered waste sorting system deployed at a public recycling facility, specifically targeting waste plastics classification.

    Pebblous has since continued this work across multiple government-sponsored manufacturing AI programs and its own R&D initiatives.

    • Before AI: Sorting operations were entirely manual. When COVID-19 quarantine mandates prevented workers from reporting to facilities, operations ground to a halt. The urgency for AI automation became undeniable.

    • After initial AI deployment: When AI was introduced on the floor, new limitations surfaced immediately. Existing models had been trained on outdoor or generic waste imagery and failed to handle the complexity of an actual sorting facility environment.

    That gap in performance is what led the company to engage Pebblous DataClinic.

    Synthetic waste plastic data generated by Pebblous

    Problem 1: The Challenge of Classifying Varied Materials (Even within the Same Plastic Category)

    This is particularly relevant for manufacturers producing a broad range of product types within a single facility.

    • PET: Water bottles, beverage bottles

    • HDPE: Shampoo bottles, detergent containers

    • ECAL: Milk cartons, juice cartons (composite: paper + aluminum + plastic)

    • PET Oil: Cooking oil bottles

    • Mixed Soft Plastic: Plastic bags, snack packaging

    • Mixed Rigid Plastic: Plastic caps, toys

    • Cardboard: Shipping boxes, paper cartons

    • Metal: Cans, aluminum

    In this case, materials were classified into eight types. Depending on product characteristics, variation extends beyond material composition to include color and surface condition. A given item may appear transparent, white, black, painted, or contaminated.

    Synthetic waste plastic data generated by Pebblous via automated metal application

    ⬆️

    In nearly every manufacturing AI engagement Pebblous has worked on, this is where problems originate. The number of possible condition combinations grows exponentially, scaling to thousands of distinct cases depending on the product range.

    • However, these combinations are not evenly distributed in real-world data. Certain combinations are significantly underrepresented, and collecting them naturally can take months.

    • This is where manufacturing AI becomes biased: combinations with abundant data are recognized quickly, while rare combinations are poorly distinguished and produce errors.

    • A further complication: some items, like juice cartons, are composite materials. The outer layer may be PE while the interior is an entirely different material. This cannot be determined by visual inspection alone.

    Solution: Intentional mass production of rare cases, built-in composite material information

    DataClinic can intentionally mass-produce desired combinations by setting materials, colors, and contamination conditions to specified values. Thousands of combinations that are difficult to collect in reality can be secured, resolving the performance imbalance in which the AI recognizes only certain combinations.

    In addition, for composite objects, DataClinic embeds the composite material information from the start when generating the data. The finished image then automatically carries the annotation "PE on the outside, composite on the inside."
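    As a rough illustration of this combination-generation idea (this is not DataClinic's actual API), one can enumerate the condition grid and compute how many synthetic samples each underrepresented combination still needs. All names and counts below are hypothetical.

```python
# Illustrative sketch (not DataClinic's API): enumerate the
# material x color x contamination grid and count how many synthetic
# samples each rare combination needs to reach a target.
from itertools import product

materials     = ["PET", "HDPE", "ECAL", "Mixed Rigid Plastic"]
colors        = ["transparent", "white", "black", "painted"]
contamination = ["clean", "soiled"]

# Hypothetical counts of real samples already collected per combo.
collected = {("PET", "transparent", "clean"): 1200,
             ("HDPE", "white", "soiled"): 8}

target = 500  # assumed per-combination minimum
plan = {combo: target - collected.get(combo, 0)
        for combo in product(materials, colors, contamination)
        if collected.get(combo, 0) < target}

print(len(plan))                          # 31 combos need synthesis
print(plan[("HDPE", "white", "soiled")])  # 492 more samples
```

    The point of the sketch is the shape of the problem: the grid grows multiplicatively with each new attribute, while real collection fills only a few cells.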

    Problem 2: Transparent or Reflective Materials Defeat Standard AI Recognition

    A significant portion of the materials at sorting facilities are transparent or reflective, and standard AI models struggle to accurately identify them. 

    • Transparent PET and Vinyl: Both are among the most common materials at sorting facilities. When transparent vinyl is layered over a transparent PET bottle, AI models frequently fail to register either as a distinct object, treating the combined item as background rather than a sorting target.

    • Metal Cans: The reflective surface of metal cans mirrors the surrounding environment. AI models have difficulty defining accurate object boundaries when specular reflections distort the visual signal.

    • Black Plastic (Carbon Black): Carbon-black plastics absorb nearly all incident light. Standard camera sensors rely on reflected light for object detection, so when reflectance approaches zero, the sensor effectively cannot detect the object.

    Synthetic waste plastic data generated by Pebblous by applying morphological deformation simulation

    Solution: Precision Implementation of Materials and Glosses (PBR)

    "Do I have to sort them manually in this case?" 

    Not necessarily. Augmenting training data with synthetic examples that accurately simulate these optical properties eliminates the need for manual sorting in these cases.

    • Physically Based Rendering (PBR) enables accurate simulation of surface properties, from the transparency of a water bottle to the opacity of a detergent container. PBR models how light physically reflects, refracts, and absorbs across different materials according to actual physical laws. It is the same rendering technology used in high-end film visual effects.

    • A complementary approach combines NIR (near-infrared), hyperspectral imaging, and X-ray sensing to characterize materials by chemical composition rather than appearance. This enables differentiation between visually identical materials like PET and PS that are chemically distinct.

    • Carbon black presents a separate challenge: because it absorbs light completely, standard optical sensors cannot detect it at all. The only viable solution is upgrading to a more sensitive modality, such as MIR or Raman spectroscopy, and generating synthetic training data calibrated to that sensor's characteristics.
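    As background for the PBR point above: one physical building block of such renderers is the Fresnel reflectance term, commonly approximated with Schlick's formula. The refractive index below is a generic value for clear plastic, not a project-specific figure.

```python
# Schlick's approximation of Fresnel reflectance, a core term in
# physically based rendering (PBR). n2=1.5 is a generic refractive
# index for clear plastic, used here purely for illustration.
import math

def schlick_fresnel(cos_theta, n1=1.0, n2=1.5):
    """Fraction of light reflected at an n1 -> n2 interface
    when the incidence angle has cosine `cos_theta`."""
    r0 = ((n1 - n2) / (n1 + n2)) ** 2
    return r0 + (1 - r0) * (1 - cos_theta) ** 5

# Head-on, a clear plastic surface reflects about 4% of light...
print(round(schlick_fresnel(1.0), 3))  # 0.04
# ...but at a steep 80-degree angle it becomes far more mirror-like.
print(round(schlick_fresnel(math.cos(math.radians(80))), 3))  # 0.41
```

    This angle dependence is exactly why transparent and glossy items look so different from frame to frame, and why synthetic data must model it physically rather than paint it on.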

    Problem 3: Even Minor Shape Deformation Causes AI to Lose Recognition

    "What's this?"

    Waste items are routinely crushed, torn, and tangled during collection and transport. When a model encounters a severely flattened PET bottle, it often fails to recognize it. A human immediately understands it as a damaged version of a familiar object. To the AI, it is an entirely unfamiliar shape.

    Synthetic waste plastic data generated by reflecting various shapes of containers
    Pebblous' synthetic data for resource recycling: infinite shape deformation simulation of plastic bottles

    💥

    When a crumpled or contaminated item triggers a model failure, the output may appear to be a recognition error. However, the deformed item is still a valid classification target. The AI must be trained to treat deformation and contamination as expected characteristics of the material class, not anomalies to be rejected.

    ‼️

    This becomes even more critical in the context of physical AI, where robotic systems must do more than classify objects. A standard plastic bottle and a crushed one require entirely different gripping strategies. If the robot can identify the item but cannot determine how to grip it in its deformed state, the classification step is effectively useless. Object recognition and grasp planning must be developed together.

    Solution: infinite shape deformation simulation

    Physically collecting, photographing, and labeling crushed plastics at scale is impractical. Using DataClinic's data quality management tools, Pebblous generated a large, diverse set of deformation states, including crumpled, curved, and torn variations, in a virtual environment. This enables AI models to recognize objects across the full deformation range encountered in real-world operations.

    Several techniques from active research areas were applied to address the shape deformation problem.

    • Gaussian Splatting / Diffusion Models: Reconstruct the original 3D shape of a deformed object.

    • UOIS (Unseen Object Instance Segmentation): Segments object instances the model has never encountered during training.

    • Amodal Segmentation: At recycling sites, items are often partially covered by other objects. This technique infers the occluded regions to recover each object's complete shape.
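    The deformation-simulation idea can be sketched in miniature. Production pipelines use 3D physics simulation; the toy 2D "crush" transform below only illustrates how one clean shape can yield many labeled training variants. All parameters are illustrative assumptions.

```python
# Toy sketch of deformation simulation: "crush" a 2D bottle outline
# by squashing one axis and adding random dents. Real pipelines use
# 3D physics simulation; this only shows how one clean shape yields
# many labeled variants.
import math
import random

def crush(points, squash=0.4, dent=0.05, seed=0):
    rng = random.Random(seed)
    return [(x, y * squash + rng.uniform(-dent, dent))
            for x, y in points]

# A clean circular cross-section of a bottle.
circle = [(math.cos(t / 16 * 2 * math.pi), math.sin(t / 16 * 2 * math.pi))
          for t in range(16)]

# Each seed produces a new "crushed" variant with a known label.
variants = [crush(circle, seed=s) for s in range(100)]
print(len(variants))  # 100 deformation variants from one shape
```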


    Pebblous' other ideas for customers

    Pebblous operates as a data infrastructure group. The focus is not on individual data assets in isolation, but on the full infrastructure that governs how data is generated, managed, and operationalized.

    The deployment in this case was not a fully automated environment. Workers and AI shared the floor, each handling the tasks they perform most effectively. Dividing responsibilities between human operators and AI systems reduces individual workload and improves overall throughput.

    But what if we could go further, using AI systems to support workers more efficiently than ever before?

    This thinking led to the Projector Collaboration System, a concept that received formal recognition from one of Pebblous' enterprise clients.

    • A computer vision system and a projector are co-mounted above the conveyor belt. The projector overlays a light indicator directly onto the objects the AI has flagged as priority targets.

    • Rather than scanning every item on the belt, the operator's attention is directed precisely to where it is needed. The AI handles the filtering; the human handles the judgment.

    • This collaboration model improves sorting accuracy while reducing operator cognitive load.


    Manufacturing AI with healthy data: what were the effects of deployment?

    Here is how the outcomes of this deployment broke down across five dimensions.

    1. Improved recognition accuracy

    The contamination rate dropped from a range of 15-25% down to 5%. The system now accurately identifies crushed and contaminated materials that previously slipped through, recovering recyclable volume that would otherwise go to landfill or incineration. Improved handling of optically challenging materials also reduced overall misclassification rates.

    2. Maximized gripping success rate

    Previously, robots encountered crushed PET bottles and failed to determine a viable grip point, resulting in missed attempts and line stoppages. With training data covering a wide range of shape and texture variations, the AI now identifies the optimal grip method immediately and keeps the line moving.

    3. Reduced data deployment time and cost by 80%

    There is no longer a need to capture and label physical samples on-site over a period of months. DataClinic generates and automatically labels synthetic data in a virtual environment, producing high-quality, balanced datasets in a fraction of the time.

    While exact project figures remain confidential, the following is a conservative estimate based on comparable industry benchmarks. Combined, the factors below represent approximately $370,000 in annual savings for a facility at this scale.

    🪣

    Using a mid-sized sorting facility processing 500 tons per month as a baseline: reducing the contamination rate from 20% to 5% through synthetic data-augmented AI converts 75 tons per month from waste disposal to recyclable material. At a treatment cost of approximately $220 per ton, this represents roughly $200,000 in annual savings.

    💸

    If AI reallocates 10 of 30 workers across three shifts to higher-value tasks, the direct labor cost reduction represents an additional $180,000 per year.
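    As a sanity check, the savings arithmetic above can be reproduced directly from the article's assumed figures:

```python
# Back-of-the-envelope check of the savings estimate, using the
# article's assumed figures for a 500 ton/month facility.
tons_per_month = 500
contamination_before_pct, contamination_after_pct = 20, 5
treatment_cost_per_ton = 220  # USD

# 15% of monthly throughput shifts from disposal to recycling.
recovered_tons = tons_per_month * (contamination_before_pct - contamination_after_pct) // 100
disposal_savings = recovered_tons * treatment_cost_per_ton * 12  # annualized

labor_savings = 180_000  # stated estimate for reallocating 10 of 30 workers

print(recovered_tons)                    # 75 tons per month
print(disposal_savings)                  # 198000, i.e. roughly $200,000/year
print(disposal_savings + labor_savings)  # 378000, near the ~$370k figure
```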

    4. Sustainable operation

    Robotic systems take over the sorting tasks with the highest injury and exposure risk, removing workers from direct contact with odors, pollutants, and mechanical hazards. The facility, previously constrained to a three-shift human schedule, becomes capable of continuous 24/7 operation.

    5. Visualized ESG performance

    For organizations where ESG metrics factor into investor and regulatory reporting, the impact extends beyond operations.

    • Previously: Companies had no concrete data to substantiate environmental performance claims, regardless of how strong their actual recycling operations were.

    • Now: Operating on a data foundation means recycling rates, carbon savings, and waste diversion volumes can be quantified and reported with precision, giving organizations a credible, auditable basis for their circular economy claims.

    View the Data Quality Diagnostic Report for Recycling AI

    Case Study 2: What did DataClinic find in 1 million national industrial waste images?

    The second case examines a dataset distributed through AI Hub: Korea's largest industrial waste image dataset, comprising one million high-resolution images across 72 waste categories sourced from factories and industrial facilities.

    Despite its scale, a direct analysis of this dataset using Pebblous DataClinic surfaced several significant quality issues.

    Industrial Waste Image Dataset Collage - 72 different types of wastes, including metals, waste fibers, glass ceramics, synthetic resins, etc

    Problem 1: Class Imbalance Leads to Biased Recognition

    The most significant issue is severe class imbalance. The average class contains 11,257 images, but the distribution is extremely uneven. The smallest class has just 20 images, while the synthetic resin-vinyl class contains 79,560. That is a ratio of nearly 4,000 to 1 within the same dataset.

    Industrial Waste Image Dataset - Graph comparing the maximum, average, and minimum class sizes

    Severe imbalance produces a biased model. Classes with high sample counts are learned well. Classes with only 20 samples receive almost no meaningful training signal. When those underrepresented materials appear on the actual sorting floor, the AI either misclassifies them or fails to register them entirely.

    One million images total. But what drives model performance is not the aggregate count. It is how that data is distributed across classes. Balance matters more than volume.
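    While a dataset is being rebalanced, a common stopgap (not described as part of this project) is inverse-frequency loss weighting, computed here from the class counts reported above:

```python
# Inverse-frequency class weights from the article's reported counts:
# a standard stopgap so minority classes still contribute a
# meaningful training signal while the dataset is rebalanced.
counts = {"synthetic_resin_vinyl": 79_560,
          "average_class": 11_257,
          "smallest_class": 20}

total = sum(counts.values())
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

print(round(counts["synthetic_resin_vinyl"] / counts["smallest_class"]))
# 3978: the ~4,000-to-1 ratio cited in the text
print(round(weights["smallest_class"]))  # the rare class is heavily upweighted
```

    Weighting mitigates bias during training but cannot conjure diversity out of 20 images; it complements, rather than replaces, the synthetic augmentation described next.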

    Solution: Reinforcing Minority Classes and Refining Majority Classes

    Two interventions address this problem.

    1. Reinforce minority classes with synthetic data generation

    • Each minority class needs to be brought up to a level sufficient for the model to learn meaningful patterns. Based on DataClinic's analysis of this dataset, a minimum of 5,000 samples per class, roughly half the class average, is the appropriate threshold. At that level, adequate data diversity is achievable.

    • This threshold varies by case. Depending on the complexity of the class and the degree of intra-class variation, the appropriate minimum may be higher or lower than 5,000.

    2. Slim down majority classes by removing redundant data

    • Deduplicate and remove near-identical images from the dominant 79,560-image class. This significantly reduces raw data volume, but the model gains the ability to learn all classes at a comparable depth. The tradeoff clearly favors balance.
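    The two interventions can be summarized as a per-class plan: top minority classes up to the 5,000-sample floor with synthetic data, and trim the dominant class after deduplication. The cap value below is an illustrative assumption, not a DataClinic figure.

```python
# Sketch of the two interventions as one per-class plan. FLOOR is
# the article's 5,000-sample threshold; CAP is an illustrative
# assumption for trimming the dominant class after deduplication.
FLOOR, CAP = 5_000, 15_000

def rebalance_plan(class_counts):
    plan = {}
    for name, n in class_counts.items():
        if n < FLOOR:
            plan[name] = ("synthesize", FLOOR - n)   # top up with synthetic data
        elif n > CAP:
            plan[name] = ("deduplicate", n - CAP)    # remove near-identical images
        else:
            plan[name] = ("keep", 0)
    return plan

counts = {"smallest_class": 20,
          "average_class": 11_257,
          "synthetic_resin_vinyl": 79_560}

for name, action in rebalance_plan(counts).items():
    print(name, action)
```

    Applied to the reported counts, the smallest class needs 4,980 synthetic samples while the dominant class sheds tens of thousands of redundant images.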

    Problem 2: Materials Difficult for Humans to Distinguish Challenge AI Classification

    Consider the following:

    What material do you think the two objects below are made of?

    Glass Ceramics - Other
    Glass Ceramics - Window glass

    The second image is recognizable as glass. The first is considerably less obvious.

    Both belong to the same major category: glass ceramics, classified as "glass ceramics: other" and "glass ceramics: window glass," respectively.

    🫙

    Both items share the same material class, yet their visual appearance is dramatically different. A human with domain knowledge can bridge that gap with context. An AI model trained without sufficient visual diversity across a class cannot. A model that saw only typical glass samples will classify these as separate materials.

    The purpose of classification AI in recycling and manufacturing is to route identical materials to the same destination. When a model cannot reconcile visual dissimilarity with material identity, that routing logic breaks down.

    Solution: Redesigning Class Labels Based on Visual Characteristics

    This is solvable.

    Counterintuitively, the existence of visually distinct examples within the same material class is a training opportunity. A model improves its class understanding through exposure to diverse images, not through repetition of similar ones.

    The model faces the same challenge a reader did when confronted with that quiz. The solution is to decompose broad class labels into visually distinct sub-classes. Domain experts must define these boundaries and apply granular labels based on their knowledge of material characteristics.

    Go to the Quality Assistance Report for 1 Million Industrial Waste Data


    There are companies that are ready for manufacturing AI and companies that are not, and the gap between them will compound over the next five to ten years.

    Many organizations recognize the urgency and are moving to adopt manufacturing AI. Most are stalling at the same point: data quality.

    Working alongside Hyundai Motor, Samsung E&A, and LG Electronics on physical AI and manufacturing AI initiatives, one consistent truth has emerged: good manufacturing AI comes from good data. And good data for manufacturing AI means data that accurately reflects the complexity of real production environments.

    In March 2026, at Korea's Data Insight & Security Summit (DISS), Pebblous CEO Lee Ju-haeng shared the company's operational methodology and field experience in a keynote titled "Agentic Data Clinic Solution to Solve Data Bottlenecks in the Age of Physical AI."

    Among the 14 keynote presenters, Pebblous ranked first in both "most compelling solution to adopt" and "most requested for a consultation," a reflection of the company's track record solving these problems at the operational level.

    Ranked #1 Solution for Adoption & Consultation Agentic Data Clinic

    If you are working to build manufacturing AI capabilities that improve both operational performance and technical sophistication at your facility, Pebblous can develop a tailored solution based on the approach described in this analysis.
