Developing AI algorithms is not a game. It requires a special set of skills and expertise to make the models work and deliver business benefits. A mistake made while training AI models not only makes them perform inaccurately but can also prove disastrous when their output informs business decisions.

Most of the work in artificial intelligence is about getting your data sets right. Without this, your AI model ends up being a case of garbage in, garbage out. But what is the right data set, and how can you avoid the pitfalls of a flawed AI model? ETCIO brings you the three biggest mistakes that people make while training their AI models and how you can avoid falling into these traps.

1. Not knowing about all the data:

Almost every organization today is dealing with the ‘data’ problem. Having more or less data is not the big concern here. The real problem lies in not knowing about all of the organization’s data. Enterprises cannot manage what they don’t know they have.

“If you’re looking for positive business outcomes, you need to know all your organizational data – in the cloud, on-premises, stored virtually, on mobile devices and everywhere in between,” said Pradeep Seshadri, Director – Sales Engineering, India & SAARC, Commvault.

“Most of the organizations we work with only have a partial view of their data, which exposes them to data risks, leading to poor business efficiencies and unmet business goals. With a unified view of organization’s data, analytics and AI tools can easily search, access, leverage relevant data to forecast business risks and accelerate business outcomes. With data sitting out in business units and not under central management – both public and private organizations are missing opportunities to add value, reduce cost, manage risk and simply run better,” Seshadri explained.

He suggested that, to truly unlock the potential in their data and business, companies should stop worrying about whether they have enough data and focus instead on whether they know all of their data. They should also classify it for efficient management, because not all data is important at all times. A strong data management strategy is the cornerstone of success in today’s data-driven era.

2. Having dirty datasets:

When developing an AI-powered solution, model insights and analysis are only as good as the data being used. Most of the time, the raw data used as initial input comes from heterogeneous sources and is “dirty”, i.e. the set may contain inaccuracies, missing data, miscoding and other issues that weaken the solution. One of the biggest challenges in AI is to discover and repair dirty data; failure to do this can lead to inaccurate analytics and unreliable conclusions. Essentially, garbage data in means garbage analysis out.

“AI models are famous for proxying features via multiple factors, like postal code, height, etc. When the data input is biased, the AI model will find a way to replicate the bias in the outcomes, even if the bias isn’t explicitly included in the variables of the model. Hence, false conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-making,” said Saurabh Kumar, Partner, Deloitte India.
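
The proxy effect described above can be checked for before training. The following is a minimal Python sketch, not a definitive method: it asks how well a sensitive attribute can be guessed from a single feature, compared with always guessing the most common value. The function name `proxy_strength`, the field names `postal_code` and `group`, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def proxy_strength(rows, feature, sensitive):
    """Return (baseline, accuracy) for guessing `sensitive` from `feature`.

    baseline: accuracy of always guessing the most common sensitive value.
    accuracy: accuracy of guessing the most common sensitive value seen
              for each distinct feature value.
    A large gap between the two suggests the feature acts as a proxy.
    """
    overall = Counter(row[sensitive] for row in rows)
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[sensitive]] += 1

    total = len(rows)
    baseline = overall.most_common(1)[0][1] / total
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / total


# Hypothetical toy data: postal code lines up perfectly with group.
rows = [
    {"postal_code": "110001", "group": "A"},
    {"postal_code": "110001", "group": "A"},
    {"postal_code": "560001", "group": "B"},
    {"postal_code": "560001", "group": "B"},
]
baseline, accuracy = proxy_strength(rows, "postal_code", "group")
print(f"baseline={baseline:.2f}, from postal_code={accuracy:.2f}")
```

In this toy case the postal code recovers the group perfectly, so dropping the `group` column alone would not remove the bias from the model’s inputs. Real data would need a held-out split and a proper statistical test; this sketch only illustrates the idea.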

Thus, making sure that the data used for analytics and for training AI is free from errors, bias, and other defects is necessary for low-risk operation. The practice of data cleaning with AI is now emerging as an effective way of eliminating bad data and ensuring that all data is usable by and among other tools and technologies.

“Organizations need to develop a robust framework to measure and monitor the data being used for AI models. They need to spend considerable time doing Exploratory Data Analysis (EDA) on the data to understand if the data has any biases or omissions. When making the data pipelines for ML models, it is a good practice to have audit checks and reports designed at various points to understand and better monitor the quality of the data flowing into the model. Companies should look towards developing AI enabled solutions, that are transparent, meaning the outcome of an AI model can be properly explained and communicated. Transparent AI is explainable AI,” Kumar added.
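
The audit checks mentioned above could look something like the following Python sketch, which runs the same check at two points in a pipeline and reports the share of missing values per field. This is an illustration under stated assumptions, not a prescribed framework: the `audit` function, the 5% threshold, the field names and the sample records are all hypothetical.

```python
def audit(stage, rows, required_fields, max_missing_rate=0.05):
    """Summarize missing values at one pipeline stage and flag failures."""
    total = len(rows)
    report = {"stage": stage, "rows": total, "missing_rate": {}, "passed": True}
    for field in required_fields:
        # Treat None and empty strings as missing (an assumption of this sketch).
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        rate = missing / total if total else 1.0
        report["missing_rate"][field] = rate
        if rate > max_missing_rate:
            report["passed"] = False
    return report


# Hypothetical raw input with two incomplete records.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": ""},
    {"age": 41, "income": 61000},
]
raw_report = audit("raw", raw, ["age", "income"])

# Hypothetical cleaning step: drop rows with any missing required field.
cleaned = [
    r for r in raw
    if r["age"] not in (None, "") and r["income"] not in (None, "")
]
clean_report = audit("cleaned", cleaned, ["age", "income"])

for report in (raw_report, clean_report):
    print(report)
```

Comparing the two reports shows both the quality gain and the cost of the cleaning step: the missing values are gone, but so is half of the data, which is itself worth flagging before the model is trained.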

3. Not having variety in datasets:

Algorithms learn from data. They develop understanding, make decisions and evaluate their confidence based on the training data they’re given. The better the training data, the better the AI performs, and both the quality and the quantity of the datasets affect that performance. That’s why it’s important to have larger datasets with as much variety as possible: variety exposes the AI to more edge cases and, in turn, helps it perform better.
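
One simple way to check variety before training is to measure how much of the dataset each category accounts for. The Python sketch below flags categories that fall under a minimum share. It is only an illustration: the `underrepresented` function, the 10% threshold and the sample labels are hypothetical, and a sensible threshold depends on the problem.

```python
from collections import Counter


def underrepresented(labels, min_share=0.10):
    """Return {label: share} for labels below `min_share` of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: n / total
        for label, n in counts.items()
        if n / total < min_share
    }


# Hypothetical labels for an image dataset dominated by one condition.
labels = ["day"] * 90 + ["night"] * 8 + ["fog"] * 2
flagged = underrepresented(labels)
print(flagged)
```

Here the `night` and `fog` categories are flagged, which suggests collecting more examples of those conditions before expecting the model to handle them well.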

“Not having enough variety of data, however, could result in bias, which could have serious consequences on the problem the AI is trying to solve. Take, for example, what could happen if the judicial system used AI to determine and assign the sentencing periods for people who are convicted of crimes. If there’s bias in the AI, this could lead to patterns of long sentences for specific racial groups,” said Sachin Dev Duggal, Co-Founder & CEO, Builder.ai.

“In order to solve a problem with AI, you need to identify the right modality of data to solve that specific problem. For instance, in one of our products where we apply computer vision to detect discrepancies in UI screens, named as visual QA, we use images only. But in other applications, we may use a combination of numeric and textual data. So the need for variety in data isn’t going to help AI, per se, but instead will be critical to solve the problem at hand,” he added.




