4 October 2024

Perfect Vs. Good: Finding a balance between data maturity and AI adoption

Rob Ven Den Bergh

‘Our data just isn’t ready for AI yet. Maybe in a year or two.’

We speak to a lot of data people about how AI tooling helps unpick the problem of generating and providing company-wide data insights, and they usually offer some version of the above. Their biggest worry is their data simply isn’t mature enough for enhancement. 

Questioning readiness for AI tooling is understandable. Some data teams see their terminology and warehouse structure as too nuanced or inconsistent, and some opt to try to build a solution themselves - as difficult as that can be (learn more about in-house building here). 

It’s natural to be cautious, but a message to ‘data doubters’: Your organisation and your data are more ready than you think.

Data maturity can be calculated in a few ways. Still, convention breaks it down into the following stages:


  • Data Exploration: The organisation collects data but primarily uses it in a very ad-hoc, exploratory manner with limited structured analysis.

  • Data-Informed: Data is systematically analysed to support decision-making, though it is often used alongside intuition or past experience.

  • Data-Driven: Decisions are consistently made based on data insights, with a focus on measurable outcomes and data as the primary guiding force.

  • Data-Transformed: Data insights are fully automated and integrated into most business processes, enabling real-time decision-making and continuous innovation driven with advanced analytics.


Anecdotally speaking, only 1% or 2% of companies reach that final stage of data maturity. Within a ‘data-transformed’ org, generating insights is automated at this point, and the impact of data on strategic goals is closely tied and easily demonstrated.

The tiny proportion who manage to get there demonstrates how difficult it is. 

‘Data maturity’ can refer to everything from processes or governance to data quality and utilisation. Data teams often struggle to achieve the highest impact stuff; for example, enabling self-service insight amongst the non-technical team.

What does ‘mature data’ look like? 

Some of the features of mature data could be seen as the ‘bread and butter’ of a well-oiled data team, things you’ve already implemented and understand. 

  1. A centralised Data Warehouse or data lake and a clear ETL process (Extract, transform, load) to ensure new data is clean and updated regularly.


  2. Planned measures to maintain data quality and governance, like regular profiling and cleansing, to maintain accuracy. Some clear guidelines on data ownership and compliance (e.g., GDPR), and Metadata management should be in place, too.


  3. Using cloud services as part of a scalable infrastructure and secure data-sharing processes.


  4. An established data team with some AI and ML understanding and wider company alignment around producing more data-driven insights.


  5. A clear understanding of responsibility towards ethical data management and risk mitigation during third-party/AI tooling adoption. 

 

Mature data functions use SQL with really clear comments, CTEs (common table expressions), version control (git), plus explicit conventions and joins throughout to model their data. These additions make it much easier for LLMs to decipher relevant intent behind queries and map natural language into code more effectively. Something like this:

-- Purpose: Calculate monthly sales metrics by region
-- Author: Data Analytics Team
-- First CTE: Aggregate monthly sales data by region
WITH monthly_sales AS (
    SELECT
        -- Truncate the order date to the first day of the month for grouping
        DATE_TRUNC('month', o.order_date) AS sales_month,
        -- Get the region name from the regions table
        r.region_name,
        -- Calculate total revenue by summing the product of quantity
          and unit price
        SUM(oi.quantity * oi.unit_price) AS total_revenue,
        -- Count the total number of unique orders
        COUNT(DISTINCT o.order_id) AS total_orders,
        -- Count the total number of unique customers
        COUNT(DISTINCT o.customer_id) AS unique_customers
    FROM
        -- Orders table alias as 'o'
        orders o
        -- Join with order_items table to access quantity and unit price
        INNER JOIN order_items oi ON o.order_id = oi.order_id
        -- Join with customers table to get customer details
        INNER JOIN customers c ON o.customer_id = c.customer_id
        -- Join with regions table to get region information
        INNER JOIN regions r ON c.region_id = r.region_id
    WHERE
        -- Filter orders from January 1, 2023, to the current date
        o.order_date BETWEEN DATE '2023-01-01' AND CURRENT_DATE
        -- Include only orders that have been completed
        AND o.order_status = 'Completed'
    GROUP BY
        -- Group results by month and region
        sales_month,
        r.region_name
),
-- Second CTE: Calculate additional sales metrics
sales_metrics AS (
    SELECT
        -- Bring over grouped data from monthly_sales CTE
        sales_month,
        region_name,
        total_revenue,
        total_orders,
        unique_customers,
        -- Calculate average order value
        total_revenue / total_orders AS avg_order_value,
        -- Calculate revenue per customer
        total_revenue / unique_customers AS revenue_per_customer
    FROM
        monthly_sales
)
-- Final query: Select and format the desired metrics
SELECT
    sales_month,
    region_name,
    total_revenue,
    total_orders,
    unique_customers,
    -- Round average order value to two decimal places
    ROUND(avg_order_value, 2) AS avg_order_value,
    -- Round revenue per customer to two decimal places
    ROUND(revenue_per_customer, 2) AS revenue_per_customer
FROM
    sales_metrics
ORDER BY
    -- Order results by month and region for readability
    sales_month,
    region_name;

Conversely, less mature data might lack the same clarity, missing aggregation, formatting or readability. Joins might be more ambiguous, and integers like dates might be stored poorly, and require manual updates. 

Below is a particularly unpleasant example of data that is difficult for humans and AI alike, lacking almost any relational structure or adaptive syntax to be understood in context: 

SELECT a.id, b.name, c.value, d.date, e.status,
f.description, g.quantity, h.price, i.total FROM table1 a, table2 b, 
table3 c, table4 d, table5 e, table6 f, table7 g,
table8 h, table9
WHERE a.id = b.id AND b.id = c.id AND c.id = d.id AND d.id = e.id
AND e.id = f.id AND f.id 
= g.id AND g.id = h.id AND h.id = i.id AND a.type = 'X' 
AND b.category = 'Y' AND c.region = 'Z' AND d.date > '2021-01-01' 
AND e.status = 'Active' AND f.flag = 1 AND g.quantity > 100 
AND h.price < 500 
AND i.total BETWEEN 1000 AND 5000
ORDER BY d.date DESC, h.price ASC;

If all your structured data doesn’t look like the first example, remove your head from your hands. All is not lost. 


Don’t let perfect be the enemy of good 

Your data warehouse and practices may still evolve, but there are several ways to enhance data using AI without achieving 100% uniformity across your data. 

Incremental AI Adoption | Start small and scale up as data maturity improves. You can focus on specific business areas where data quality is good and introduce LLMs here first. Examples include customer support logs, sales data, or inventory management. 

Leverage AI to Enhance Data Maturity | Use cloud-based AI tools to do the hard work. It isn’t just for analysis; some tools will allow you to create clean datasets and simplify their management - enabling complex tasks like natural language querying on additional, new datasets over time.

Adopt User-Friendly Tools | Implement AI tools that translate natural language queries into SQL, rather than plugging away, trying to upskill non-technical people on traditional BI platforms they don’t engage with. 

Implement Feedback Loops | AI tools can identify areas where data quality needs enhancement and adjust data collection and management processes accordingly. Easy-to-use tools also mean more useful feedback from business users, who can actually manipulate their own data without supervision. 

Risk Management | Understand the limitations imposed by your less mature data. Then, ensure you have contingency plans to mitigate these risks. Any third-party applications should have relevant data protections of their own. We made sure Fluent was SOC2 compliant for this very reason. 

Accept Imperfection | Perfect data is rare; aiming for “good enough” while striving for continuous improvement is the key to introducing AI into an organisation. 

With this approach in place, a business can experience AI’s compounding benefits, such as carrying the burden of lower-value tasks like ad-hoc querying, improving data cleanliness, and enriching decision-making. 

© 2024 Artickl Ltd. All rights reserved.

© 2024 Artickl Ltd. All rights reserved.

© 2024 Artickl Ltd. All rights reserved.