
A Coding Guide to Build a Functional Data Analysis Workflow Using Lilac for Transforming, Filtering, and Exporting Structured Insights

In this tutorial, we build a fully functional and modular data analysis pipeline on top of the Lilac library, without relying on Lilac's built-in signals. It combines Lilac's dataset management capabilities with Python's functional programming idioms to create a clean, extensible workflow. From setting up a project and generating realistic sample data to extracting insights and exporting filtered outputs, the tutorial emphasizes reusable, testable code. Core functional utilities such as pipe, map_over, and filter_by build a declarative flow, while Pandas handles the detailed transformations and quality analysis.

!pip install lilac[all] pandas numpy

To get started, we install the required libraries using the command !pip install lilac[all] pandas numpy. This ensures we have the full Lilac suite alongside Pandas and NumPy for smooth data handling and analysis. We should run this in our notebook before proceeding.
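
If the notebook environment already had older builds cached, a quick check with Python's standard importlib.metadata module confirms what actually got installed (the exact version numbers will vary):

from importlib.metadata import version

# Report the resolved versions of the packages we just installed
for pkg in ("lilac", "pandas", "numpy"):
    print(pkg, version(pkg))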

import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional, Callable
from functools import reduce, partial
import lilac as ll

We import all the essential libraries. These include json and uuid for handling data and generating unique project names, pandas for working with data in tabular form, and Path from pathlib for managing directories. We also introduce type hints for improved function clarity and functools for functional composition patterns. Finally, we import the core Lilac library as ll to manage our datasets.

def pipe(*functions):
    """Compose functions left to right (pipe operator)"""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)


def map_over(func, iterable):
    """Functional map wrapper"""
    return list(map(func, iterable))


def filter_by(predicate, iterable):
    """Functional filter wrapper"""
    return list(filter(predicate, iterable))


def create_sample_data() -> List[Dict[str, Any]]:
    """Generate realistic sample data for analysis"""
    return [
        {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
        {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
        {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
        {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
        {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
        {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
        {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
        {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
    ]

In this section, we define reusable functional utilities. The pipe function helps us chain transformations clearly, while map_over and filter_by allow us to transform or filter iterable data functionally. Then, we create a sample dataset that mimics real-world records, featuring fields such as text, category, score, and tokens, which we will later use to demonstrate Lilac’s data curation capabilities.
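
To see how these helpers fit together before bringing Lilac into the picture, here is a minimal sketch (not part of the pipeline itself) that chains filter_by, map_over, and a final reduction through pipe on the sample records:

sample = create_sample_data()

# Keep tech records, pull out their scores, then average them; pipe runs left to right
avg_tech_score = pipe(
    lambda rows: filter_by(lambda r: r["category"] == "tech", rows),
    lambda rows: map_over(lambda r: r["score"], rows),
    lambda scores: sum(scores) / len(scores),
)(sample)

print(f"Average score across the 8 tech records: {avg_tech_score:.3f}")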

def setup_lilac_project(project_name: str) -> str:
    """Initialize Lilac project directory"""
    project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
    Path(project_dir).mkdir(exist_ok=True)
    ll.set_project_dir(project_dir)
    return project_dir


def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
    """Create Lilac dataset from data"""
    data_file = f"{name}.jsonl"
    with open(data_file, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

    config = ll.DatasetConfig(
        namespace="tutorial",
        name=name,
        source=ll.sources.JSONSource(filepaths=[data_file])
    )

    return ll.create_dataset(config)

With the setup_lilac_project function, we initialize a unique working directory for our Lilac project and register it using Lilac’s API. Using create_dataset_from_data, we convert our raw list of dictionaries into a .jsonl file and create a Lilac dataset by defining its configuration. This prepares the data for clean and structured analysis.
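
These two helpers can also be exercised on their own before running the full pipeline; the project and dataset names below are arbitrary examples, and each call to setup_lilac_project creates a fresh UUID-suffixed directory:

# Stand-alone usage sketch (names are illustrative)
demo_dir = setup_lilac_project("lilac_demo")
demo_dataset = create_dataset_from_data("demo_data", create_sample_data())
print(f"Project directory: {demo_dir}")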

def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
    """Extract data as pandas DataFrame"""
    return dataset.to_pandas(fields)


def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Apply various filters and return multiple filtered versions"""

    filters = {
        'high_score': lambda df: df[df['score'] >= 0.8],
        'tech_category': lambda df: df[df['category'] == 'tech'],
        'min_tokens': lambda df: df[df['tokens'] >= 4],
        'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
        'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
    }

    return {name: filter_func(df.copy()) for name, filter_func in filters.items()}

We extract the dataset into a Pandas DataFrame using extract_dataframe, which allows us to work with selected fields in a familiar format. Then, using apply_functional_filters, we define and apply a set of logical filters, such as high-score selection, category-based filtering, token count constraints, duplicate removal, and composite quality conditions, to generate multiple filtered views of the data.
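
Because every filter receives its own df.copy(), the resulting views are independent of one another. As a quick check, the filters can be run directly on a DataFrame built from the sample records; the counts in the comment are what we expect for those ten rows:

demo_df = pd.DataFrame(create_sample_data())
views = apply_functional_filters(demo_df)

# Expected sizes for the ten sample records:
# high_score=7, tech_category=8, min_tokens=6, no_duplicates=9, combined_quality=7
for name, view in views.items():
    print(f"{name}: {len(view)} rows")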

def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Analyze data quality metrics"""
    return {
        'total_records': len(df),
        'unique_texts': df['text'].nunique(),
        'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
        'avg_score': df['score'].mean(),
        'category_distribution': df['category'].value_counts().to_dict(),
        'score_distribution': {
            'high': len(df[df['score'] >= 0.8]),
            'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
            'low': len(df[df['score'] < 0.6])
        },
        'token_stats': {
            'mean': df['tokens'].mean(),
            'min': df['tokens'].min(),
            'max': df['tokens'].max()
        }
    }


def create_data_transformations() -> Dict[str, Callable]:
    """Create various data transformation functions"""
    return {
        'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
        'add_length_category': lambda df: df.assign(
            length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
        ),
        'add_quality_tier': lambda df: df.assign(
            quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
        ),
        'add_category_rank': lambda df: df.assign(
            category_rank=df.groupby('category')['score'].rank(ascending=False)
        )
    }

To evaluate the dataset quality, we use analyze_data_quality, which helps us measure key metrics like total and unique records, duplicate rates, category breakdowns, and score/token distributions. This gives us a clear picture of the dataset’s readiness and reliability. We also define transformation functions using create_data_transformations, enabling enhancements such as score normalization, token-length categorization, quality tier assignment, and intra-category ranking.
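
One detail worth spelling out: pd.cut uses right-closed bins by default, so with bins=[0, 3, 5, inf] a 3-token text lands in 'short' while 4- and 5-token texts land in 'medium'; likewise, a score of exactly 0.8 falls in the 'medium' quality tier even though the score_distribution above counts it as 'high'. A short sketch against the sample data makes this concrete:

demo_df = pd.DataFrame(create_sample_data())

report = analyze_data_quality(demo_df)
print(f"Duplicate rate: {report['duplicate_rate']:.0%}")  # 10%: one repeated text in ten records
print(f"Average score: {report['avg_score']:.3f}")        # 0.815 for the sample data

enriched = create_data_transformations()['add_length_category'](demo_df)
print(enriched[['tokens', 'length_cat']].drop_duplicates().sort_values('tokens'))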

def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
    """Apply selected transformations"""
    transformations = create_data_transformations()
    selected_transforms = [transformations[name] for name in transform_names if name in transformations]

    return pipe(*selected_transforms)(df.copy()) if selected_transforms else df


def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
    """Export filtered datasets to files"""
    Path(output_dir).mkdir(exist_ok=True)

    for name, df in filtered_datasets.items():
        output_file = Path(output_dir) / f"{name}_filtered.jsonl"
        with open(output_file, 'w') as f:
            for _, row in df.iterrows():
                f.write(json.dumps(row.to_dict()) + '\n')
        print(f"Exported {len(df)} records to {output_file}")

Then, through apply_transformations, we selectively apply the needed transformations in a functional chain, ensuring our data is enriched and structured. Once filtered, we use export_filtered_data to write each dataset variant into a separate .jsonl file. This enables us to store subsets, such as high-quality entries or non-duplicate records, in an organized format for downstream use.
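
Since apply_transformations composes the selected functions with pipe, they run in the order listed, and any name that is not in the transformation registry is quietly skipped rather than raising an error. A small sketch (the bogus name below is deliberate):

demo_df = pd.DataFrame(create_sample_data())

# 'not_a_transform' is filtered out before composition, so only two transforms run
enriched = apply_transformations(demo_df, ['normalize_scores', 'add_quality_tier', 'not_a_transform'])
print(enriched.columns.tolist())  # original columns plus norm_score and quality_tier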

def main_analysis_pipeline():
    """Main analysis pipeline demonstrating functional approach"""

    print("Setting up Lilac project...")
    project_dir = setup_lilac_project("advanced_tutorial")

    print("Creating sample dataset...")
    sample_data = create_sample_data()
    dataset = create_dataset_from_data("sample_data", sample_data)

    print("Extracting data...")
    df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])

    print("Analyzing data quality...")
    quality_report = analyze_data_quality(df)
    print(f"Original data: {quality_report['total_records']} records")
    print(f"Duplicates: {quality_report['duplicate_rate']:.1%}")
    print(f"Average score: {quality_report['avg_score']:.2f}")

    print("Applying transformations...")
    transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])

    print("Applying filters...")
    filtered_datasets = apply_functional_filters(transformed_df)

    print("\nFilter Results:")
    for name, filtered_df in filtered_datasets.items():
        print(f"  {name}: {len(filtered_df)} records")

    print("Exporting filtered datasets...")
    export_filtered_data(filtered_datasets, f"{project_dir}/exports")

    print("\nTop Quality Records:")
    best_quality = filtered_datasets['combined_quality'].head(3)
    for _, row in best_quality.iterrows():
        print(f"  • {row['text']} (score: {row['score']}, category: {row['category']})")

    return {
        'original_data': df,
        'transformed_data': transformed_df,
        'filtered_data': filtered_datasets,
        'quality_report': quality_report
    }


if __name__ == "__main__":
    results = main_analysis_pipeline()
    print("\nAnalysis complete! Check the exports folder for filtered datasets.")

Finally, in the main_analysis_pipeline, we execute the full workflow, from setup to data export, showcasing how Lilac, combined with functional programming, allows us to build modular, scalable, and expressive pipelines. We even print out the top-quality entries as a quick snapshot. This function represents our full data curation loop, powered by Lilac.
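
When working in a notebook rather than running the file as a script, the same entry point can be called directly, and the returned dictionary keeps every intermediate artifact around for further inspection, for example:

results = main_analysis_pipeline()

# Drill into the returned artifacts after the run
print(results['quality_report']['category_distribution'])
print(results['filtered_data']['high_score'][['id', 'text', 'score']].head())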

In conclusion, we now have a hands-on, reproducible data pipeline that leverages Lilac's dataset abstractions and functional programming patterns for scalable, clean analysis. The pipeline covers every critical stage, including dataset creation, transformation, filtering, quality analysis, and export, and offers flexibility for both experimentation and deployment. It also shows how to embed meaningful metadata such as normalized scores, quality tiers, and length categories, which can be instrumental in downstream tasks like modeling or human review.




