🎯 Data Quality

Now that you’ve learned more about your data and cleaned it up, it’s time to make sure its quality is up to par. With these tools, you can determine whether your data is accurate, consistent, and reliable. High-quality data is essential for making informed decisions and for the effective operation of the systems and processes that rely on it, so maintaining it helps organizations avoid negative impacts on decision-making and business operations.

Cleanlab

Cleanlab is focused on data-centric AI (DCAI), providing algorithms and interfaces that help companies across all industries improve the quality of their datasets and diagnose and fix issues in them. The package automatically detects problems in an ML dataset, and it facilitates machine learning with messy, real-world data by flagging errors in your data and providing clean labels for robust training.
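To make the idea of flagging label errors concrete, here is a minimal sketch in plain Python. It does not call Cleanlab and is not Cleanlab's algorithm; it is a simplified stand-in that assumes you already have a model's predicted class probabilities for each example. The function name `flag_label_issues`, the `threshold` parameter, and the toy data are all hypothetical.

```python
# Simplified stand-in for label-issue detection.
# NOT Cleanlab's API or its full method; all names here are hypothetical.

def flag_label_issues(labels, pred_probs, threshold=0.5):
    """Return indices of suspicious examples, most suspicious first.

    An example is flagged when the model's top prediction disagrees with
    the given label AND the probability assigned to the given label
    (its "self-confidence") is below `threshold`.
    """
    scored = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        self_confidence = probs[label]
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label and self_confidence < threshold:
            scored.append((self_confidence, i))
    # Lowest self-confidence first.
    return [i for _, i in sorted(scored)]


# Toy data: four examples, two classes (assumed values for illustration).
labels = [0, 1, 1, 0]
pred_probs = [
    [0.90, 0.10],  # labeled 0, model agrees
    [0.20, 0.80],  # labeled 1, model agrees
    [0.95, 0.05],  # labeled 1, model strongly predicts 0
    [0.30, 0.70],  # labeled 0, model leans toward 1
]

suspects = flag_label_issues(labels, pred_probs)
print(suspects)
```

The flagged indices are candidates for manual review rather than confirmed errors; a dedicated package would estimate thresholds from the data instead of relying on a fixed cutoff.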

Cleanlab’s Chief Scientist & Co-Founder, Jonas Mueller, will present more about the tool at ODSC East coming this May, in a session called “Improving ML Datasets with Cleanlab, a Standard Framework for Data-Centric AI.”

Great Expectations

Great Expectations (GX) helps data teams build a shared understanding of their data through quality testing, documentation, and profiling. With Great Expectations, data teams can express what they “expect” from their data using simple assertions. Great Expectations supports different data backends, such as flat file formats, SQL databases, pandas DataFrames, and Spark, and comes with built-in notification and data documentation functionality.
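The idea of expressing expectations as simple assertions can be sketched in plain Python. This is not the Great Expectations API, which differs between releases; the helper functions, column names, and sample rows below are hypothetical and only illustrate the pattern of declaring a check and getting a structured result back.

```python
# Hypothetical expectation-style checks over a list of row dictionaries.
# NOT the Great Expectations API; an illustration of the pattern only.

def expect_values_not_null(rows, column):
    """Check that `column` is present and not None in every row."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not failed,
        "failed_rows": failed,
    }


def expect_values_between(rows, column, min_value, max_value):
    """Check that non-null values of `column` lie in [min_value, max_value]."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None
        and not (min_value <= row[column] <= max_value)
    ]
    return {
        "expectation": f"{column} between {min_value} and {max_value}",
        "success": not failed,
        "failed_rows": failed,
    }


# Assumed sample data for illustration.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 212},
]

results = [
    expect_values_not_null(rows, "user_id"),
    expect_values_not_null(rows, "age"),
    expect_values_between(rows, "age", 0, 120),
]

for result in results:
    status = "PASS" if result["success"] else "FAIL"
    print(status, result["expectation"], result["failed_rows"])
```

Returning a result record instead of raising on the first failure is a design choice: it lets a pipeline collect every failed check in one pass and feed them into documentation or notifications.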

Sam Bail, technical lead at Superconductive (the core maintainer of Great Expectations), delivered a talk about building a robust data pipeline during ODSC East 2021. You can watch it on demand here.

VisiData

VisiData is a free, open-source tool that lets you quickly open, explore, summarize, and analyze datasets in your computer’s terminal. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python into a lightweight utility that can handle millions of rows with ease.
