Dataiku Basics

What’s Dataiku?

Dataiku is a collaborative, end-to-end data science platform. It was founded in 2013.

Their goal is to democratize access to AI, especially in the enterprise. They provide a platform that handles AI end-to-end: from the dataset and data cleaning, through algorithm tuning and feature selection, to the creation of dashboards and reports.

Dataiku can act as an orchestrator of data connections, since it can connect to many infrastructures (external databases, Hadoop, Kubernetes, Amazon, …).

Versioning, scalability and security are also core concerns.

Dataiku can be used by both “coder” and “clicker” (visual) user profiles. The platform is useful across many roles: data scientists, analysts, data architects / engineers, data team managers, data executives, CTOs / CIOs / IT managers, ….

Getting started

You have to download Data Science Studio (DSS) and get it running; it is then served at http://localhost:11200.

By activating your DSS license, you get a login and a password. By default, it’s admin / admin.

In your DSS, you can have DSS projects, Applications, Dashboards and Wikis. All of them can be shared with other users.

Create a Project

A DSS Project is backed by a regular Git repository, and visually kinda looks like a GitHub / GitLab project, plus recent activity and a to-do checklist. DSS Projects can be put into folders, and be shared for collaboration.

You can upload your dataset as a file (CSV / XLSX) or pull it from external sources like SQL databases, Hadoop clusters, Amazon S3 buckets, HDFS, Hive, Google Sheets, …

If it comes from external storage, DSS won’t copy the dataset; it creates a view over a sample of the data and remembers the dataset’s location and connection information. It acts as an orchestrator of external dataset connections.

For each dataset, you can visualize the raw data and get charts and statistics. In the History tab, you can see the list of commits since the dataset’s creation. In Settings, you have information about the source of your input data (whether a file, an Amazon S3 connection, etc.).

Pro-tip: you may want to set up naming conventions for project creation.

Partitioning

If new samples are added to the input dataset, they are treated separately from the older data (for instance when the partitioning is based on dates), which is left untouched. So we don’t have to recompute 100% of the dataset when new samples come in.

The partition split can be activated and customized in the Settings tab of your dataset. Instead of splitting by date, you can split by country, by day and country, etc.
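To make the idea concrete, here is a conceptual sketch in plain pandas (not the DSS API); the column names are hypothetical:

```python
import pandas as pd

# Existing dataset, partitioned by date
orders = pd.DataFrame({
    "order_date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "amount": [10.0, 25.0, 40.0],
})
partitions = {day: group for day, group in orders.groupby("order_date")}

# New samples arrive for 2021-03-02: only that partition is rebuilt,
# the 2021-03-01 partition is left untouched
new_rows = pd.DataFrame({"order_date": ["2021-03-02"], "amount": [15.0]})
partitions["2021-03-02"] = pd.concat(
    [partitions["2021-03-02"], new_rows], ignore_index=True
)
```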

Explore & Analyze

In the Explore tab, we can view our raw data, and use Analyze on a particular column to run some basic statistics on it.
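As a rough analogy, the kind of summary Analyze produces for a numeric column looks like what pandas gives you (the "amount" column here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 25.0, 40.0, 15.0, None]})

print(df["amount"].describe())    # count, mean, std, min, quartiles, max
print(df["amount"].isna().sum())  # number of missing values
print(df["amount"].nunique())     # number of distinct values
```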

Statistical Worksheet

Available in the Statistics tab. A Worksheet provides a summary of Exploratory Data Analysis (EDA) tasks.

You can have multiple Worksheets, each one holding multiple Cards.

Card types

A Card applies a statistical analysis task. For each task, you can specify which variables to describe.

Recipes

Recipes are steps of processing logic that perform data transformations. You can see them as functions that take a dataset as input and return a dataset.

Usually, recipes are used to clean and preprocess data for further analysis.

You can use Recipes in the ACTIONS tab, at the upper right of your Project.

There are three types of Recipes:

Visual Recipes

Visual recipes accomplish the most common data operations:

  • Group – similar to GROUP BY in SQL queries or .groupby() in pandas DataFrames. Group uses a column’s unique values as keys, and aggregates all the values for each key with custom aggregation methods (sum, count, …) – see the pandas sketch after this list,
  • Join two datasets together, using the same methods as in SQL: left join, inner join, outer join, right join, cross join, …
  • Filter,
  • Split,
  • Top N, Sort, …

Instead of coding these yourself, you configure them through a visual interface.
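For intuition, here is roughly what the Group and Join recipes correspond to in pandas; the datasets and columns below are made up for illustration:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 25.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["FR", "US"]})

# Group recipe: one key column, several aggregation methods
per_customer = orders.groupby("customer_id").agg(
    total=("amount", "sum"),
    n_orders=("amount", "count"),
).reset_index()

# Join recipe: a left join, as in SQL
joined = per_customer.merge(customers, on="customer_id", how="left")
print(joined)
```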

Code Recipes

You can use raw code, written in Python, R, SQL, Shell, or other languages to define your transformations.
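For example, a minimal Python code recipe might look like the sketch below. It uses the dataiku package that is available inside DSS; the dataset names match the orders / orders_prepared example later in this post, and the columns are hypothetical:

```python
import dataiku

# Read the input dataset into a pandas DataFrame
orders = dataiku.Dataset("orders")
df = orders.get_dataframe()

# Transformation step (hypothetical columns)
df["order_total"] = df["unit_price"] * df["quantity"]

# Write the result to the output dataset, inferring its schema
orders_prepared = dataiku.Dataset("orders_prepared")
orders_prepared.write_with_schema(df)
```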

Plugin Recipes

A Plugin Recipe wraps a visual interface around an external tool – for example, the Geocoding plugin recipe can take an address as input and return a GPS location.

There are also plugins for Twitter tools, SQL updates, …

On the Flow, we can see our first dataset untouched, the recipe containing the actions/scripts to transform orders, and the output after transformation: orders_prepared.

Formulas

Formulas allow you to create new columns using one or multiple existing columns. The Formula language is similar to Google Sheets or Excel formulas, and kinda looks Pythonic.
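For example, a formula like if(quantity > 10, unit_price * 0.9, unit_price) * quantity (hypothetical columns, assuming the usual if(condition, then, else) formula function) would be roughly equivalent to this pandas expression:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"quantity": [2, 15], "unit_price": [10.0, 10.0]})

# New column computed from two existing columns, with a 10% discount
# applied on orders of more than 10 units
df["order_total"] = np.where(
    df["quantity"] > 10, df["unit_price"] * 0.9, df["unit_price"]
) * df["quantity"]
```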


Gamification

Dataiku even has achievements that you unlock depending on your actions on the platform. So modern.

Lab for Experimentation

The Lab lets you experiment with new ideas and discard them easily without disturbing the current Flow – and thus your code in production, or your colleagues working simultaneously.

Of course, if you are happy with your work in the Lab, you can deploy it to the Flow! It’s done by clicking the Deploy Script button.

The Lab offers two different tools: visual analysis and code notebooks.

Visual Analysis

Offers interactive data preparation & visual machine learning.

Code Notebooks

Allow you to open a Notebook in different languages – a Python Notebook, an R Notebook, etc. You also get predefined notebook templates to start from.

Dashboards & Reports

Every DSS project has a default dashboard. You can create more.

  • A Dashboard can have multiple slides,
  • Each slide can have multiple tiles.

Each tile holds an insight from the project. Tiles can refer to an extract of the dataset, a chart, metrics, Jupyter notebooks, web apps, macros, …

Example of Tiles

To set up permissions controlling who can write or view each dashboard, you can define groups of users. You can find more information in Project security > Permissions.
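Permissions can also be managed programmatically through the dataikuapi public client. A hedged sketch, assuming a group named analysts and placeholder host / API key / project key values:

```python
import dataikuapi

client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Fetch the current permissions, grant the group read-only access
# to dashboards, then save the modified settings back
permissions = project.get_permissions()
permissions["permissions"].append({
    "group": "analysts",
    "readDashboards": True,
    "writeDashboards": False,
})
project.set_permissions(permissions)
```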

You can also download a dashboard as a PDF, JPEG or PNG file for external export.

Versioning

A DSS project uses git, thus giving us:

  • Version control, with each change recorded as a commit
  • Ability to see changed files
  • Ability to revert
  • Ability to compare two revisions
  • Ability to use git remotes: pull and fetch changes from them, and push changes to them.
  • Ability to use branches. A DSS project can only be on one branch at any given time, so you can’t work on multiple branches at once – you can use duplicated objects instead. (This feature is only available if you have associated a remote.)

In addition, while inside an object (e.g. a dataset or a recipe), you can click the History tab and see the whole history of that object alone.

A bit of history

Releases and creations of distributed databases, versioning systems, machine-learning-friendly programming languages, and tools:

  • 1974: First appearance of SQL databases.
  • 1991: Initial release of Python
  • 1993: Release of R
  • 2005: Initial release of Git
  • 2005: Initial release of Bigtable
  • 2006: Launch of Amazon S3 & AWS.
  • 2006: Initial release of Apache Hadoop.
  • 2007: Initial release of scikit-learn
  • 2007: Initial release of Neo4J, MongoDB, Hypertable
  • 2008: Initial release of Cassandra, HBase
  • 2008: GitHub is founded
  • 2010: Initial release of Elasticsearch
  • 2010: Initial release of Hive
  • 2010: Kaggle is founded.
  • 2011: GitLab is founded
  • 2013: Dataiku is founded.
  • 2014: Initial release of Kubernetes
  • 2018: Release of Amazon EKS (Elastic Kubernetes Service).
