Data Preprocessing With the Orange Tool

Sujay Patel
Nov 16, 2021

Preprocessing data is a crucial and important step in any machine learning project. Data preprocessing is the process of transforming raw data into a format that a learning algorithm can work with.

Data Preprocessing includes techniques like:

  1. Feature Scaling
  2. Standardization
  3. Encoding
  4. Discretization
  5. Randomization
  6. Handling missing values (Imputation)

Now we’re all set to perform preprocessing.

To perform preprocessing, we’ll use Orange’s Preprocess widget.

Preprocess widget

Let’s create a workflow for it.

Workflow

Here I import train_x and train_y, merge them, select the target variable, and finally connect the Preprocess widget to the result.

Now we’ll perform all the tasks in the Preprocess widget.

Feature Scaling

Feature scaling is a technique in which values are shifted and rescaled so that they end up ranging between 0 and 1, or so that the maximum absolute value of each feature is scaled to unit size.
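
To make the two variants concrete, here is a minimal sketch in plain Python. It is not what Orange runs internally; the function names `min_max_scale` and `max_abs_scale` and the sample columns are made up for this illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


ages = [20, 30, 40, 60]  # hypothetical example column
print(min_max_scale(ages))        # [0.0, 0.25, 0.5, 1.0]
print(max_abs_scale([-4, 2, 8]))  # [-0.5, 0.25, 1.0]
```

Min-max scaling moves the minimum to 0, while max-abs scaling keeps zero at zero and preserves the sign of each value.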

Standardization

Standardization rescales a feature so that its distribution has mean 0 and standard deviation 1.

Let’s select that option in the Preprocess widget.

Standardization
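
For reference, the same idea can be sketched in a few lines of plain Python. This is only an illustration: the helper name `standardize` and the sample scores are made up, and I am assuming the population standard deviation here, which may differ from what Orange uses internally.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift values to mean 0 and scale them to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature cannot be scaled; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical column: mean 5, std 2
z_scores = standardize(scores)
print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

After the transformation, each value says how many standard deviations it lies above or below the mean of its column.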

Encoding

To perform encoding, you can use the Continuize Discrete Variables option, which turns categorical (discrete) features into numeric ones.

Encoding
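
One common way to encode a categorical column is one-hot encoding, where every category gets its own 0/1 column. The sketch below shows that idea in plain Python; the function name `one_hot_encode`, the `name=value` column labels, and the sample colors are assumptions made for this example and are not taken from Orange.

```python
def one_hot_encode(values, name):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    columns = [f"{name}={c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
columns, rows = one_hot_encode(colors, "color")
print(columns)  # ['color=blue', 'color=green', 'color=red']
for row in rows:
    print(row)
# [0, 0, 1]
# [0, 1, 0]
# [1, 0, 0]
# [0, 1, 0]
```

Each row contains exactly one 1, which marks the category the original value belonged to.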

Discretization

Discretization converts a continuous (numeric) variable into a discrete one by splitting its range into a finite number of intervals, or bins, and replacing each value with the bin it falls into. For example, an age column can be turned into the categories young, middle-aged, and old.

Discretization

In the Preprocess widget there is an option called Discretize Continuous Variables.

Discretized dataset

You can also achieve the same results using a Python script, but there you have to hardcode the things you want to achieve.
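
As a rough illustration of the script route, here is a minimal sketch of equal-width binning in plain Python. It is not Orange’s own implementation; the function name `equal_width_bins`, the choice of three bins, and the sample ages are made up for this example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equally wide intervals."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they share a single bin.
        return [0 for _ in values]
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 47, 51, 64, 78]  # hypothetical continuous column
print(equal_width_bins(ages, 3))     # [0, 0, 0, 1, 1, 2, 2]
```

Notice that the number of bins is hardcoded in the call, which is exactly the kind of decision the widget lets you change with a click.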

Randomization

You can achieve randomization using the same Preprocess widget. Just select the Randomize option and use the built-in options as needed.

Randomization
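
To show what randomization can look like in code, here is a small sketch that shuffles only the target column while leaving the features in place. I am assuming that this is the kind of randomization you want; the function name `shuffle_target`, the seed value, and the sample rows are made up for this illustration.

```python
import random


def shuffle_target(rows, target_index, seed=0):
    """Return new rows whose target column is shuffled; features stay in place."""
    rng = random.Random(seed)  # a fixed seed keeps the shuffle reproducible
    targets = [row[target_index] for row in rows]
    rng.shuffle(targets)
    return [
        row[:target_index] + [t] + row[target_index + 1:]
        for row, t in zip(rows, targets)
    ]


# Hypothetical dataset: [feature, target]
data = [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
for row in shuffle_target(data, target_index=1, seed=42):
    print(row)
```

Shuffling the target breaks the link between features and labels, so a model trained on the shuffled data gives a useful baseline to compare against.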

Handling Missing Values

We can handle missing values with the Preprocess widget in Orange.

Imputation

There are 3 options available.

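  1. Replace each missing value with the average (for numeric features) or the most frequent value (for categorical features).
  2. Replace each missing value with a random value from the dataset.
  3. Remove the entire instance (row) that contains a missing value.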
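
The three options above can be sketched in plain Python as follows. This is only an illustration and not Orange’s implementation: missing values are represented as `None`, and the helper names and sample data are made up for this example.

```python
import random
from collections import Counter
from statistics import mean


def impute_average(values):
    """Option 1 (numeric): fill missing entries with the mean of known values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Option 1 (categorical): fill missing entries with the most common value."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_random(values, seed=0):
    """Option 2: fill missing entries with a random known value of the column."""
    rng = random.Random(seed)
    known = [v for v in values if v is not None]
    return [rng.choice(known) if v is None else v for v in values]


def drop_incomplete(rows):
    """Option 3: remove every row that contains a missing value."""
    return [row for row in rows if None not in row]


print(impute_average([1.5, None, 2.5]))                     # [1.5, 2.0, 2.5]
print(impute_most_frequent(["red", None, "blue", "red"]))   # ['red', 'red', 'blue', 'red']
print(impute_random([10, None, 30]))
print(drop_incomplete([[1, "a"], [None, "b"], [3, None]]))  # [[1, 'a']]
```

Removing rows is the simplest choice but throws data away, so the fill-in options are usually preferable when only a few values are missing.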

Conclusion

We’ve covered a lot of ground in this article about data preprocessing with Orange.
