Data profiling

With large datasets, it is often hard to quickly locate a particular variable. There are several approaches. A common one is to use PROC CONTENTS or PROC DATASETS to list all variables together with their labels. However, if you are interested in a single variable or value, you still have to search through the full listing, or through the data itself, to locate it.

To search for a variable in a dataset, follow the steps below:

List the variables of a dataset using PROC CONTENTS

proc contents data=<name of the dataset>;
run;

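List the contents of every dataset in a library using PROC DATASETS

proc datasets lib=<name of the library>;
    /* describe the variables of every dataset in the library */
    contents data=_all_;
run;
quit;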

Search a variable for a certain value

data temp;
    set <name of the dataset>;
    /* keep only the rows where the variable contains the search text */
    where index(<name of the variable>, "<text to search for>") > 0;
run;

To find a value when you do not know which variable holds it, every variable must be checked this way, and the dataset name must be changed by hand for each dataset. If a library contains five datasets with an average of 50 variables each, that is 5 × 50 = 250 separate checks. This is clearly time consuming. Yet locating a variable across a whole library is often essential when producing tables, figures and listings.

To simplify this, a SAS macro can process and check every variable name and value within every dataset automatically. When the macro finds the given value, it automatically generates an exception dataset containing the dataset name and the names of the variables in which the value was found. Within the macro, the SAS dictionary tables supply the variable names and their labels. Once a libname is assigned, SAS makes the details of all variables available through these tables.

To do this, follow the procedure below:

Obtain information on the datasets

proc sql noprint;
    create table ds as
    select * from dictionary.tables
    where libname = upcase("&lib");
quit;

Obtain information on the variables

proc sql noprint;
    create table cont as
    select * from dictionary.columns
    where libname = upcase("&lib");
quit;
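The macro described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's actual macro: the macro name %find_value, its parameters lib=, value= and out=, and the work tables _charvars and _hit are all hypothetical. It assumes standard variable names, a search value without quotation marks, and a case-sensitive match.

```sas
/* Sketch only: %find_value and all table names here are hypothetical */
%macro find_value(lib=, value=, out=work.found);
    %local i n hits;

    /* One row per character variable in the library */
    proc sql noprint;
        create table work._charvars as
        select memname, name
        from dictionary.columns
        where libname = upcase("&lib")
          and memtype = 'DATA'
          and type = 'char';
        select count(*) into :n trimmed from work._charvars;
    quit;

    /* Copy dataset and variable names into numbered local macro variables */
    data _null_;
        set work._charvars;
        call symputx(cats('ds', _n_), memname, 'L');
        call symputx(cats('var', _n_), name, 'L');
    run;

    /* Start from an empty exception dataset */
    data &out;
        length dataset $32 variable $32 hits 8;
        stop;
    run;

    %do i = 1 %to &n;
        /* Count the rows where this variable contains the search text */
        proc sql noprint;
            select count(*) into :hits trimmed
            from &lib..&&ds&i
            where index(&&var&i, "&value") > 0;
        quit;

        /* Record the dataset and variable only when something was found */
        %if &hits > 0 %then %do;
            data work._hit;
                length dataset $32 variable $32 hits 8;
                dataset = "&&ds&i";
                variable = "&&var&i";
                hits = &hits;
            run;

            proc append base=&out data=work._hit;
            run;
        %end;
    %end;
%mend find_value;

/* Example call; mylib is a hypothetical, already assigned library */
%find_value(lib=mylib, value=criminal);
```

Running one query per variable keeps the sketch easy to follow; for very large libraries it may be faster to scan each dataset once and test all of its character variables in a single pass.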

After obtaining information on datasets and variables, every character variable is checked automatically. The following functions and operators are important for this step:

  1. The INDEX function searches for partially known values. For instance, if you want to find variables that hold criminal information, the INDEX function will pick up every variable whose value includes "criminal". This means that values such as "criminal files" and "student criminal" are both included.
  2. The LIKE operator is used when the known value may not match exactly. For instance, suppose you want to locate the name Ann Bobby but are unsure whether it is spelled Ann Bobby or Annah Bobby. Searching for Ann alone would return every person whose first name contains Ann. Using the LIKE operator with the pattern 'Ann%Bobby' solves the problem, because % matches any sequence of characters.
  3. Lastly, the sounds-like operator (=*) covers cases like the one in 2 above where you know only how a value is pronounced and not how it is spelled. For instance, a person named Ann may spell the name Aeine. The sounds-like operator will locate both spellings.
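The three searches above can be tried on a small table. The following is a minimal sketch; the dataset people, its columns name and note, and the sample rows are made up purely for illustration.

```sas
/* Hypothetical sample data, for illustration only */
data people;
    length name $40 note $40;
    infile datalines dsd;
    input name $ note $;
    datalines;
Ann Bobby,criminal files
Annah Bobby,student criminal
Aeine Smith,library card
Tom Jones,none
;
run;

/* 1. INDEX: the search text may appear anywhere in the value */
data hits_index;
    set people;
    where index(note, "criminal") > 0;
run;

/* 2. LIKE: % stands for any run of characters.
      Single quotes stop the macro processor from reading %Bobby as a macro call. */
data hits_like;
    set people;
    where name like 'Ann%Bobby';
run;

/* 3. Sounds-like (=*): compares the first word of name by sound rather than spelling */
data hits_sounds;
    set people;
    where scan(name, 1, " ") =* "Ann";
run;
```

INDEX and LIKE are both case-sensitive, so it may help to wrap the variable in UPCASE and search for an uppercase value when the capitalization is not known.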

How data profiling techniques may be used as part of an applied forecasting project

Data profiling techniques analyze the existing data within a data source and identify metadata about that data. They support analysis, integration and reporting on the issues that matter to a forecasting project, and the insight they provide helps with intelligence gathering and strategic operations.

Data profiling draws on mathematical and statistical tools that support the forecasting of future events within a project. Data scientists and statisticians standardize datasets in order to build correlative statistical algorithms that are useful in forecasting projects. Data profiling can also link and visualize datasets, making them easier to interpret. Solutions to forecasting problems are in most cases correlative rather than causative, so data profiling focuses on the probability of an event based on its having occurred under similar conditions. Without a deeper understanding of the underlying causes, however, forecasts are often inaccurate.

Data profiling also analyzes and presents information in a form that is easy to understand and relevant to decision making, which makes forecasting more reliable because decisions are backed by factual data. Its role in forecasting projects is closely tied to data analytics tools, which draw further meaningful information from the data. Forecasting depends heavily on data profiling to be efficient and accurate.

2020-07-12
