Introduction: ETL Tool

A data warehouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems.

Over the years, companies have gathered huge volumes of data. A data warehouse puts this data to use by analysing the trends in it, which can provide a competitive advantage.

The ETL system is the foundation of any data warehouse (DW) system.

An ETL system extracts data from source systems, enforces data quality and consistency, conforms data so that separate source systems can be used together, and delivers data in a presentation-ready format that can be used by end users.

It typically consumes 70-80% of the resources needed to implement and maintain a DW.

The mission of the ETL system is to get data out of the source systems and load it into the DW:

Extract: Extract relevant data

Transform: Transform data to DW format, build keys, cleanse data

Load: Load data into DW, build aggregates, etc.

DESCRIPTION:

We have built a Java-based ETL tool that extracts data from various sources (XML, spreadsheets, flat files, RDBMS), applies several transformations (such as deduplication, handling of missing values, standardization of dates, and uppercase/lowercase conversion) to the source data, and loads the structured data into a star schema.

Example: For a university registration system, the source files can contain detailed information about students, courses, faculty, campus, academic year and semester.

The following steps are involved in the whole process:

Step 1: Extraction From Various Sources

a. XML - For extracting data from XML files, SAX parsers can be used; SAX is an event-based, sequential-access parser API that operates on each piece of the XML document in turn. We extract the elements from the XML file based on the metadata provided along with the source file (sketched below).

b. Spreadsheet - For extracting data from spreadsheets we can use the Apache POI APIs, which support working with Microsoft Excel documents in both the XLS and XLSX formats (sketched below).

c. Flat files – For extracting data from flat files, we read the file sequentially using a stream reader, based on the metadata provided along with the source file (sketched below).
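As an illustration, a SAX-based extraction might look like the following Java sketch. The record element name ("student") and the flat layout of leaf elements are assumptions made for the example; in the tool itself these names come from the metadata supplied with the source file.

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XmlExtractor extends DefaultHandler {
    private final List<Map<String, String>> records = new ArrayList<>();
    private Map<String, String> current;           // record being built
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("student".equals(qName)) {             // assumed record element
            current = new HashMap<>();
        }
        text.setLength(0);                         // reset character buffer
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return;               // outside a record
        if ("student".equals(qName)) {
            records.add(current);                  // record complete
            current = null;
        } else {
            current.put(qName, text.toString().trim()); // leaf element -> field
        }
    }

    public static List<Map<String, String>> extract(File xmlFile) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        XmlExtractor handler = new XmlExtractor();
        parser.parse(xmlFile, handler);            // fires the events above sequentially
        return handler.records;
    }
}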
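A corresponding sketch for spreadsheets, assuming a simple single-sheet layout: WorkbookFactory.create() picks the right workbook implementation for both XLS and XLSX, and DataFormatter renders each cell as the text Excel would display.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SpreadsheetExtractor {
    public static List<List<String>> extract(File file) throws Exception {
        List<List<String>> rows = new ArrayList<>();
        DataFormatter formatter = new DataFormatter(); // cells as displayed text
        try (Workbook workbook = WorkbookFactory.create(file)) { // XLS or XLSX
            Sheet sheet = workbook.getSheetAt(0);      // first sheet only (assumption)
            for (Row row : sheet) {
                List<String> values = new ArrayList<>();
                for (Cell cell : row) {
                    values.add(formatter.formatCellValue(cell));
                }
                rows.add(values);
            }
        }
        return rows;
    }
}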
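And a sketch for flat files; the delimiter parameter stands in for whatever the accompanying metadata actually specifies.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FlatFileExtractor {
    public static List<String[]> extract(String path, String delimiter) throws Exception {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {         // sequential read
                // Treat the delimiter literally, and keep trailing empty fields.
                records.add(line.split(Pattern.quote(delimiter), -1));
            }
        }
        return records;
    }
}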

Step 2: Transforming the Given Data

a. Data Staging Area – We have used relational tables for our staging area. The extracted data is brought into these tables, where the various transformations are applied.

b. Splitting – Fields like name and address are combinations of first name and last name, and of city, state and pincode, respectively. We split such fields by creating multiple columns in the relational tables (see the sketch after this list).

c. Merging – We can merge information such as the course id, course name and course description from different files into a single entity.

d. Deduplication – We can apply this transformation to name columns such as student name or faculty name, for example by resolving abbreviated names to full names and removing duplicate copies.

e. Date/Time Conversion – We can standardize all the dates in the source data to the “DD/MM/YYYY” format.
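A sketch of three of these transformations, splitting a combined name field, standardizing dates to DD/MM/YYYY, and deduplicating name values, as they might be applied to staged data. The assumed source date format (yyyy-MM-dd) and the case-insensitive duplicate rule are illustrative choices, not fixed by the design above.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Transformations {

    // Splitting: break an assumed "firstname lastname" field into two columns.
    public static String[] splitName(String fullName) {
        String[] parts = fullName.trim().split("\\s+", 2);
        return new String[] { parts[0], parts.length > 1 ? parts[1] : "" };
    }

    // Date/time conversion: parse an assumed yyyy-MM-dd source value and
    // re-emit it in the standardized DD/MM/YYYY form.
    private static final DateTimeFormatter SOURCE = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter TARGET = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    public static String standardizeDate(String raw) {
        return LocalDate.parse(raw, SOURCE).format(TARGET);
    }

    // Deduplication: drop repeated names case-insensitively, keeping the first
    // spelling seen. Resolving abbreviations to full names would need an
    // additional lookup map and is omitted here.
    public static List<String> deduplicate(List<String> names) {
        Map<String, String> seen = new LinkedHashMap<>();
        for (String name : names) {
            seen.putIfAbsent(name.trim().toLowerCase(), name.trim());
        }
        return new ArrayList<>(seen.values());
    }
}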

Step 3: Loading the Data

Loading the data into the star schema – After applying the transformations, we load the data into the appropriate star schema structure. We have the following star schema in our application:

· Dimension Tables:

1. Student (studentid, name, degree, type, isVertical, gender, dob, city, state, country, pincode)

2. Course (courseid, name, isElective, type, credits, isInter)

3. Faculty (facultyid, name, doj, workEx, qualification)

4. Time (semester, academicYear, typeSem)

5. Campus (campusid, name, city, state, country)

· Fact Table:

The Registration table is a factless fact table: it contains no measures, only the surrogate keys of all the dimensions.

· Look-up tables: We have the following look-up tables in our application, one for each dimension:

1. StudentLookup

2. FacultyLookup

3. CourseLookup

4. TimeLookup

5. CampusLookup

The above look-up tables hold the mapping between the natural keys and the surrogate keys of the corresponding dimensions; a sketch of how they are used during loading follows.
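A sketch of how the load step might use these mappings: each natural key is resolved to its surrogate key through the corresponding look-up table, and the surrogate keys are then inserted into the Registration fact table. The column names (natural_key, surrogate_key, student_key, and so on) are assumptions for illustration; the actual schema may name them differently.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FactLoader {

    // Resolve a natural key to its surrogate key via a look-up table.
    static long surrogateKey(Connection conn, String lookupTable,
                             String naturalKey) throws SQLException {
        String sql = "SELECT surrogate_key FROM " + lookupTable
                   + " WHERE natural_key = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, naturalKey);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("No mapping for key: " + naturalKey);
                }
                return rs.getLong(1);
            }
        }
    }

    // Insert one registration row: the factless fact table holds only the
    // surrogate keys of the participating dimensions.
    static void loadRegistration(Connection conn, String studentId, String courseId,
                                 String facultyId, String timeId, String campusId)
            throws SQLException {
        String sql = "INSERT INTO Registration "
                   + "(student_key, course_key, faculty_key, time_key, campus_key) "
                   + "VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, surrogateKey(conn, "StudentLookup", studentId));
            ps.setLong(2, surrogateKey(conn, "CourseLookup", courseId));
            ps.setLong(3, surrogateKey(conn, "FacultyLookup", facultyId));
            ps.setLong(4, surrogateKey(conn, "TimeLookup", timeId));
            ps.setLong(5, surrogateKey(conn, "CampusLookup", campusId));
            ps.executeUpdate();
        }
    }
}

In a production load, the look-up mappings would typically be cached in memory rather than queried row by row, but the per-row query keeps the sketch simple.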