Simplifying Data Acquisition for Excel Report

Puja RomulusData ScientistNovember 20 2020

This article is my own reflection of my past projects with ZebraX. One might wonder why we still talk about Excel Data in this era of SQL, NoSQL, Graph, etc. But the truth is, these Excel files feature still have a lot of things to offer as the primary data management to run the business nowadays.

With the buzz of machine learning everywhere and many options of algorithm out there, it is imperative to have a well-structured data format like we see in a database table and not the semi-structured Excel format. This article explores the workaround in handling such a situation

Why Excel?

The spreadsheet is still one of the gold standards for data management in many organizations. It gives the user the ability to store information, perform in-situ calculation, and make a dashboard for reporting.

While some database provider (Postgresql and Mysql) has helped to transform data management in many organizations, its schema rigidity becomes a hurdle for any user migrating from Excel. Employees are still using Excel for daily/weekly/monthly reports and use databases only to store the summarized information. The condition poses a challenge for machine learning tasks because summarized information will inevitably provide low-quality insight.

Since we don’t know which data yields a higher value, it is always better to store everything first then filter out the unnecessary later

figure-1.png

                                                                  figure 1. rekap-shri-feb-2017

But let’s face it, transforming such a complex Excel format into a structured table requires significant effort from employees. Rather than spending time inserting data into the database, they should do more analytical work. For Data Scientist or Engineer, the solution is to tap into Excel data directly, convert it to a standard table format, then store it in a database. In this way, the quality of raw data is ensured. Certainly, it raises some questions such as

How much effort is needed to restructure the data ? And is it possible to fully automate data acquisition from different Excel files ?

The answer to the second one is probably no. But it is possible to integrate some automatic features in the data acquisition process, making it more guided rather than fully automated. It is still going to reduce some efforts needed to clean up the data though

Traits of Excel Data

For the example, we displayed the Figure 1 data titled “Rekap SHRI Kabupaten Demak” which available in Satu Data Indonesia. The displayed Figure 1 is modified according to the month and date shown.

https://data.go.id/dataset/rekap-shri-feb-2017

Here are some similar traits found in many Excel Reports so far:

  • The user usually divides information periodically (daily, weekly, monthly, or annually). While our clients often use daily format (1 sheet = 1 day), the Rekap-SHRI example is in a monthly period (1 sheet = 1 month)
  • In each sheet, the user combines multiple reports in the form of different tables. Those reports may be related or each on its own.
  • In a relational database, the only attribute is the column. In an Excel report, the user often utilizes both row and column as attributes. Additionally, there are many pivots and multi indexes that further complicate the Excel table structure.
  • There are many calculated cells, rows, or columns already presents. All this information is unnecessary in the database.

The following table contains all traits previously explained. The file is a periodical report (monthly) and consists of many different tables. It also has multi indexes and row-column attributes, and finally some unnecessary information such as TOTAL Image for post

figure-2.png

                                                        figure 2. Table1_rekap-shri-feb-2017

The Workaround

Here is a simple web-based parser that will guide you through the steps of restructuring the example data in Figure 2. The application is developed using Jupyter Notebook and a package named Voila. Below is the link to the application, followed by GIF tutorials explaining how to use the app.

https://shielded-brook-51240.herokuapp.com/

The first step is to Upload rekap-shri-feb.xlsx to the system. We treat this file as the training data. The preview button shows the result in the output display.

index.gif

                                                                      figure 3. Upload

Then go to the next tab Auto Split, in which the sheet is split into different tables depending on the option that you choose. This auto split function is fully automated and can work most of the time.

figure 4.gif

                                                                    figure 4. Auto Split

Go to tab Reformat and Choose Cluster. Select cluster 0 for table in Figure 2,

figure-5.gif

                                                                  figure 5. Select Cluster

As mentioned earlier, some rows or columns are irrelevant. It is customary to first remove unnecessary row by choosing Filter Row and option remove. The unique marker is a string that occurs only in that particular row.

figure-6.gif

                                                                   figure 6. filter row

Next is to remove unnecessary columns by choosing Filter Column and option remove. The unique marker is a string that occurs only in that particular column.

figure-7.gif

                                                                  figure 7. Filter Column

If we look at the Excel File, some cells are merged as a sign of MultiIndex or Multilevel Column. The Merge option is just returning the format to its original form. The necessary information is the index of those MultiIndex/Multilevel columns, or in this case [4,5]

figure-8.gif

                                                                    figure 8. Merge

Here is an important part where this table with row-column attributes are transformed to a standard SQL table. The Unpivot requires ID Column (The column that stays there and not being transformed like the other)

figure-9.gif

                                                                    figure 9. Unpivot

The last section is Rename Columns. Insert new column names in one line separated by comma.

figure-10.gif

                                                                figure 10. Rename Columns

All previous reformatting process is saveable for future usage. Just type in any name for the config then click Save

figure-11.gif

                                                                          figure 11. Save

Now that config is saved, try converting other files from different months in 2017. These files are also available in Satu Data Indonesia.

figure-12.gif

                                                                figure 12. Serve & Parse

Conclusion

If the aim is to create a Generalized Parser for any excel format, it’s probably impossible to do. Many reformatting features still require manual tuning from the user. However, if we settle with the Guided Parser for any excel format, then this application probably fulfilled its purpose, to simplify data acquisition for excel report.

Take the next step
Get a Demo/Diagnostic for your Organization's Digital Transformation Journey