This article is a reflection on my past projects with ZebraX. One might wonder why we still talk about Excel data in this era of SQL, NoSQL, Graph databases, and more. But the truth is, Excel files still have a lot to offer as the primary data management tool for running a business today.
With the buzz around machine learning and the many algorithms available, it is imperative to have a well-structured data format like a database table rather than the semi-structured Excel format. This article explores a workaround for handling such a situation.
The spreadsheet is still a gold standard for data management in many organizations. It lets users store information, perform calculations in place, and build dashboards for reporting.
While database systems such as PostgreSQL and MySQL have transformed data management in many organizations, their schema rigidity becomes a hurdle for users migrating from Excel. Employees still use Excel for daily, weekly, and monthly reports and use databases only to store summarized information. This poses a challenge for machine learning tasks, because summarized information inevitably yields low-quality insight.
Since we don’t know in advance which data yields the most value, it is always better to store everything first and filter out the unnecessary parts later.
figure 1. rekap-shri-feb-2017
But let’s face it: transforming such a complex Excel format into a structured table requires significant effort from employees. Rather than spending time inserting data into a database, they should be doing more analytical work. For a data scientist or engineer, the solution is to tap into the Excel data directly, convert it to a standard table format, and then store it in a database. This way, the quality of the raw data is ensured. It certainly raises some questions, such as:
How much effort is needed to restructure the data? And is it possible to fully automate data acquisition from different Excel files?
The answer to the second question is probably no. But it is possible to integrate some automatic features into the data acquisition process, making it guided rather than fully automated. That alone reduces much of the effort needed to clean up the data.
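The "store everything first" idea can be sketched with pandas and SQLite. This is a minimal illustration, not the application's actual code; the table contents and column names here are hypothetical stand-ins for cells that would normally come from pd.read_excel("rekap-shri-feb.xlsx", header=None).

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for raw cells parsed from an Excel report.
raw = pd.DataFrame(
    [["Desa A", 10, 12], ["Desa B", 7, 9]],
    columns=["village", "week_1", "week_2"],
)

# Store the raw table as-is; filtering the unnecessary parts can
# happen later, in SQL or in pandas.
with sqlite3.connect(":memory:") as conn:
    raw.to_sql("raw_report", conn, if_exists="replace", index=False)
    stored = pd.read_sql("SELECT * FROM raw_report", conn)

print(stored.shape)  # (2, 3)
```

The same `to_sql` call works against PostgreSQL or MySQL through an SQLAlchemy connection, so the raw snapshot can live next to the summarized tables employees already maintain.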
Traits of Excel Data
As an example, Figure 1 shows data titled “Rekap SHRI Kabupaten Demak”, which is available in Satu Data Indonesia. Figure 1 has been modified according to the month and date shown.
Here are some traits commonly found in Excel reports:
- The file is a periodical (monthly) report.
- A single sheet consists of many different tables.
- Tables use multi-level indexes and row-column attributes.
- Tables carry unnecessary information, such as TOTAL rows.
The table in Figure 2 exhibits all of these traits.
figure 2. Table1_rekap-shri-feb-2017
Here is a simple web-based parser that will guide you through the steps of restructuring the example data in Figure 2. The application was developed using a Jupyter Notebook and a package named Voila. Below is the link to the application, followed by GIF tutorials explaining how to use the app.
https://shielded-brook-51240.herokuapp.com/
The first step is to upload rekap-shri-feb.xlsx to the system. We treat this file as the training data. The Preview button shows the result in the output display.
figure 3. Upload
Then go to the next tab, Auto Split, in which the sheet is split into different tables depending on the option you choose. This auto-split function is fully automated and works most of the time.
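One common way such an auto split can work is to cut the sheet at fully blank rows. The sketch below shows that idea on a toy sheet; it illustrates the concept, not the app's exact implementation.

```python
import pandas as pd

# Toy sheet: two stacked tables separated by a fully blank row.
sheet = pd.DataFrame([
    ["Table 1", None], ["a", 1], ["b", 2],
    [None, None],                # blank separator row
    ["Table 2", None], ["c", 3],
])

blank = sheet.isna().all(axis=1)   # True on fully blank rows
group = blank.cumsum()             # group id increments at each blank row
tables = [g for _, g in sheet[~blank].groupby(group[~blank])]

print(len(tables))  # 2 candidate tables
```

Splitting on blank rows is why the step works "most of the time": tables that touch each other with no separator still need the manual cluster selection that follows.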
figure 4. Auto Split
Go to the Reformat tab and choose Cluster. Select cluster 0 for the table in Figure 2.
figure 5. Select Cluster
As mentioned earlier, some rows or columns are irrelevant. It is customary to first remove unnecessary rows by choosing Filter Row with the option remove. The unique marker is a string that occurs only in that particular row.
figure 6. Filter Row
Next, remove unnecessary columns by choosing Filter Column with the option remove. The unique marker is a string that occurs only in that particular column.
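Marker-based filtering can be sketched in a few lines of pandas: drop every row (the same idea transposes to columns) that contains the marker string anywhere. The data and helper name below are illustrative, not the app's internals.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Desa A", "Desa B", "TOTAL"],
    "count": [10, 7, 17],
    "note": ["", "", ""],
})

def drop_rows_with(df, marker):
    """Drop rows whose cells contain the marker string anywhere."""
    mask = df.astype(str).apply(lambda row: row.str.contains(marker).any(), axis=1)
    return df[~mask]

cleaned = drop_rows_with(df, "TOTAL")
print(cleaned.shape)  # (2, 3)
```

Because the marker must be unique to the unwanted row or column, a too-generic marker (for example a common village name) would delete more than intended, which is why the app asks you to verify with the preview.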
figure 7. Filter Column
If we look at the Excel file, some cells are merged as a sign of a MultiIndex or multilevel column. The Merge option simply restores the format to its original form. The required information is the row indexes of those MultiIndex/multilevel columns, in this case [4, 5].
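In pandas terms, merged header cells map onto a two-level column index: `pd.read_excel(..., header=[4, 5])` would read the two header rows directly. The sketch below builds such a MultiIndex by hand (with hypothetical labels) and flattens it to plain column names, which is the shape the later steps expect.

```python
import pandas as pd

# Hypothetical two-level header, as produced by merged Excel cells.
cols = pd.MultiIndex.from_tuples([
    ("Village", ""), ("Week 1", "Male"), ("Week 1", "Female"),
])
df = pd.DataFrame([["Desa A", 3, 4]], columns=cols)

# Flatten to single-level names like "Week 1 Male", skipping blanks.
df.columns = [" ".join(filter(None, c)).strip() for c in df.columns]
print(list(df.columns))  # ['Village', 'Week 1 Male', 'Week 1 Female']
```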
figure 8. Merge
Here is the important part, where the table with row-column attributes is transformed into a standard SQL table. The Unpivot step requires an ID column (the column that stays in place and is not transformed like the others).
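Unpivoting is exactly what `pandas.melt` does: the ID column stays fixed while the remaining column headers become values in a new column. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

wide = pd.DataFrame({
    "Village": ["Desa A", "Desa B"],
    "Week 1": [10, 7],
    "Week 2": [12, 9],
})

# "Village" is the ID column; the week columns become rows.
long = wide.melt(id_vars=["Village"], var_name="Week", value_name="Count")
print(long.shape)  # (4, 3)
```

The long format is what makes the data SQL-friendly: adding a new week adds rows, not columns, so the database schema never has to change.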
figure 9. Unpivot
The last section is Rename Columns. Insert the new column names on one line, separated by commas.
figure 10. Rename Columns
The whole reformatting process can be saved for future use. Just type in a name for the config, then click Save.
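Conceptually, saving a config means serializing the ordered list of reformatting steps and their parameters. The sketch below shows one plausible JSON shape; the step names and fields are illustrative, not the app's real schema.

```python
import json

# Hypothetical config capturing the steps applied above.
config = {
    "name": "rekap-shri",
    "steps": [
        {"op": "filter_row", "mode": "remove", "marker": "TOTAL"},
        {"op": "merge", "header_rows": [4, 5]},
        {"op": "unpivot", "id_columns": ["Village"]},
    ],
}

text = json.dumps(config, indent=2)   # what gets written to disk
restored = json.loads(text)           # what a later session reads back
print(restored["steps"][2]["op"])  # unpivot
```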
figure 11. Save
Now that the config is saved, try converting other files from different months in 2017. These files are also available in Satu Data Indonesia.
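Replaying a saved config on a new month's file then amounts to walking the step list and applying each operation in order. The tiny dispatcher below implements only the unpivot step, with a hypothetical config and data, to show the shape of that loop.

```python
import pandas as pd

# Hypothetical saved config (only one step shown).
config = {"steps": [{"op": "unpivot", "id_columns": ["Village"]}]}

def apply_config(df, config):
    """Replay saved reformatting steps on a new table, in order."""
    for step in config["steps"]:
        if step["op"] == "unpivot":
            df = df.melt(id_vars=step["id_columns"])
    return df

# Stand-in for a parsed table from a different month's file.
march = pd.DataFrame({"Village": ["Desa A"], "Week 1": [5], "Week 2": [6]})
result = apply_config(march, config)
print(result.shape)  # (2, 3)
```

This is what makes the parser "guided" rather than fully automated: the human tunes the steps once on a training file, and the machine repeats them on every sibling file.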
figure 12. Serve & Parse
If the aim is to create a generalized parser for any Excel format, it is probably impossible: many reformatting features still require manual tuning from the user. However, if we settle for a guided parser, then this application fulfills its purpose of simplifying data acquisition from Excel reports.