A brief look into Data Wrangling

An Essential Supporting Skill for doing Analytics in Data Science

Andi Sama - CIO, Sinergi Wahana Gemilang; with Davin Ardian

Supporting files are available on GitHub.

When presented with a set of accessible datasets (structured or unstructured), most Data Scientists think and experiment a lot to find the best way to generate useful models with the various available algorithms, wondering what to do with these datasets to predict certain relationships, for example by doing dimensionality reduction, classification, regression, or clustering, to mention the common ones. That assumes the provided datasets are ready and already in formats suitable for direct analytics, such as building an Artificial Intelligence (AI) model through Machine Learning (ML) or Deep Learning (DL) approaches.

In the real world, the datasets we need for analytics may come from different sources and in various formats. We wish for some way to convert these datasets into the formats we expect, so that we can do analytics better. This process is called data wrangling: pre-processing semi-structured or unstructured data by doing data extraction and transformation. Below, we see a series of Linux commands that find the occurrences of the string "Failed password". The input is a file exported from the server log in IBM QRadar SIEM (Security Information and Event Management) version 7.3.2; the file was exported through a plugin in QRadar Experience Center.

andisama@ASM-T480:~/asama/wrangling$ cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | sort | grep "Failed password"
<182>Mar 20 16:56:01  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:31  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed

A similar process for pre-processing structured data for analytics has been known for many years as ETL (extract, transform, load). ETL is done before the data is loaded into a data warehouse to be accessed by analytics tools (e.g., doing prediction through statistics) or presented in a visualization tool (e.g., business intelligence).
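To make the idea concrete, below is a minimal, illustrative ETL-style shell pipeline over the exported QRadar log file used later in this article (the output file name curated_failed_logins.csv is ours, chosen only for illustration): extract the failed-login rows with grep, transform them by keeping just two columns with gawk, and load the result into a new CSV that a downstream tool could read.

# extract: keep only rows categorized as failed SSH logins
# transform: keep columns 20 (raw payload) and 21 (event category)
# load: append the reduced rows into a curated CSV (hypothetical file name)
cat 2020-03-20-data_export.csv | grep "SSH Login Failed" | gawk -F, '{ print $20 "," $21 }' >> curated_failed_logins.csv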

In this article, we briefly discuss a few ways to do data extraction, transformation, and cleansing by example. The purpose is to show that this can be done with common tools available in a Linux operating system, aside from the many dedicated tools (whether supported open source or licensed software) that help do this in a much better and more efficient way.

Start Experimenting

In this article, we use WSL (Windows Subsystem for Linux) on Windows 10, running Ubuntu 18.04.4 LTS. In general, we can do the same on any Linux distribution, whether on a public cloud such as IBM Cloud, AWS, GCP, or Azure, or installed on-premise on a laptop or server.

We quickly check the current date and time and the running processes with the 'date' and 'ps -ef' commands, as well as file-system disk-space usage and a live summary of running processes with the 'df' and 'top' commands, respectively.

Linux ASM-T480 4.4.0-18362-Microsoft #476-Microsoft Fri Nov 01 16:53:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic
It is now Fri Mar 20 23:10:31 +07 2020
Welcome andisama
You are now in /home/andisama/asama/wrangling
andisama@ASM-T480:~/asama/wrangling$ date
Fri Mar 20 23:15:52 +07 2020
andisama@ASM-T480:~/asama/wrangling$ pwd
/home/andisama/asama/wrangling
andisama@ASM-T480:~/asama/wrangling$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Mar20 ?        00:00:00 /init ro
root         7     1  0 Mar20 tty1     00:00:00 /init ro
andisama     8     7  0 Mar20 tty1     00:00:01 -bash
andisama   181     8  0 00:23 tty1     00:00:00 ps -ef
andisama@ASM-T480:~/asama/wrangling$ df
Filesystem     1K-blocks      Used Available Use% Mounted on
rootfs         246742012 206386956  40355056  84% /
root           246742012 206386956  40355056  84% /root
home           246742012 206386956  40355056  84% /home
data           246742012 206386956  40355056  84% /data
cache          246742012 206386956  40355056  84% /cache
mnt            246742012 206386956  40355056  84% /mnt
none           246742012 206386956  40355056  84% /dev
none           246742012 206386956  40355056  84% /run
none           246742012 206386956  40355056  84% /run/lock
none           246742012 206386956  40355056  84% /run/shm
none           246742012 206386956  40355056  84% /run/user
cgroup         246742012 206386956  40355056  84% /sys/fs/cgroup
C:\            246742012 206386956  40355056  84% /mnt/c
D:\            252056572 214760988  37295584  86% /mnt/d
G:\            241172480 202835180  38337300  85% /mnt/g
andisama@ASM-T480:~/asama/wrangling$ top
top - 00:17:23 up  1:06,  0 users,  load average: 0.52, 0.58, 0.59
Tasks:   4 total,   1 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us,  4.0 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
KiB Mem : 16628712 total, 10613864 free,  5778372 used,   236476 buff/cache
KiB Swap: 33554432 total, 33543808 free,    10624 used. 10709484 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0    8896    308    272 S   0.0  0.0   0:00.17 init
    7 root      20   0    8900    212    164 S   0.0  0.0   0:00.01 init
    8 andisama  20   0   15328   3768   3660 S   0.0  0.0   0:01.19 bash
  176 andisama  20   0   16388   2072   1500 R   0.0  0.0   0:00.06 top

Let’s start by defining a string using the echo command and saving it to a local file, then display the file using the cat (concatenate files and print on the standard output) command.

echo "Sinergi Wahana Gemilang, a Value Added Distributor for IBM Hardware & Software in Indonesia" > myfile
cat myfile
Sinergi Wahana Gemilang, a Value Added Distributor for IBM Hardware & Software in Indonesia

Next, try piping: sending the output of one command as the input to another. In this case, the output of cat becomes the input to sed (stream editor). We pass 's/Hardware/Systems/' as sed's parameter, meaning: replace the first occurrence of Hardware with Systems.

Passing the output of cat to the wc (word count) command, we can count characters with 'wc -c' and words with 'wc -w', for instance.

cat myfile | sed 's/Hardware/Systems/'
Sinergi Wahana Gemilang, a Value Added Distributor for IBM Systems & Software in Indonesia
cat myfile | wc -c
92
cat myfile | wc -w
14

Another example sets 's/[abc]/x/' as sed's parameter, meaning: replace the first occurrence of a, b, or c with x. In the second example we add 'g' at the end, meaning: replace every occurrence of a, b, or c with x. The string '[abc]' after 's/' is actually a regular expression (regex) that we can define to manipulate the input data stream. The '-E' option tells sed to use modern (extended) regular expressions, minimizing the use of backslash (escape) characters that were required for compatibility with early versions.

cat myfile | sed 's/[abc]/x/'
Sinergi Wxhana Gemilang, a Value Added Distributor for IBM Hardware & Software in Indonesia
cat myfile | sed -E 's/[abc]/x/g'
Sinergi Wxhxnx Gemilxng, x Vxlue Added Distrixutor for IBM Hxrdwxre & Softwxre in Indonesix

Further examples show how to replace every occurrence of 'in' with 'xyz' (first example), then every occurrence of 'in' or 'is' with 'xyz' (second example).

cat myfile | sed -E 's/(in)/xyz/g'
Sxyzergi Wahana Gemilang, a Value Added Distributor for IBM Hardware & Software xyz Indonesia
cat myfile | sed -E 's/(in|is)/xyz/g'
Sxyzergi Wahana Gemilang, a Value Added Dxyztributor for IBM Hardware & Software xyz Indonesia
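As an extra sketch that is not part of the original examples, the parentheses also create capture groups that can be referred back to in the replacement as \1, \2, and so on. The command below should swap 'Hardware' and 'Software' in myfile; note that a literal & in the replacement must be written as \&, because a bare & means "the whole match" to sed.

# swap the two captured words; \& inserts a literal ampersand
cat myfile | sed -E 's/(Hardware) & (Software)/\2 \& \1/'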

Note that regular expressions can become very complex, and most of the time there is more than one (sometimes many) approach to solving a particular regex problem (even one as seemingly simple as recognizing an email address). An interactive regex debugger can come in handy when dealing with complex regexes, such as (Firas Dib, 2020) and (GSkinner, 2020). 'man sed' is another way to find more documentation about sed, using the man command. The man command is typically available for any command, so we can explore its functionality further by reading the provided documentation.

man sed
SED(1)                           User Commands                          SED(1)

NAME
       sed - stream editor for filtering and transforming text

SYNOPSIS
       sed [OPTION]... {script-only-if-no-other-script} [input-file]...

DESCRIPTION
       Sed is a stream editor.  A stream editor is used to perform basic text
       transformations on an input stream (a file or input from a pipeline).
       While in some ways similar to an editor which permits scripted edits
       (such as ed), sed works by making only one pass over the input(s), and
       is consequently more efficient.  But it is sed's ability to filter text
       in a pipeline which particularly distinguishes it from other types of
       editors.
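Returning to the earlier remark about recognizing email addresses: as a rough, hedged sketch only (a fully correct email regex is considerably more involved, and the file name somefile.txt is hypothetical), grep -E can pull email-like strings out of a text stream.

# -o prints only the matching substrings, -i makes the match case-insensitive
grep -E -i -o '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' somefile.txt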

A Practical Use-Case: Analyzing Security Events Log

Let’s look at some real data to explore. The command 'ls -al' lists the files in the current directory. The comma-separated-value Linux server log file '2020-03-20-data_export.csv' is the result of an export executed from an IBM QRadar SIEM version 7.3.2 server, through a plugin installed in QRadar Experience Center.

We notice that the log file has only 109 log entries. While this is not a big log file, it should be enough for our exploration. The gawk command with 'print $0' prints everything in the log file without any filter. gawk, a pattern scanning and processing language, is the GNU project's implementation of the AWK programming language and is good for processing columnar data.

ls -al
total 252
drwxrwxrwx 1 andisama andisama   4096 Mar 20 17:33 .
drwxrwxrwx 1 andisama andisama   4096 Mar 20 17:33 ..
-rwxrwxrwx 1 andisama andisama 211272 Mar 20 17:33 2020-03-20-data_export.csv
-rwxrwxrwx 1 andisama andisama  41877 Mar 20 17:33 2020-03-20-RedHat_logs_export_visible.csv
cat 2020-03-20-data_export.csv | gawk -F, '{ print $0 }' | wc -l
109
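As an optional sanity check at this point (a small sketch, not from the original walk-through), we can also ask gawk how many comma-separated fields each row actually has by printing its built-in NF variable.

# print the field count of the first row only
cat 2020-03-20-data_export.csv | gawk -F, '{ print NF }' | head -n1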

One way to do data exploration is to visualize and understand the data along with its structure and the variation in its values. From there, we may come up with ideas on what we want to do with the whole data or just a part of it.

Let’s quickly see the last log entry using the gawk command, whose output is passed to the tail command.

cat 2020-03-20-data_export.csv | gawk -F, '{ print $0 }' | tail -n1
N/A,FALSE,0,0,N/A,N/A,FALSE,N/A,00:00:00:00:00:00,PDE4Mj5NYXIgMjAgMTY6NDk6MDAgIE1hciA3IDEzOjM4OjQ5IHJlZGhhdDEgc3NoZFsyOTcwN106IEFjY2VwdGVkIHBhc3N3b3JkIGZvciB1c2VyMSBmcm9tIDE0Ni44OS4xNi4xMjMgcG9ydCA2NDAzNQ0KAA==,2,71,11,FALSE,0:0:0:0:0:0:0:0,N/A,0,N/A,8058,<182>Mar 20 16:49:00  Mar 7 13:38:49 redhat1 sshd[29707]: Accepted password for user1 from 146.89.16.123 port 64035 ,Messages,N/A,1,0,unknown,N/A,Ssh Message,unknown,0,null,1.5847E+12,255,0,FALSE,0,"true""""",N/A,This event was recognized as an Ssh log message,0,1.5847E+12,No Offense,255,LinuxServer @ Mar,TRUE,1,FALSE,N/A,169.254.3.6,1.5847E+12,5,Default Domain,0,0,2,-1,N/A,3c 31 38 32 3e 4d 61 72 20 32 30 20 31 36 3a 34 39 3a 30 30 20 20 4d 61 72 20 37 20 31 33 3a 33 38 3a 34 39 20 72 65 64 68 61 74 31 20 73 73 68 64 5b 32 39 37 30 37 5d 3a 20 41 63 63 65 70 74 65 64 20 70 61 73 73 77 6f 72 64 20 66 6f 72 20 75 73 65 72 31 20 66 72 6f 6d 20 31 34 36 2e 38 39 2e 31 36 2e 31 32 33 20 70 6f 72 74 20 36 34 30 33 35 0d 0a 00,N/A,44251372,null,N/A,"Mar 20, 2020, 4:49:00 PM","Mar 20, 2020, 4:49:00 PM",00:00:00:00:00:00,0,FALSE,169.254.3.6,0,N/A,0.0.0.0,1,"Mar 20, 2020, 4:49:00 PM",0:0:0:0:0:0:0:0,N/A,null,"100,090,100,089,100,000,000,000,000,000,000,000,000,000,000,000,000,000",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Well, it looks quite complicated, and it is certainly hard to understand the meaning of each part without proper documentation (and most of the time we do not have access to the documentation, the documentation is not the latest, or at worst it is not available at all).
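One clue: the long 'PDE4Mj...' value looks like a base64-encoded copy of the raw payload. Assuming it is the 10th comma-separated field, as it appears to be in this export, a quick hedged way to peek inside it is to decode it with base64 -d.

# take the last log entry, extract field 10, and decode it from base64
cat 2020-03-20-data_export.csv | tail -n1 | gawk -F, '{ print $10 }' | base64 -d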

Let’s look at more data, then. Using similar commands, replace tail with head to see the first log entry of the file. We can use 'tail -n25' or 'head -n25', for example, to display more data, in this case the last or first 25 entries of the log.

cat 2020-03-20-data_export.csv | gawk -F, '{ print $0 }' | head -n1
N/A,FALSE,0,0,N/A,N/A,FALSE,N/A,00:00:00:00:00:00,PDE4Mj5NYXIgMjAgMTc6MDc6NDEgIE1hciA4IDIzOjA1OjExIHJlZGhhdDEgc3NoZFsxNDM0MF06IERpZCBub3QgcmVjZWl2ZSBpZGVudGlmaWNhdGlvbiBzdHJpbmcgZnJvbSAxNzIuMTYuMTQyLjE3DQoA,3,71,11,FALSE,0:0:0:0:0:0:0:0,N/A,0,N/A,18040,<182>Mar 20 17:07:41  Mar 8 23:05:11 redhat1 sshd[14340]: Did not receive identification string from 172.16.142.17 ,SSH Terminated,N/A,1,0,unknown,N/A,Did not receive identification string from [ip],unknown,0,null,1.5847E+12,255,0,FALSE,0,"true""""",N/A,Did not receive identification string from [ip],0,1.5847E+12,No Offense,255,LinuxServer @ Mar,FALSE,1,FALSE,N/A,172.16.142.17,1.5847E+12,5,Default Domain,0,0,5,-1,N/A,3c 31 38 32 3e 4d 61 72 20 32 30 20 31 37 3a 30 37 3a 34 31 20 20 4d 61 72 20 38 20 32 33 3a 30 35 3a 31 31 20 72 65 64 68 61 74 31 20 73 73 68 64 5b 31 34 33 34 30 5d 3a 20 44 69 64 20 6e 6f 74 20 72 65 63 65 69 76 65 20 69 64 65 6e 74 69 66 69 63 61 74 69 6f 6e 20 73 74 72 69 6e 67 20 66 72 6f 6d 20 31 37 32 2e 31 36 2e 31 34 32 2e 31 37 0d 0a 00,N/A,44250068,null,N/A,"Mar 20, 2020, 5:07:41 PM","Mar 20, 2020, 5:07:41 PM",00:00:00:00:00:00,0,FALSE,169.254.3.6,0,N/A,0.0.0.0,3,"Mar 20, 2020, 5:07:41 PM",0:0:0:0:0:0:0:0,N/A,null,"100,090,100,089,100,000,000,000,000,000,000,000,000,000,000,000,000,000",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

A more readable view of the log entries is shown below, where we display only columns 20 and 21 using the '{ print $20, "==>", $21 }' parameter passed to gawk. The '-F,' option tells gawk that the log file contains comma-separated values.

cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | head -n24
<182>Mar 20 17:07:41  Mar 8 23:05:11 redhat1 sshd[14340]: Did not receive identification string from 172.16.142.17  ==> SSH Terminated
<182>Mar 20 17:07:31  Mar 8 16:06:02 redhat1 sshd[5444]: subsystem request for sftp  ==> SSH In Progress
<182>Mar 20 17:07:21  Mar 8 16:06:02 redhat1 sshd[5444]: Accepted publickey for user1 from 172.16.44.15 port 3460 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:07:11  Mar 8 15:15:06 redhat1 sshd[3749]: Accepted password for user1 from 10.0.0.153 port 4455 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:07:01  Mar 8 15:15:06 redhat1 sshd[3749]: Could not reverse map address 10.0.0.153.  ==> Messages
<182>Mar 20 17:06:51  Mar 8 12:06:00 redhat1 sshd[29619]: subsystem request for sftp  ==> SSH In Progress
<182>Mar 20 17:06:41  Mar 8 12:06:00 redhat1 sshd[29619]: Accepted publickey for user1 from 172.16.44.15 port 4242 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:06:31  Mar 8 11:30:48 redhat1 sshd[28104]: Accepted password for user1 from 10.20.22.30 port 32807 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:06:21  Mar 8 10:17:52 redhat1 sshd[24556]: Accepted password for user1 from 10.20.22.30 port 32798 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:06:01  Mar 7 13:38:49 redhat1 sshd[29707]: Accepted password for user1 from 146.89.16.123 port 64035  ==> Messages
<182>Mar 20 17:05:51  Mar 12 13:24:17 redhat1 sshd[23522]: Accepted password for user1 from 10.0.0.153 port 2007 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:05:41  Mar 12 13:24:17 redhat1 sshd[23522]: Could not reverse map address 10.0.0.153.  ==> Messages
<182>Mar 20 17:05:21  Mar 12 13:23:58 redhat1 sshd[23510]: Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:11  Mar 12 13:23:58 redhat1 sshd[23510]: Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:41  Mar 12 13:23:51 redhat1 sshd[23510]: Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:11  Mar 12 13:23:48 redhat1 sshd[23510]: Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
<182>Mar 20 17:04:01  Mar 12 13:23:48 redhat1 sshd[23510]: Could not reverse map address 10.0.0.153.  ==> Messages
<182>Mar 20 17:03:41  Mar 12 12:06:18 redhat1 sshd[21409]: subsystem request for sftp  ==> SSH In Progress
<182>Mar 20 17:03:31  Mar 12 12:06:18 redhat1 sshd[21409]: Accepted publickey for user1 from 172.16.44.15 port 3247 ssh2  ==> Host Login Succeeded
<182>Mar 20 17:03:21  Mar 12 10:53:50 redhat1 sshd[18886]: Accepted password for user1 from 10.0.0.153 port 1776 ssh2  ==> Host Login Succeeded

We further filter the log by piping it through the grep command, looking for lines containing the string 'SSH Login Failed'. We also count the number of matching lines with 'wc -l'. grep searches for a pattern in its standard input, while wc can print newline, word, and byte counts. Out of the 109 log entries in the file, 10 lines match grep's search criterion. The last command below illustrates the same search but for lines containing the string 'Failed password'.

cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | sort | grep "SSH Login Failed"
<182>Mar 20 16:55:51  Mar 12 13:23:48 redhat1 sshd[23510]: Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:01  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:21  Mar 12 13:23:51 redhat1 sshd[23510]: Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:31  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:51  Mar 12 13:23:58 redhat1 sshd[23510]: Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:21  Mar 12 13:23:58 redhat1 sshd[23510]: Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | sort | grep "SSH Login Failed" | wc -l
10
cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | sort | grep "Failed password"
<182>Mar 20 16:56:01  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 16:56:31  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:04:31  Mar 12 13:23:51 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
<182>Mar 20 17:05:01  Mar 12 13:23:58 redhat1 sshd[23510]: Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
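Beyond grepping for a single category, a common wrangling step is to count how many log entries fall into each event category (column 21 here). A small sketch using sort and uniq:

# count log entries per event category, most frequent first
cat 2020-03-20-data_export.csv | gawk -F, '{ print $21 }' | sort | uniq -c | sort -rn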

Recall that we demonstrated the sed command earlier as a stream editor that takes a regular expression, finds a matching pattern, and replaces it with something else. Now we apply sed to our log file by taking the list of lines for columns 20 and 21 as before and piping it to sed for editing. The sed command is shown below, with 's/.*[0-9]\]: //' as its parameter. Looks more complicated than before? It seems so.

The parameter 's/.*[0-9]\]: //' passed to sed is actually quite simple. It is a regex that says: in each line, match any characters ('.' repeated zero or more times, expressed by '*'), followed by a digit (0 to 9), followed by the literal text ']: ' (the ']' is escaped with a backslash), and replace that matching prefix with nothing, i.e. delete it.

cat 2020-03-20-data_export.csv | gawk -F, '{ print $20, "==>", $21 }' | sed -E 's/.*[0-9]\]: //' | grep "Failed"
Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information
Failed password for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed keyboard-interactive for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> SSH Login Failed
Failed none for illegal user user2 from 10.0.0.153 port 2006 ssh2  ==> Information

Below we see the same regular expression shown in a GUI-based regex debugger (Firas Dib, 2020), a tool suggested for regex debugging in a recent MIT lecture (MIT, 2020a) on supporting tools for Computer Science classes.

Does it look better with this tool? Hopefully we can learn regex more easily with it, especially when analyzing complex expressions.

A Practical Use-Case: Analyzing TSLA Stock Data (2010–2020)

Let’s look at another real dataset. The command 'ls -al' lists the files in the current directory; note the addition of the new file 'TSLA.csv', a comma-separated-value file downloaded from Kaggle (Timo Bozsolik, 2020). Tesla (TSLA) is a provider of electric cars, solar panels, and clean energy storage.

The command 'head -n5' following "gawk -F, '{ print $0 }'" shows the first 5 rows of the data, and 'tail -n5' shows the last 5 rows.

ls -al
total 40600
drwxrwxrwx 1 andisama andisama     4096 Mar 22 20:41 .
drwxrwxrwx 1 andisama andisama     4096 Mar 20 17:33 ..
-rwxrwxrwx 1 andisama andisama   195995 Mar 21 02:21 2020-03-20-data_export.csv
-rwxrwxrwx 1 andisama andisama    41877 Mar 21 02:21 2020-03-20-RedHat_logs_export_visible.csv
-rw-rw-rw- 1 andisama andisama    37287 Mar 22 05:24 image000001.jpg
-rw-rw-rw- 1 andisama andisama   147944 Mar 22 05:24 image000002.jpg
-rw-rw-rw- 1 andisama andisama    89637 Mar 22 05:24 image000003.jpg
-rwxrwxrwx 1 andisama andisama   175082 Mar 22 20:41 TSLA.csv
-rwxrwxrwx 1 andisama andisama 39797034 Mar 22 04:31 WonderfulIndonesia_Bali.mp4
-rwxrwxrwx 1 andisama andisama       43 Mar 22 04:31 WonderfulIndonesia_Bali_sourcelink.txt
cat TSLA.csv | gawk -F, '{ print $0 }' | head -n5
Date,Open,High,Low,Close,Adj Close,Volume
2010-06-29,19.000000,25.000000,17.540001,23.889999,23.889999,18766300
2010-06-30,25.790001,30.420000,23.299999,23.830000,23.830000,17187100
2010-07-01,25.000000,25.920000,20.270000,21.959999,21.959999,8218800
2010-07-02,23.000000,23.100000,18.709999,19.200001,19.200001,5139800
cat TSLA.csv | gawk -F, '{ print $0 }' | tail -n5
2020-01-28,568.489990,576.809998,558.080017,566.900024,566.900024,11788500
2020-01-29,575.690002,589.799988,567.429993,580.989990,580.989990,17801500
2020-01-30,632.419983,650.880005,618.000000,640.809998,640.809998,29005700
2020-01-31,640.000000,653.000000,632.520020,650.570007,650.570007,15719300
2020-02-03,673.690002,786.140015,673.520020,780.000000,780.000000,47065000

We notice that this data has a header in its first row. To proceed, we need all the data except that first line. This is done by adding 'tail -n+2' to the series of pipes, which skips the first row. Here, we inspect the first 5 rows to make sure the header row has been removed.

cat TSLA.csv | gawk -F, '{ print $0 }' | tail -n+2 | head -n5
2010-06-29,19.000000,25.000000,17.540001,23.889999,23.889999,18766300
2010-06-30,25.790001,30.420000,23.299999,23.830000,23.830000,17187100
2010-07-01,25.000000,25.920000,20.270000,21.959999,21.959999,8218800
2010-07-02,23.000000,23.100000,18.709999,19.200001,19.200001,5139800
2010-07-06,20.000000,20.000000,15.830000,16.110001,16.110001,6866900

Let’s see how many rows we have, excluding the header.

cat TSLA.csv | tail -n+2 | wc -l
2416
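Before passing individual columns on for analysis, it can also help to list the header fields together with their positions, so that the column references $2 to $7 used below are easy to verify. A small sketch with tr and cat:

# split the header on commas, one field per line, and number the lines
head -n1 TSLA.csv | tr ',' '\n' | cat -n
     1  Date
     2  Open
     3  High
     4  Low
     5  Close
     6  Adj Close
     7  Volume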

We run a quick analysis by passing columns 2 to 7 (open price through transaction volume) to the R programming language, then print the statistical summary for each column.

echo "print the first line just as a reference" >> null
cat TSLA.csv | gawk -F, '{ print $0 }' | head -n1
Date,Open,High,Low,Close,Adj Close,Volume
echo "The statistics summary for Open Price" >> null
cat TSLA.csv | gawk -F, '{ print $2 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.14   34.34  213.03  186.27  266.45  673.69
echo "The statistics summary for High Price" >> null
cat TSLA.csv | gawk -F, '{ print $3 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.63   34.90  216.75  189.58  270.93  786.14
echo "The statistics summary for Low Price" >> null
cat TSLA.csv | gawk -F, '{ print $4 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.98   33.59  208.87  182.92  262.10  673.52
echo "The statistics summary for Close Price" >> null
cat TSLA.csv | gawk -F, '{ print $5 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   15.8    34.4   213.0   186.4   266.8   780.0
echo "The statistics summary for Adjusted Close Price" >> null
cat TSLA.csv | gawk -F, '{ print $6 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   15.8    34.4   213.0   186.4   266.8   780.0
echo "The statistics summary for Transaction Volume" >> null
cat TSLA.csv | gawk -F, '{ print $7 }' | tail -n+2 | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
  118500  1899275  4578400  5572722  7361150 47065000
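If R is not installed, a rough equivalent for a single column can be written in gawk itself. The sketch below computes the minimum, mean, and maximum of the closing price (column 5); it does not produce the quartiles that R's summary() gives.

# summarize column 5 (Close) after stripping the header row
cat TSLA.csv | tail -n+2 | gawk -F, '
  NR == 1 { min = $5; max = $5 }
  { sum += $5; if ($5 < min) min = $5; if ($5 > max) max = $5 }
  END { printf "min=%.2f  mean=%.2f  max=%.2f  (%d rows)\n", min, sum/NR, max, NR }'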

Processing Video File/Stream

Now, let's try something else: processing a video using ffmpeg, a fast video and audio converter. ffmpeg can also grab its input from a live audio/video source, such as /dev/video0 if one is available on the installed Linux system (we can use the 'ls /dev' command to list available devices).
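As a hedged illustration of the live-source case (assuming a v4l2 webcam is exposed as /dev/video0, which is typically not the case under WSL), ffmpeg can capture frames directly from the device in much the same way as from a file.

# grab from the webcam and save one frame per second as jpeg files (illustrative only)
ffmpeg -f v4l2 -i /dev/video0 -vf fps=1 webcam%03d.jpg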

Let's take a downloaded video file, WonderfulIndonesia_Bali.mp4, as input (Indonesia.Travel, 2013). This 3'21"-long mp4 video is passed to ffmpeg with '-vf fps=1/60 image%06d.jpg' as its parameter to extract one frame (a video capture) for every 60 seconds of video and save the generated images to files. The three extracted JPEG images are stored in image000001.jpg, image000002.jpg, and image000003.jpg.

The 'ls -al' command also lists the files before and after executing ffmpeg. Note that the output of ffmpeg is truncated below, showing only the beginning and end of its generated messages.

ls -al
total 40108
drwxrwxrwx 1 andisama andisama     4096 Mar 22 05:23 .
drwxrwxrwx 1 andisama andisama     4096 Mar 20 17:33 ..
-rwxrwxrwx 1 andisama andisama   195995 Mar 21 02:21 2020-03-20-data_export.csv
-rwxrwxrwx 1 andisama andisama    41877 Mar 21 02:21 2020-03-20-RedHat_logs_export_visible.csv
-rwxrwxrwx 1 andisama andisama 39797034 Mar 22 04:31 WonderfulIndonesia_Bali.mp4
-rwxrwxrwx 1 andisama andisama       43 Mar 22 04:31 WonderfulIndonesia_Bali_sourcelink.txt
ffmpeg -i ./WonderfulIndonesia_Bali.mp4 -vf fps=1/60 image%06d.jpg
ffmpeg version 3.4.6-0ubuntu0.18.04.1 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 7 (Ubuntu 7.3.0-16ubuntu3)
. . .
. . .
. . .
frame=    3 fps=0.7 q=1.6 Lsize=N/A time=00:03:00.00 bitrate=N/A speed=44.1x
video:268kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
ls -al
total 40384
drwxrwxrwx 1 andisama andisama     4096 Mar 22 05:24 .
drwxrwxrwx 1 andisama andisama     4096 Mar 20 17:33 ..
-rwxrwxrwx 1 andisama andisama   195995 Mar 21 02:21 2020-03-20-data_export.csv
-rwxrwxrwx 1 andisama andisama    41877 Mar 21 02:21 2020-03-20-RedHat_logs_export_visible.csv
-rw-rw-rw- 1 andisama andisama    37287 Mar 22 05:24 image000001.jpg
-rw-rw-rw- 1 andisama andisama   147944 Mar 22 05:24 image000002.jpg
-rw-rw-rw- 1 andisama andisama    89637 Mar 22 05:24 image000003.jpg
-rwxrwxrwx 1 andisama andisama 39797034 Mar 22 04:31 WonderfulIndonesia_Bali.mp4
-rwxrwxrwx 1 andisama andisama       43 Mar 22 04:31 WonderfulIndonesia_Bali_sourcelink.txt

The 3 extracted frames are visualized below.

What’s next then?

Being a Data Scientist (or an aspiring Data Scientist) requires continuous, lifelong learning: exploring and experimenting with datasets, and staying curious enough to keep up to date on the various algorithms and tools for use-case implementations across industries.

Just as scientists in other disciplines often dedicate their lifetimes to their area of research focus, striving to stay current and to improve on the state of the art, it may seem almost impossible to achieve the goals defined at the beginning of the journey. However, with strong persistence and a lot of patience, at the end of the road, although not always, the effort we have put in will have been worth it.

Being handy with practical tools for pre-processing datasets (that is, data wrangling), and keeping that knowledge and experience up to date, before doing any analytics or machine learning / deep learning modeling, is an invaluable skill for a Data Scientist.

Well, let’s get started by doing something. And the right time is now!

References:

Andi Sama et al., 2019a, Think like a Data Scientist.

Firas Dib, 2020, Regular Expression — 101.

GSkinner, 2020, HTML/JS based tool for creating, testing, and learning about Regular Expressions.

Indonesia.Travel, 2013, Wonderful Indonesia — Bali.

MIT, 2020a, Data Wrangling, Lecture 4 of “Missing Semester”, February 2020.

MIT, 2020b, MIT 6.S191 — Introduction to Deep Learning, January 2020.

Sinergi Wahana Gemilang, 2017–2020, Multiple articles on Internet of Things (IoT), Artificial Intelligence (AI) and Quantum Computing, SWG Insight quarterly — multiple editions.

Timo Bozsolik, 2020, “Tesla stock data from 2010 to 2020”.
