diff --git a/dq-visuals/more.../README.md b/dq-visuals/more.../README.md deleted file mode 100644 index cda336c8..00000000 --- a/dq-visuals/more.../README.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -description: Collibra DQ advanced features ---- - -# More... - -When specific DQ challenges require specific DQ detection techniques, Collibra DQ offers a wide variety of advanced functionality. While Schema and Shapes utilize auto-discovery, other detection algorithms are best suited for users who understand their data and have specific use-cases in mind. Read more to understand whether specific dimensions can be applied to your data. diff --git a/dq-visuals/more.../duplicates.md b/dq-visuals/more.../duplicates.md deleted file mode 100644 index 5c14036e..00000000 --- a/dq-visuals/more.../duplicates.md +++ /dev/null @@ -1,39 +0,0 @@ -# Duplicates (advanced) - -{% hint style="info" %} -This is an advanced opt-in feature -{% endhint %} - -## General Ledger: Accounting use-case - -{% embed url="https://owl-analytics.com/general-ledger" %} - -Whether you're looking for a fuzzy matching percent or single client cleanup, Owl's duplicate detection can help you sort and rank the likelihood of duplicate data. - -![](../../.gitbook/assets/owl-dupe-booked.png) - -```bash --f "file:///home/ec2-user/single_customer.csv" \ --d "," \ --ds customers \ --rd 2018-01-08 \ --dupe \ --dupenocase \ --depth 4 -``` - -## User Table has a duplicate user entry - -Carrisa Rimmer vs Carrissa Rimer - -![](../../.gitbook/assets/owl-dupe-carrissa.png) - -## ATM customer data with only an 88% match - -As you can see below, a match of less than 90% is in most cases a false positive. Each dataset is a bit different, but in many cases you should tune your duplicates to roughly a 90+% match for interesting findings.
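The 90% guidance above can be illustrated with a small fuzzy-matching sketch. This uses Python's standard-library `difflib` to produce a similarity percentage; it is an illustrative stand-in, not Owl's actual matching engine:

```python
from difflib import SequenceMatcher

def match_percent(a: str, b: str) -> float:
    """Return a case-insensitive fuzzy-match percentage between two strings."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100, 1)

# The near-duplicate pair from the example above scores comfortably above 90%,
# while an unrelated pair of names scores far lower.
print(match_percent("Carrisa Rimmer", "Carrissa Rimer"))
print(match_percent("Carrisa Rimmer", "Valerie Smith"))
```

Tuning an acceptance threshold to roughly 90+% mirrors the advice above: matches below that level are usually false positives.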
- -![](../../.gitbook/assets/owl-dupes.png) - -## Simple DataFrame Example - -![](../../.gitbook/assets/owl-dupe-df.png) diff --git a/dq-visuals/more.../explorer-2.md b/dq-visuals/more.../explorer-2.md deleted file mode 100644 index a60b4419..00000000 --- a/dq-visuals/more.../explorer-2.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -description: A no-code option to get started quickly and onboard a dataset. ---- - -# Explorer (no-code) - -![](<../../.gitbook/assets/explorer (3).gif>) - -## Getting Started - -This page can be accessed by clicking the Explorer option (the compass icon). - -![](<../../.gitbook/assets/image (87) (1).png>) - -{% hint style="info" %} -All UI functionality has corresponding API endpoints to define, run, and get results programmatically. -{% endhint %} - -## Select Your Data Source - -![](<../../.gitbook/assets/image (89) (1) (1).png>) - -## Create a new DQ Job by clicking +Create DQ Job - -![](<../../.gitbook/assets/image (92) (1) (1).png>) - -#### **View Data is an interactive option to run queries and explore the data** - -#### The bar chart icon will take you to a profile page of the dataset created prior to Explorer 2 - -## Select the Scope and Define a Query - -![](<../../.gitbook/assets/image (98) (1) (1).png>) - -#### Pick Date Column if your dataset contains an appropriate time filter - -#### Click Build Model -> to Save and Continue - -![](<../../.gitbook/assets/image (99) (1) (1) (1).png>) - -## Transform Tab (advanced / optional) - -{% content-ref url="../../dq-job-examples/owlcheck/owlcheck-transform.md" %} -[owlcheck-transform.md](../../dq-job-examples/owlcheck/owlcheck-transform.md) -{% endcontent-ref %} - -#### Click Build Model -> to Save and Continue - -## Profile - -![](<../../.gitbook/assets/image (88) (1).png>) - -#### Use the drop-downs to enable different analyses. Best practice is to leave the defaults.
- -## Pattern (advanced / optional) - -Toggle on Pattern to enable this layer. - -Click +Add to define a group and series of columns. - -{% content-ref url="pattern-mining.md" %} -[pattern-mining.md](pattern-mining.md) -{% endcontent-ref %} - -#### Click Save, then click Outlier to continue - -## Outlier (advanced / optional) - -{% content-ref url="outliers.md" %} -[outliers.md](outliers.md) -{% endcontent-ref %} - -#### Click Save, then click Dupe to continue - -## Dupe (advanced / optional) - -{% content-ref url="duplicates.md" %} -[duplicates.md](duplicates.md) -{% endcontent-ref %} - -#### Click Save, then click Source to continue - -## Source (advanced / optional) - -Navigate to the source dataset. - -Click Preview to interlace the columns. - -Manually map the columns by dragging left to right, or deselect columns. - -{% content-ref url="validate-source.md" %} -[validate-source.md](validate-source.md) -{% endcontent-ref %} - -#### Click Save, then click Save/Run to continue - -## Run - -1. Select an agent -2. Click Estimate Job -3. 
Click Run to start the job - -![](<../../.gitbook/assets/image (90) (1) (1) (1) (1).png>) - -![](<../../.gitbook/assets/image (100) (1) (1).png>) - -{% hint style="info" %} -If you do not see your agent, verify the agent has been assigned to your connection via: -{% endhint %} - -{% content-ref url="../../connecting-to-dbs-in-owl-web/add-connection-to-agent.md" %} -[add-connection-to-agent.md](../../connecting-to-dbs-in-owl-web/add-connection-to-agent.md) -{% endcontent-ref %} - -_Admin Console-->Remote Agent--> (Link icon on far right)-->Map connections to this agent and then reload the explorer page_ diff --git a/dq-visuals/more.../missing-records.md b/dq-visuals/more.../missing-records.md deleted file mode 100644 index bf87ffe0..00000000 --- a/dq-visuals/more.../missing-records.md +++ /dev/null @@ -1,17 +0,0 @@ -# Records (advanced) - -{% hint style="info" %} -This is an advanced opt-in feature -{% endhint %} - -## Where'd my rows go? - -Owl is constantly learning which records or rows in a dataset are most common. In the case below, the NYSE had a reasonable dataset volume (row count). - -![](../../.gitbook/assets/owl-missing-records.png) - -## Row Count Trend - -We can see the rows dipping just slightly outside their predicted range. Arguably a subtle drop, yet it is abnormal for companies that typically trade on the NYSE not to be represented. Were they de-listed? - -![](../../.gitbook/assets/owl-row-trend.png) diff --git a/dq-visuals/more.../outliers.md b/dq-visuals/more.../outliers.md deleted file mode 100644 index 41a370db..00000000 --- a/dq-visuals/more.../outliers.md +++ /dev/null @@ -1,152 +0,0 @@ -# Outliers (advanced) - -{% hint style="info" %} -This is an advanced opt-in feature -{% endhint %} - -## Numerical Outliers - -Kodak Coin! In 2018, Kodak announced Kodak Coin and witnessed a steep change in its stock price. Owl automatically captured this event and provided the ability to drill into the item.
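The intuition behind this kind of detection can be sketched in a few lines. The snippet below flags a value that falls far outside the normal range of a trailing window; it is only an illustration of the concept, not Owl's actual model:

```python
from statistics import mean, stdev

def trailing_outliers(values, lookback=5, k=3.0):
    """Flag indexes whose value falls outside mean +/- k*stddev of the trailing window."""
    flagged = []
    for i in range(lookback, len(values)):
        window = values[i - lookback:i]
        mu, sigma = mean(window), stdev(window)
        if sigma and abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# A Kodak-style series: a stable price, then a sudden jump that gets flagged.
prices = [3.1, 3.0, 3.2, 3.1, 3.0, 3.1, 9.8]
print(trailing_outliers(prices))
```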
- -![](../../.gitbook/assets/owl-outlier-numerical.png) - -### Complex outliers made Simple - -Even though Owl uses complex formulas to identify the correct outliers in a dataset, it uses simple terms when displaying them. As you can see below, the change happened gradually, so comparing only averages or previous values would not convey the full impact of this price change: the day-over-day change is 0%, and the moving/trailing average would have caught up. - -![](<../../.gitbook/assets/owl-outlier-numerical (2).png>) - -## Dynamic history options - -Data may not always enter your data pipeline on time and as expected, due to weekends, holidays, errors, etc. To help capture outliers in spite of gaps, there are two main options: - -* 1\) Extend the lookback period (to 10 days from 5 days, for example) -* 2\) Utilize additional flags per below (fllbminrow is new as of 2021.11) - -| Flag | Description | Example | -| ---------- | ---------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | -| fllbminrow | File Lookback Minimum Rows: determines the minimum number of rows that a previous file scan needs in order to be counted as file lookback | -fllbminrow 1 (counts any DQ scans with one or more rows in minimum history)<br>-fllbminrow 0 (default behavior, row count does not matter) | -| dllb | Date Lookback: determines how many days of learning | -dllb 5 (5 days) | - -## Categorical Outliers - -Categorical Outliers are much different from numerical outliers and require separate techniques to automatically capture meaningful anomalies. The details regarding Owl's methodology and testing can be found below (a three-minute read on the topic). - -{% embed url="https://medium.com/owl-analytics/categorical-outliers-dont-exist-8f4e82070cb2" %} - -Owl will automatically learn the normal behavior of your String and Categorical attributes, such as STOCK, OPTION, FUTURE or state codes such as MD, NC, D.C. When a strange pattern occurs (e.g. NYC instead of NY), Owl will show this as a categorical outlier. - -Owl is able to detect Categorical Outliers both with and without taking time into account. If a time dimension is not provided, Owl will calculate the distribution of categorical values within the available data, and identify the values that fall into the most infrequent percentile (configurable). - -![Categorical Outliers without Time](../../.gitbook/assets/screen-shot-2020-07-07-at-9.43.19-pm.png) - -If a time dimension is provided, Owl will first identify infrequent categories in the historical context and then in the context of the current Owlcheck. Only values that are historically infrequent or non-existent, and are infrequent in the current run, will be considered Outliers. - -![Categorical Outliers with Time](../../.gitbook/assets/screen-shot-2020-07-07-at-9.37.17-pm.png) - -## Training Outlier Detection Model - -Although Owl uses different techniques to detect Numerical and Categorical Outliers, the training process is very similar. - -At a minimum, Owl requires historical data that can be used as the training dataset. If no other input is provided, Owl will calculate the normal range for each selected column and look for numerical and categorical outliers within the training dataset without any further context.
The output will essentially consist of infrequent values that fall outside the normal range of each column. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.17.02-pm.png) - -To obtain more targeted results, Owl requires a "key" column. This column will be used to provide context by grouping each column by the key column. Defining a good key column tends to provide results that are better indicators of actual data quality issues instead of simply infrequent values. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.18.40-pm.png) - -Another input that can make outlier detection more precise is a date/time column and a lookback period. This enables a more precise calculation of the normal range for a column and, in the case of numerical outliers, makes it possible for Owl to establish a trend. Given a time column and key column, Owl will not only identify numerical outliers, it will plot the historical trend of the column value trailing the outlier. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.19.14-pm.png) - -Owl also allows further refinement of the time dimension by defining time bins and processing intervals. By default, when given a time column, Owl will bin the data into days and process the data in daily intervals. However, if the data is high frequency, day bins and day intervals might be too coarse-grained. In this case, it might make more sense to group the data into bins on the minute and process the data in hour or minute intervals. The same concept applies in the other direction. What if the data is already aggregated on the month or year? In this case, it makes more sense to set the bins and intervals to month by month or month by year. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.20.18-pm.png) - -Some data may be measured in really small or large units or contain a lot of noise. In this case, Owl allows the user to adjust the sensitivity level and unit of measure for outlier detection on each column.
Click the advanced tab to make these adjustments. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.20.33-pm.png) - -Once Outlier detection is complete for a given run, it's time to tune the scoring of the model. Owl allows the user to label any outlier findings as legitimate, thus preventing that outlier from being detected in the future or affecting the score of the current run. In addition, it is possible to define the significance of an outlier finding to a given dataset. This can be accomplished by setting how many quality points should be deducted for each outlier finding on any given run on that dataset. It is also possible to adjust sensitivity and unit of measure of future runs by clicking on the small gear icon on the far left of the screen. - -![](../../.gitbook/assets/screen-shot-2020-07-07-at-8.38.05-pm.png) - -## Spark DataFrame Example - -![](../../.gitbook/assets/owl-categorical-outlier.png) - -#### Real World Example - -Imagine you are the data manager at the Iowa Department of Commerce, Alcoholic Beverage Division. As part of the Department's open data initiative, the monthly [Iowa liquor sales data](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) are available to the public for analysis. (Thank you, Iowa!) - -An Iowan data analyst emails you about a data quality issue with **address** for store #2508 in the year 2016. You quickly run a SQL query on your data warehouse to see what is going on. - -```sql --- Assuming Postgres DB -select date_trunc('MONTH', "date") "date_month", address, count(*) "sales_count" -from iowa_liquor_sales -where "date" >= '2016-01-01' and "date" < '2017-01-01' and store_number = '2508' -group by date_trunc('MONTH', "date"), address -order by date_month, address -``` - -| date\_month | address | sales\_count | -| ------------------- | ------------------------- | ------------ | -| 2016-01-01 00:00:00 | 1843 JOHNSON AVENUE, N.W. | 422 | -| 2016-02-01 00:00:00 | 1843 JOHNSON AVENUE, N.W. 
| 451 | -| 2016-03-01 00:00:00 | 1843 JOHNSON AVENUE, N.W. | 579 | -| 2016-04-01 00:00:00 | 1843 JOHNSON AVENUE, N.W. | 404 | -| 2016-05-01 00:00:00 | 1843 Johnson Avenue, N.W. | 625 | -| 2016-06-01 00:00:00 | 1843 Johnson Avenue, N.W. | 695 | -| 2016-07-01 00:00:00 | 1843 Johnson Avenue, N.W. | 457 | -| 2016-08-01 00:00:00 | 1843 Johnson Avenue, N.W. | 744 | -| 2016-09-01 00:00:00 | 1843 Johnson Avenue, N.W. | 681 | -| 2016-10-01 00:00:00 | 1843 Johnson Avenue, N.W. | 728 | -| 2016-11-01 00:00:00 | 1843 Johnson Avenue, N.W. | 1062 | -| 2016-12-01 00:00:00 | 1843 Johnson Avenue, N.W. | 992 | - -Because `store_number` is a unique number assigned to the store that ordered the liquor, the inconsistent `address` values for the same store pose a data quality problem. But `address` is a string value that can take many forms. For store #2508, the reported address value shifted away from all capital letters starting in May 2016. In other cases, it might be a completely different behavior change that you would have to manually check one by one. With over 2,000 unique stores, 19 million rows, and 8 years of data, you need an automated way to detect meaningful categorical outliers. - -The following command shows an example of running monthly OwlDQ Checks, from January 2016 to December 2016. Each monthly run looks back 3 months of data to establish a baseline for categorical columns that you suspect would have similar data quality issues: `store_name`, `address`, and `city`. - -```bash -/opt/owl/bin/owlcheck - # connection information to data - -lib "/opt/owl/drivers/postgres/" -driver "org.postgresql.Driver" - -c, "jdbc:postgresql://localhost:5432/postgres" - -u, "postgres", "-p", "password" - # Specify dataset name - -ds "iowa_liquor_sales_by_store_number_monthly" - # Specify date filter for the last run, e.g. 
date >= '2016-12-01' and date < '2017-01-01' - -rd "2016-12-01" -rdEnd "2017-01-01" - # SQL query template (${rd} and ${rdEnd} match -rd and -rdEnd) - -q "select distinct on (date, store_number) date, store_number, store_name, address, city - from iowa_liquor_sales where date >= '${rd}' and date < '${rdEnd}' " - # Turn on Outliers - -dl - # Group on store_number (optional if no grouping) - -dlkey "store_number" - # Specify column that is of date type (optional, if running OwlCheck without time context) - -dc "date" - # Specify columns to run Outlier analysis (if not specified, all the columns in query are included in analysis) - -dlinc "store_name,address,city" - # Specify 3 month lookback for each OwlCheck - -dllb 3 - # Run Monthly OwlCheck - -tbin "MONTH" - # "backrun": a convenient way to run the 12 preceding monthly OwlChecks - -br 12 -``` - -**Results** - -The `-br 12` option ran 12 monthly OwlChecks for every month of 2016. The figure below shows the OwlCheck Hoot page for the latest run of dataset `iowa_liquor_sales_by_store_number_monthly`. The Hoot page shows that OwlCheck identified 24 Outliers among 4.8k rows of unique date x store\_number for the month of December 2016. - -![Monthly OwlCheck for 2016-12-01](<../../.gitbook/assets/image (39).png>) - -Since the original data quality issue that inspired us to run OwlCheck is from May 2016, we can navigate to the specific run date 2016-05-01 by clicking on the line graph on top. Then searching for store #2508 in the **key** column shows an outlier detected for **column** `address`. Press \[+] for that row to see contextual details about this detected value. - -![Monthly OwlCheck for 2016-05-01. The drill-in outlier details for store #2508 are shown](<../../.gitbook/assets/image (36).png>) - -We can verify that OwlCheck identified the outlier of interest among 60 other data quality issues.
Using OwlCheck, you can identify issues at scale for past data (using backrun), current data (using a simple OwlCheck), and future data (using scheduled jobs). diff --git a/dq-visuals/more.../overview.md b/dq-visuals/more.../overview.md deleted file mode 100644 index c2ce9cf4..00000000 --- a/dq-visuals/more.../overview.md +++ /dev/null @@ -1,79 +0,0 @@ -# Summary - -## **Click or Code** - -Collibra DQ offers easy-to-use no-code and low-code options for getting started quickly. Alternatively, more technical users may prefer programmatic APIs. - -## **Core Components** - -Collibra DQ offers a full DQ suite to cover the unique challenges of each dataset. - -**9 Dimensions of DQ** - -1. Behaviors - Data observability -2. Rules - SQL-based rules engine -3. Schema - When columns are added or dropped -4. Shapes - Typos and Formatting Anomalies -5. Duplicates - Fuzzy matching, Identify similar but not exact entries -6. Outliers - Anomalous records, clustering, time-series, categorical -7. Pattern - Classification, cross-column & parent/child anomalies -8. Record - Deltas for a given column(s) -9. Source - Source to target reconciliation - -[Check out our videos to learn more](https://www.youtube.com/channel/UCKMcJ5NRiCDZQxBvSsVtTXw/videos) - -## **Behavior** - -**Imagine a column going null, or automatic row count checks: does your data behave, look, and feel the same way it has in the past?** - -![](../../.gitbook/assets/behavior.jpg) - -## **Rules** - -**Ensures only values compliant with your data rules are allowed within a data object.
** - -![](../../.gitbook/assets/rules.jpg) - -## **Schema** - -**Columns added or dropped.** - -![](../../.gitbook/assets/schema.jpg) - -## **Shapes** - -**Infrequent formats.** - -![](<../../.gitbook/assets/shapes (1).jpg>) - -## Dupes - -**Fuzzy matching to identify entries that have been added multiple times with similar but not exact detail.** - -![](../../.gitbook/assets/dupes.jpg) - -## **Outliers** - -**Data points that differ significantly from other observations.** - -![](../../.gitbook/assets/outliers.jpg) - -## **Pattern** - -**Recognizing relevant patterns between data examples.** - -![](../../.gitbook/assets/pattern.jpg) - -## **Source** - -**Validating source to target accuracy.** - -![](../../.gitbook/assets/source.jpg) - -## **Record** - -**Deltas for a given column.** - -![](../../.gitbook/assets/record.jpg) diff --git a/dq-visuals/more.../pattern-mining.md b/dq-visuals/more.../pattern-mining.md deleted file mode 100644 index 90c35d6b..00000000 --- a/dq-visuals/more.../pattern-mining.md +++ /dev/null @@ -1,23 +0,0 @@ -# Patterns (advanced) - -{% hint style="info" %} -This is an advanced opt-in feature -{% endhint %} - -Owl uses the latest advancements in data science and ML to find deep patterns across millions of rows and columns. In the example below, it noticed that Valerie is likely the same user, as she has the same customer_id and card_number but recently showed up with a different last name. A possible misspelling or data quality issue? - -![](../../.gitbook/assets/owl-patterns.png) - -## Training Anti-Pattern Detection Model - -When the Patterns feature is enabled, Owl will build a collection of patterns that it identifies within the data. It will then use that collection to identify values that break established patterns. For example, in the image below, Owl learned that a bike route that starts at "MLK library" will end at "San Jose Diridon Caltrain Station".
However, when the current day's data is cross-referenced against this pattern, Owl detects an anti-pattern where a trip starts at "MLK Library" but ends at "Market at 4th". Owl raises this anti-pattern as a data quality issue and highlights what it believes the "end_station" value should have been. - -To build a Pattern model, Owl requires historical data that contains the valid patterns and, if possible, a date/time column. The user then needs to define the date/time column, the lookback period, and what columns make up the pattern. In the image below, the pattern was composed of "end_station", "start_terminal", "start_station". - -It is possible that an apparent anti-pattern finding is actually valid data and not a data quality issue. In this case, Owl allows the user to further instruct the existing Patterns model on how to properly score and handle the findings. For example, if it turns out that "Market at 4th" is actually a valid "end_station" for a bike trip, the user can negate the identified anti-pattern by labeling it as valid. This action will instruct Owl to not raise this particular anti-pattern again. Owl will also rescore the current Owlcheck results to reflect the user's feedback. In addition, it is possible to define the weight of an anti-pattern finding on the current dataset by setting the numerical value to deduct per finding. - -![](../../.gitbook/assets/screen-shot-2020-03-19-at-5.55.49-pm.png) - -## Fraud Detection? - -Think about a scenario where a dataset has an SSN column along with FNAME, LNAME and many others. What if your traditional rules engine passes because one of the rows has a valid SSN and a valid Name but the SSN doesn't belong to that person (his or her name and address, etc.)? This is where data mining can derive more sophisticated insights than a rules-based approach.
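A minimal sketch of the idea (illustrative only, not Owl's actual algorithm): learn which value each key historically maps to, then flag current rows that break the learned mapping, such as a valid SSN arriving with the wrong last name. The sample records below are hypothetical.

```python
from collections import Counter, defaultdict

def learn_patterns(history, key, col):
    """Learn the most common value of `col` for each `key` in historical rows."""
    seen = defaultdict(Counter)
    for row in history:
        seen[row[key]][row[col]] += 1
    return {k: c.most_common(1)[0][0] for k, c in seen.items()}

def anti_patterns(current, patterns, key, col):
    """Return rows whose known key maps to an unexpected value."""
    return [r for r in current
            if r[key] in patterns and r[col] != patterns[r[key]]]

history = [
    {"ssn": "123-45-6789", "lname": "Smith"},
    {"ssn": "123-45-6789", "lname": "Smith"},
    {"ssn": "987-65-4321", "lname": "Jones"},
]
today = [{"ssn": "123-45-6789", "lname": "Smyth"}]  # valid SSN, wrong owner

patterns = learn_patterns(history, "ssn", "lname")
print(anti_patterns(today, patterns, "ssn", "lname"))
```

A single-column rule ("SSN is well formed") passes here; only the cross-column pattern check catches the mismatch.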
diff --git a/dq-visuals/more.../schema-evolution.md b/dq-visuals/more.../schema-evolution.md deleted file mode 100644 index f0bfe8d6..00000000 --- a/dq-visuals/more.../schema-evolution.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -description: Detect schema evolution and unexpected schema changes. ---- - -# Schema (automatic) - -Dataset schemas are the columns or fields that define the dataset. They are often located in the header row of a tabular file or database table. However, JSON and XML are two examples of formats that include schema columns that are not in the header but rather nested throughout the document. OwlDQ automatically detects the schema columns and reads or infers their data types (varchar, string, double, decimal, int, date, timestamp, etc.) without needing to turn on any features. Owl observes each dataset, so if a column is ever altered, removed, or added, it will automatically raise the event via its standard composite scoring system. - -#### Scoring... Alerting... Schema Detection... Automatically - -![](../../.gitbook/assets/owl-schema.png) - -Schema Evolution is one of Owl's 9 DQ dimensions. It can be an important measurement for data stewards to understand how the dataset is changing over time. The orange bar on the chart shows a change in schema and allows for drilling in over time. diff --git a/dq-visuals/more.../shapes.md b/dq-visuals/more.../shapes.md deleted file mode 100644 index 6b7c4b34..00000000 --- a/dq-visuals/more.../shapes.md +++ /dev/null @@ -1,17 +0,0 @@ -# Shapes (automatic) - -Owl will automatically detect inconsistencies in data formats. These inconsistencies are where Data Scientists spend an enormous amount of time cleaning the data before building a ML model. Many reports have documented that over 80% of the time it takes to build a credible model comes from first understanding all the different formats and then writing munging or ETL-style code to clean it before processing. 
What about all the patterns the process or person doesn't even know about? - -![](../../.gitbook/assets/owl-phone-shapes.png) - -### Drill-in to any Shape anomaly and see a visual example - -See an itemized list view of the most infrequent or odd shapes in your datasets. - -![](../../.gitbook/assets/owl-shape-drillin.png) - -### Shape Tuning - -Adjust shape detection by clicking the gear icon in the upper right corner of the SHAPE tab on the HOOT page. - -![](../../.gitbook/assets/shape-tuning-owl.png) diff --git a/dq-visuals/more.../validate-source.md b/dq-visuals/more.../validate-source.md deleted file mode 100644 index 761dac2f..00000000 --- a/dq-visuals/more.../validate-source.md +++ /dev/null @@ -1,49 +0,0 @@ -# Source (advanced) - -{% hint style="info" %} -This is an advanced opt-in feature -{% endhint %} - -## Does your data lake reconcile with your upstream system? - -Copying data from one system to another is probably the most common data activity in all organizations. Owl refers to this as source to target. As simple as this activity sounds, Owl has found that most of the time files and database tables are not being copied properly. To ensure and protect against target systems getting out of sync or not matching the originating source, turn on `-vs` to validate that the source matches the target. - -## A row count is not enough... - -The most common check we encounter is a row count. However, a row count does not account for: - -* Schema differences - Boolean to Int, Decimal to Double with precision loss, Timestamps and Dates -* Value differences - Char or Varchars with whitespace vs Strings, null chars, delimiter fields that cause shifting, and much more. - -![](../../.gitbook/assets/screen-shot-2019-10-01-at-8.50.39-pm.png) - -## OwlCheck Created from Wizard - -The Owl Wizard GUI creates the OwlCheck below, which can be executed from the GUI by clicking RUN or by pasting it at the command line. 
- -```bash --lib /home/ec2-user/owl/drivers/valdrivers \ --driver org.postgresql.Driver \ --u user -p password \ --c "jdbc:postgresql://ec2-34-227-151-67.compute-1.amazonaws.com:5432/postgres" \ --q "select * from public.dateseries4" \ --ds psql_dateseries2 -rd 2018-11-07 \ --srcq select dz, sym as symz, high as highz, low as lowz, close as closez, volume as volumez, changed as changedz, changep as changepz, adjclose as adjclosez, open as openz from lake.dateseries \ --srcu user \ --srcp password \ --srcds mysqlSYMZ \ --srcd com.mysql.cj.jdbc.Driver \ --srcc "jdbc:mysql://owldatalake.chzid9w0hpyi.us-east-1.rds.amazonaws.com:3306/lake" \ --valsrckey "SYMZ" \ --vs \ --valsrcinc "dz,symz,openz,highz,lowz,closez,volumez,changedz,changepz,adjclosez" -``` - -### End of Day Stock Data from Oracle to MySQL - -In this example, we loaded NYSE_EOD data into both Oracle and MySQL and then used Owl's Source Validation feature. We see three main classes of issues: 1) the row count is off by one row, meaning a row was dropped or went missing when the data was copied; 2) the schemas don't exactly match; and 3) in two cases the values differ at the cell level (NULL vs. NYSE and 137.4 vs. 137.42). - -![](../../.gitbook/assets/screen-shot-2019-10-09-at-9.40.55-am.png) - -### Latest View in 2.13+ - -![](../../.gitbook/assets/source.png)
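The three classes of findings above (row counts, schema differences, cell-level values) can be approximated with a small keyed comparison. This is an illustrative sketch with hypothetical sample rows, not Owl's `-vs` implementation:

```python
def validate_source(source, target, key):
    """Compare two keyed extracts: row counts, column sets, and cell values."""
    issues = []
    if len(source) != len(target):
        issues.append(f"row count: source={len(source)} target={len(target)}")
    src = {r[key]: r for r in source}
    tgt = {r[key]: r for r in target}
    # For keys present on both sides, compare every column that appears in either row.
    for k in src.keys() & tgt.keys():
        for c in src[k].keys() | tgt[k].keys():
            if src[k].get(c) != tgt[k].get(c):
                issues.append(f"{key}={k} {c}: {src[k].get(c)!r} != {tgt[k].get(c)!r}")
    return issues

source = [{"sym": "AAA", "close": 137.42}, {"sym": "BBB", "close": 12.0}]
target = [{"sym": "AAA", "close": 137.4}]  # dropped row plus a truncated value
print(validate_source(source, target, "sym"))
```

This mirrors why a row count alone is not enough: the dropped row and the truncated cell value are reported as separate findings.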