Yulia's first documentation article #239
YLukashina wants to merge 3 commits into shuttle-hq:master from YLukashina:master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,257 @@ | ||
| --- | ||
| title: Hands-On Tutorial - Creating fake user data in GA to train your number-generating skills | ||
| image: https://storage.googleapis.com/getsynth-public/media/users.jpg | ||
| description: In this tutorial we're going to create fake GA data with synth! | ||
| --- | ||
|
|
||
|  | ||
|
|
||
| In this tutorial, we will show you that even if the target data looks tricky at first glance, with [`synth`](http://getsynth.com), there is always a solution. We will focus on numeric data. | ||
|
|
||
| Together, we will create fake Google Analytics data imitating the behavior of individual website users at the session level, also known as [User Explorer](https://support.google.com/analytics/answer/6339208?hl=en#zippy=%2Cin-this-article). | ||
|
|
||
| For those who actually work with GA: this kind of data may help your data science team in testing user profiling algorithms for e-commerce or online campaigning. | ||
|
|
||
| ## Prerequisite: Install Synth | ||
| If you have not yet done so, you need to get `synth` running on your machine. It can be installed with a single line from a CLI tool of your choice as shown [here](https://getsynth.com/download). | ||
|
|
||
| ## Elaborating a Data Model | ||
|
|
||
| In the GA GUI, a target report would look like this: | ||
| | User ID | Sessions | Avg. Session Duration | Bounce rate | Revenue | Transactions | Goal Conversion Rate | | ||
| | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | | ||
| | fcnMA33 | 64 | 00:06:43 | 1.56% | $0.00 | 0 | 96.88% | | ||
| | LAhfn66 | 48 | 00:07:59 | 2.08% | $0.00 | 0 | 33.33% | | ||
|
|
||
| Before we start writing code, let's look at the data closely. | ||
|
|
||
| The table shows values in several formats. In reality, though, we can reduce them to just two data types: integers and floating-point numbers. | ||
|
|
||
| Integer columns are: | ||
| * Sessions | ||
| * Transactions | ||
|
|
||
| Floating-point columns are: | ||
| * Bounce rate | ||
| * Avg. Session Duration | ||
| * Revenue | ||
| * Goal Conversion Rate | ||
|
|
||
| All columns need values greater than or equal to zero. The percentage columns also need an upper limit. We will express the duration column in decimal minutes, which is a bit more intuitive than hours and seconds, and we will give it a reasonable maximum as well. | ||
|
|
||
| --- | ||
| **note** | ||
| `revenue` and `transactions` have a dependency: revenue can only be above zero if at least one transaction took place. \ | ||
| You will need to filter your data later and generate additional rows if too few rows remain. | ||
|
|
||
| --- | ||
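
The dependency described in the note can be enforced after generation. The following is a minimal sketch, assuming the generated records have already been loaded into a list of Python dictionaries that use the field names from this tutorial. `is_consistent` and `filter_consistent` are hypothetical helper names, not part of `synth`, and the sample rows are made up for illustration.

```python
def is_consistent(user):
    """Return True if revenue is zero or at least one transaction took place."""
    return user["revenue"] == 0 or user["transactions"] > 0


def filter_consistent(users):
    """Keep only the rows that respect the revenue/transactions dependency."""
    return [user for user in users if is_consistent(user)]


# Made-up rows, for illustration only.
sample = [
    {"user id": "aaaaa11", "revenue": 0.0, "transactions": 0},   # valid: no revenue, no transactions
    {"user id": "bbbbb22", "revenue": 50.0, "transactions": 0},  # invalid: revenue without a transaction
    {"user id": "ccccc33", "revenue": 50.0, "transactions": 2},  # valid: revenue backed by transactions
]

kept = filter_consistent(sample)
print([user["user id"] for user in kept])
```

If fewer rows survive the filter than you need, you can generate a larger array and filter again.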
|
|
||
| The same data in JSON format: | ||
|
|
||
| ```json | ||
| { | ||
| "users":[ | ||
| { | ||
| "user id": "fcnMA33", | ||
| "sessions": 64, | ||
| "avg session duration": 6.72, | ||
| "bounce rate": 1.56, | ||
| "revenue": 0.00, | ||
| "transactions": 0, | ||
| "goal conversion rate": 96.88 | ||
| }, | ||
| { | ||
| "user id": "LAhfn66", | ||
| "sessions": 48, | ||
| "avg session duration": 7.98, | ||
| "bounce rate": 2.08, | ||
| "revenue": 0.00, | ||
| "transactions": 0, | ||
| "goal conversion rate": 33.33 | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
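
Expressing a duration in minutes means converting the report's `hh:mm:ss` strings into decimal numbers. A small sketch of that conversion, assuming two decimal places are enough; `hms_to_minutes` is a hypothetical helper name introduced here for illustration:

```python
def hms_to_minutes(hms):
    """Convert an "hh:mm:ss" string to decimal minutes, rounded to two places."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return round(hours * 60 + minutes + seconds / 60, 2)


# The two durations from the sample report.
for duration in ("00:06:43", "00:07:59"):
    print(duration, "->", hms_to_minutes(duration))
```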
|
|
||
| In the `synth` [number generators reference](https://www.getsynth.com/docs/content/number), you will see that the tool offers a single generator for both types of numbers, and that its output can be customized with different arguments and parameters. | ||
|
|
||
|
|
||
| ## Creating Namespace and Collection | ||
|
|
||
| Similar to inserting data into a database, we need to create a "bucket" first: | ||
| * a [namespace](https://www.getsynth.com/docs/getting_started/core-concepts#namespaces) where you will store your data [schemas](https://www.getsynth.com/docs/getting_started/schema): a new directory on your storage drive | ||
| * a [collection](https://www.getsynth.com/docs/getting_started/core-concepts#collections) for each schema: a JSON file | ||
|
|
||
| In your CLI tool, execute the following commands: | ||
|
|
||
| ```commandline | ||
| $ mkdir fake_ga | ||
| ``` | ||
| ```commandline | ||
| $ cd fake_ga && touch users.json | ||
| ``` | ||
|
|
||
| Now, we can start coding the data model. | ||
|
|
||
| ## Coding Your Data Model | ||
|
|
||
| You need to open the `users.json` file in a code editor and reproduce the data model. | ||
|
|
||
| ```json | ||
|
||
| { | ||
| "type": "array", | ||
| "length": 2, | ||
| "content": { | ||
| "type": "object", | ||
| "user id": {}, | ||
| "sessions": {}, | ||
| "avg session duration": {}, | ||
| "bounce rate": {}, | ||
| "revenue": {}, | ||
| "transactions": {}, | ||
| "goal conversion rate": {} | ||
| } | ||
| } | ||
| ``` | ||
| Then, let's fill in the fields with their generators. | ||
|
|
||
| ### User ID: A String Generator With Pattern | ||
|
|
||
| We will use [this generator](https://www.getsynth.com/docs/content/string#pattern) to create user IDs. We can make them look GA-authentic by specifying a regex pattern: | ||
|
||
|
|
||
| ```json | ||
| { | ||
| "type": "string", | ||
| "pattern": "[a-zA-Z]{5}[0-9]{2}" | ||
| } | ||
| ``` | ||
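
Before relying on the pattern, it can help to check it against the IDs from the sample report. The sketch below uses Python's `re` module for that check, on the assumption that a pattern this simple behaves the same way in Python as in the regex engine `synth` uses. `looks_like_user_id` is a hypothetical helper name.

```python
import re

# The same pattern as in the generator: five letters followed by two digits.
USER_ID_PATTERN = re.compile(r"[a-zA-Z]{5}[0-9]{2}")


def looks_like_user_id(value):
    """Return True if the whole string matches the user ID pattern."""
    return USER_ID_PATTERN.fullmatch(value) is not None


# Two IDs from the sample report, then two strings that should be rejected.
for candidate in ("fcnMA33", "LAhfn66", "abc12", "fcnMA3X"):
    print(candidate, looks_like_user_id(candidate))
```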
| ### A Number Generator With a Range | ||
| This generator is explained [here](https://www.getsynth.com/docs/content/number#range). It lets you define the lower and upper limits as well as the step: the minimum difference between two values. | ||
|
|
||
| The `number` generator also has a `subtype` parameter that lets you tell `synth` directly what kind of numeric data you need. The values for this parameter come from the [Rust programming language](https://doc.rust-lang.org/reference/types/numeric.html). Examples: | ||
| * use `i32`, `i64`, or other subtypes to get integers | ||
| * use `f32` or `f64` to get floating-point numbers | ||
|
|
||
| Alternatively, simply enter the values in the desired format into the data model, and `synth` will infer what you want it to generate. | ||
|
|
||
| --- | ||
| **note** | ||
| By default, `synth` excludes the upper limit value from the dataset. To include the maximum value you need, either increase the upper limit or set the `include_high` parameter to `true`. | ||
|
|
||
| --- | ||
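
The difference between an excluded and an included upper limit is easiest to see by listing the candidate values. The sketch below only illustrates the idea of a half-open versus a closed range; it is not how `synth` samples values internally. `range_values` is a hypothetical helper name, and integers are used to avoid floating-point rounding.

```python
def range_values(low, high, step, include_high=False):
    """List the values low, low + step, ... that fall within the range.

    The upper limit is excluded unless include_high is True.
    """
    stop = high + 1 if include_high else high
    return list(range(low, stop, step))


print(range_values(0, 10, 5))                     # upper limit excluded
print(range_values(0, 10, 5, include_high=True))  # upper limit included
```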
|
|
||
| ### Integer Columns | ||
| #### Sessions (The Number of Sessions) | ||
| The number of sessions can be anywhere from zero to infinity, but we will set an upper limit of 1,000. | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "range": { | ||
| "low": 0, | ||
| "high": 1000, | ||
| "step": 1, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| With a `subtype` parameter: | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "subtype": "i32", | ||
| "range": { | ||
| "low": 0, | ||
| "high": 1000, | ||
| "step": 1, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| #### Transactions | ||
| Let's keep it close to reality and set a maximum here, too. | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "range": { | ||
| "low": 0, | ||
| "high": 100, | ||
| "step": 1, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| ### Floating Number Columns | ||
| Floating-point numbers will be generated as soon as you enter a floating-point value in the `step` parameter. | ||
| #### Bounce Rate and Goal Conversion Rate | ||
| We could have converted the percentages to fractions and set both fields to a range between `0` and `1`, but we will keep them as they appear in the original report. | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "range": { | ||
| "low": 0, | ||
| "high": 100, | ||
| "step": 0.01, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| #### Average Session Duration | ||
| The value `0.017` follows from our decision to use minutes: one second is 1/60 of a minute, which is approximately 0.017. | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "range": { | ||
| "low": 0.017, | ||
| "high": 10, | ||
| "step": 0.017, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| #### Revenue | ||
| ```json | ||
| { | ||
| "type": "number", | ||
| "range": { | ||
| "low": 0, | ||
| "high": 10000, | ||
| "step": 10, | ||
| "include_high": true | ||
| } | ||
| } | ||
| ``` | ||
| ### Putting It All Together | ||
| Add the generators to `users.json`. | ||
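
If you would rather not paste each generator by hand, the schema can also be assembled programmatically. The following is a sketch only: it reuses the generator snippets from the sections above, assumes the revenue upper limit is meant to be 10000, and prints the result instead of writing the file. The `number` helper is a hypothetical name introduced here, not part of `synth`.

```python
import json


def number(low, high, step, subtype=None):
    """Build a `number` generator with an inclusive upper limit."""
    generator = {
        "type": "number",
        "range": {"low": low, "high": high, "step": step, "include_high": True},
    }
    if subtype is not None:
        generator["subtype"] = subtype
    return generator


# Bounce rate and goal conversion rate share the same percentage generator.
percentage = number(0, 100, 0.01)

schema = {
    "type": "array",
    "length": 2,
    "content": {
        "type": "object",
        "user id": {"type": "string", "pattern": "[a-zA-Z]{5}[0-9]{2}"},
        "sessions": number(0, 1000, 1),
        "avg session duration": number(0.017, 10, 0.017),
        "bounce rate": percentage,
        "revenue": number(0, 10000, 10),  # assumption: upper limit of 10000
        "transactions": number(0, 100, 1),
        "goal conversion rate": percentage,
    },
}

print(json.dumps(schema, indent=2))
```

Redirecting the printed output into `users.json` gives you the completed data model.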
| ## Executing Your Data Model | ||
| In your CLI tool, run the data model: | ||
| ```commandline | ||
| synth generate users.json fake_ga | ||
| ``` | ||
| In our case, `users.json` is optional since we have only one collection. If you have several collections in the same namespace and want to generate data from only one schema, you can specify the collection as in the code snippet above. | ||
|
|
||
| Your data will be printed to the console. Have a look at this: | ||
|
|
||
| ```json | ||
| { | ||
| "users":[ | ||
| { | ||
| "avg session duration":0.272, | ||
| "bounce rate":0.26, | ||
| "goal conversion rate":0.21, | ||
| "revenue":0.7493000000000001, | ||
| "sessions":850, | ||
| "transactions":14, | ||
| "user id":"yCWXw25" | ||
| }, | ||
| { | ||
| "avg session duration":8.347, | ||
| "bounce rate":0.5, | ||
| "goal conversion rate":0.47000000000000003, | ||
| "revenue":0.8706, | ||
| "sessions":926, | ||
| "transactions":44, | ||
| "user id":"UTqsk63" | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
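
To compare the generated records with the GA report, you can format them back into the report's layout. This is a minimal sketch, assuming the records use the field names from this tutorial and that durations are in decimal minutes; `minutes_to_hms` and `to_report_row` are hypothetical helper names.

```python
def minutes_to_hms(minutes):
    """Convert decimal minutes to an "hh:mm:ss" string, rounded to whole seconds."""
    total_seconds = round(minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    mins, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{mins:02d}:{seconds:02d}"


def to_report_row(user):
    """Format one generated record using the column names of the GA report."""
    return {
        "User ID": user["user id"],
        "Sessions": user["sessions"],
        "Avg. Session Duration": minutes_to_hms(user["avg session duration"]),
        "Bounce rate": f'{user["bounce rate"]:.2f}%',
        "Revenue": f'${user["revenue"]:.2f}',
        "Transactions": user["transactions"],
        "Goal Conversion Rate": f'{user["goal conversion rate"]:.2f}%',
    }


# The first record from the console output above.
record = {
    "avg session duration": 0.272,
    "bounce rate": 0.26,
    "goal conversion rate": 0.21,
    "revenue": 0.7493000000000001,
    "sessions": 850,
    "transactions": 14,
    "user id": "yCWXw25",
}

row = to_report_row(record)
print(row)
```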