@Maciej Obuchowski has joined the channel
@Paweł Leszczyński has joined the channel
@Jakub Dardziński has joined the channel
@Michael Robinson has joined the channel
https://github.com/OpenLineage/OpenLineage/pull/2260 fun PR incoming
*Thread Reply:* hey look, more fun https://github.com/OpenLineage/OpenLineage/pull/2263
*Thread Reply:* nice to have fun with you Jakub
*Thread Reply:* Can't wait to see it on the 1st January.
*Thread Reply:* Ain’t no party like a dev ex improvement party
*Thread Reply:* Gentoo installation party is in similar category of fun
@Paweł Leszczyński approved PR #2661 with minor comments, I think the enum defined in the db layer is one comment we’ll need to address before merging; otherwise solid work dude 👌
> _Minor_: We can consider defining a `_run_state` column and eventually dropping `event_type`. That is, we can consider columns prefixed with `_` to be "remappings" of OL properties to Marquez.
-> didn't get this one. Is this for now, or some future plans?
*Thread Reply:* I will then replace enum with string
also, what about this PR? https://github.com/MarquezProject/marquez/pull/2654
*Thread Reply:* this is the next to go
*Thread Reply:* and i consider it ready
*Thread Reply:* Then we have a draft one with streaming support https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of lineage endpoint working for streaming jobs
*Thread Reply:* I still need to work on #2682 but you can review #2654. once you get some sleep, of course 😉
Got the doc + poc for hook-level coverage: https://docs.google.com/document/d/1q0shiUxopASO8glgMqjDn89xigJnGrQuBMbcRdolUdk/edit?usp=sharing
*Thread Reply:* did you check if `LineageCollector` is instantiated once per process?
*Thread Reply:* Using it only via `get_hook_lineage_collector`
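For reference, the usual shape of such an accessor, as a minimal sketch (assuming `LineageCollector` is the class from the PoC doc, not actual Airflow code):

```python
_collector = None


def get_hook_lineage_collector():
    # Module-level singleton: the collector is instantiated once per process,
    # as long as every caller goes through this accessor.
    global _collector
    if _collector is None:
        _collector = LineageCollector()  # hypothetical class from the PoC
    return _collector
```

Note this only guarantees a single instance if nothing constructs `LineageCollector()` directly.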
Anyone have thoughts about how to address the question about “pain points” here? https://openlineage.slack.com/archives/C01CK9T7HKR/p1700064564825909. (Listing pros is easy — it’s the cons we don’t have boilerplate for)
*Thread Reply:* Maybe something like “OL has many desirable integrations, including a best-in-class Spark integration, but it’s like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time.”
*Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"
Apparently an admin can view a Slack archive at any time at this URL: https://openlineage.slack.com/services/export. Only public channels are available, though.
*Thread Reply:* you are now admin
have we discussed adding column level lineage support to Airflow? https://marquezproject.slack.com/archives/C01E8MQGJP7/p1700087438599279?thread_ts=1700084629.245949&cid=C01E8MQGJP7
*Thread Reply:* we have it in SQL operators
*Thread Reply:* OOh any docs / code? or if you’d like to respond in the MQZ slack 🙏
*Thread Reply:* I’ll reply there
Any opinions about a free task management alternative to the free version of Notion (10-person limit)? Looking at Trello for keeping track of talks.
*Thread Reply:* What about GitHub projects?
*Thread Reply:* Projects is the way to go, thanks
*Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that’s missing that we could use is a built-in date field for alerting about upcoming deadlines…
worlds are colliding: 6point6 has been acquired by Accenture
*Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6
*Thread Reply:* We should sell OL to governments
*Thread Reply:* we may have to rebrand to ClosedLineage
*Thread Reply:* not in this way; just emit any event second time to secret NSA endpoint
*Thread Reply:* we would need to improve our stock photo game
CFP for Berlin Buzzwords went up: https://2024.berlinbuzzwords.de/call-for-papers/ Still over 3 months to submit 🙂
*Thread Reply:* thanks, updated the talks board
*Thread Reply:* https://github.com/orgs/OpenLineage/projects/4/views/1
*Thread Reply:* I'm in, will think what to talk about and appreciate any advice 🙂
just searching for OpenLineage in the Datahub code base. They have an “interesting” approach? https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd[…]odules/airflow-plugin/src/datahub_airflow_plugin/_extractors.py
*Thread Reply:* It looks like the datahub airflow plugin uses OL, but turns it off:
https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
> `disable_openlineage_plugin` | true | Disable the OpenLineage plugin to avoid duplicative processing.
They reuse the extractors but then “patch” the behavior.
*Thread Reply:* Of course this approach will need changing again with AF 2.7
*Thread Reply:* It looks like we can possibly learn from their approach in SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction
*Thread Reply:* what's that approach? I only know they have been claiming best SQL parsing capabilities
*Thread Reply:* I haven’t looked in the details but I’m assuming it is in this repo. (my comment is entirely based on the claim here)
*Thread Reply:* https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql
-> The interesting difference is that in order to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.
My understanding by example: if you do `create table x as select * from y`, you need to resolve `*` to know the column-level lineage. Our approach is to do that on the client side, probably with an extra call to the database. Their approach is to do that based on the data catalog information.
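To make the `select *` point concrete, a tiny sketch (hypothetical schema, not any parser's real API) of why the schema of `y` has to come from somewhere, be it the database or a catalog:

```python
# Schema of y, fetched either from the database (client side) or from a
# data catalog (DataHub's approach). Assumed values, for illustration only.
schema_y = ["id", "name", "created_at"]

# Column-level lineage for: create table x as select * from y
column_lineage = {col: [("y", col)] for col in schema_y}
# {'id': [('y', 'id')], 'name': [('y', 'name')], 'created_at': [('y', 'created_at')]}
```

Without `schema_y`, a parser can only report table-level lineage (x depends on y).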
I’m off on vacation. See you in a week
Maybe move today's meeting earlier, since no one from the west coast is joining? @Harel Shein
*Thread Reply:* Ah! That would have been a good idea, but I can’t :(
*Thread Reply:* Do you prefer an earlier meeting tomorrow?
*Thread Reply:* maybe let's keep today's meeting then
The full project history is now available at https://openlineage.github.io/slack-archives/. Check it out!
*Thread Reply:* tfw you thought the scrollback was gone 😳
*Thread Reply:* slack has a good activation story, I wonder how much longer they can keep this up for
*Thread Reply:* always nice to be reminded that there are no actual incremental costs on their end
*Thread Reply:* I guess it’s the difference between storing your data in memory vs. on a glacier 🧊
*Thread Reply:* ah yes surely there is some tiering going on
*Thread Reply:* i might get to marquez slack/PRs today, but most likely tmr morning
*Thread Reply:* If you’re looking for priorities, it would be really great if you could give feedback on one of @Paweł Leszczyński streaming support PRs today
*Thread Reply:* ok, I’ll get to the streaming PR first
*Thread Reply:* FYI, the namespace filtering is a good idea, just needs some feedback on impl / naming
Jens would like to know if there’s anything we want included in the welcome portion of the slide deck. Suggestions? (Aside from the usual links)
@Paweł Leszczyński I reviewed your PR today (mainly the logic on versioning for streaming jobs); here is the main versioning limitation for jobs: a new `JobVersion` is created only when a job run completes or fails (or is in the done state); that is, we don't know if we have received all the input/output datasets, so we hold off on creating a new job version until we do.
For streaming, we’ll need to create a job version on start. Do we assume we have all input/output datasets associated with the streaming job? Does OpenLineage guarantee this to be the case for streaming jobs? Having versioning logic for batch vs streaming is a reasonable solution, just want to clarify
*Thread Reply:* yes, the logic adds a distinction in how to create a job version per processing type. For streaming, I think it makes more sense to create it at the beginning. Then, within other events of the same run, we need to check if the version has changed, and create a new version in that case
*Thread Reply:* would we want to use the same versioning func `Utils.newJobVersionFor()` for streaming? That is, should we assume the input/output datasets contained within the OL event to be the “current” set for the streaming job?
*Thread Reply:* that is,
2 input streams, 1 output stream (version 1)
then, 1 input stream, 2 output streams (version 2)
...
*Thread Reply:* but what about the case when the in/out streams are not present:
1 input stream, 2 output streams (version 2)
then, 1 input stream, 0 output streams (version 3)
...
*Thread Reply:* The meaning for the streaming events should be slightly different.
For batch, input and output datasets are cumulative across all the events. If we have an event with output datasets A + B, then another event with output datasets B + C, then we assume the job has output datasets A + B + C.
For streaming, we may have a streaming job that for a week was reading data from topics A + B, and then in the next week it was reading from B + C. I think this should be reflected in different job versions. Making it cumulative for jobs that run for several weeks does not make that much sense to me. The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
This part is missing in the PR, and our Flink integration always sends all the input & output datasets. I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets. However, I can't see any clean and generic solution to this.
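A minimal sketch of the decision rule being discussed (illustrative Python, not the PR's actual code):

```python
def needs_new_job_version(current_io, event_inputs, event_outputs):
    """current_io is the (inputs, outputs) pair of the latest job version."""
    # Events carrying no datasets at all (e.g. only a bytes-read metric)
    # never trigger a new version.
    if not event_inputs and not event_outputs:
        return False
    # Otherwise, a new version is needed only when the I/O set itself changed,
    # e.g. topics A + B last week vs. B + C this week.
    return (frozenset(event_inputs), frozenset(event_outputs)) != current_io
```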
*Thread Reply:* > The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
We can view the bytes read as additional metadata about the job's inputs/outputs that wouldn't trigger a new version (for the job or dataset). I would associate the bytes with the current dataset version and sum them up (I've read `X` bytes from dataset version `D`); you can also view tags in a similar way. In our current versioning logic for datasets, we create a new dataset version when a job completes. I think we'll want to do something similar for streaming jobs; that is, when `X` bytes are written to a given dataset, that would trigger a new version.
*Thread Reply:* > I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets
Yes, if no in/out datasets are present, then I wouldn't create a new job version. @Julien Le Dem opened an issue a while back about this: https://github.com/MarquezProject/marquez/issues/1513. That is, there's a difference between an empty set `[]` and `null`.
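The `[]` vs `null` distinction in Python terms, as a sketch of the semantics (not Marquez code):

```python
event = {"eventType": "RUNNING", "inputs": None}  # hypothetical OL event

inputs = event.get("inputs")
if inputs is None:
    ...  # null: inputs were not reported, keep the current job version
elif inputs == []:
    ...  # []: the job is known to have no inputs, which IS versioning info
```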
*Thread Reply:* > This part is missing in the PR and our Flink integration always sends all the input & output datasets
This is very important to note in the code and/or API docs
*Thread Reply:* Sure we should. Just wanted to make sure this is the way we want to go.
*Thread Reply:* @Willy Lulciuc did you have a chance to look at this as well: https://github.com/MarquezProject/marquez/pull/2654? This should be merged before streaming support, I believe.
*Thread Reply:* ahh sorry, I hadn’t realized they were related / dependent on one another. sure I’ll give the PR a pass
*Thread Reply:* I looked into your comments and found them, as always, really useful. I introduced changes based on most of them. Please take a look at my responses within the `Job` model class. I think there is one issue we still need to discuss.
What to do with the existing `type` field? I would opt for deprecating it, as within the introduced job facet, the notion of `jobType` stands for `QUERY|COMMAND|DAG|TASK|JOB|MODEL`, while `processingType` determines whether a job is batch or streaming.
One solution I see is deprecating `type` and introducing a `JobLabels` class as a property within `Job`, with fields like `jobType`, `processingType`, `integration`.
Another would be to send `processingType` within the existing `type` field. This would mimic the existing API, but require further work. The disadvantage is that we still have a mismatch between the job type in Marquez and the OpenLineage spec.
I would opt for (2), but (1) works for me as well.
I’m working on a redesign of the Ecosystem page for a more modern, user-friendly layout. It’s a work in progress, and feedback is welcome: https://github.com/OpenLineage/docs/pull/258.
Can someone count the folks in the room please? Can’t see anyone other than the speaker
@Michael Robinson can you hear the questions?
*Thread Reply:* I could hear all but one of the questions after the first talk
*Thread Reply:* Oh then it's better than I thought
I just had a lovely conversation at reinvent with the CTO of dbt, Connor, and didn’t even know it was him until the end 🤯
Congrats on a great event!
*Thread Reply:* Yeah it was pretty nice 🙂 A lot of good discussions with Google people. Also Jarek Potiuk was there
*Thread Reply:* I think it won't be the last Warsaw OpenLineage meetup
https://openlineage.slack.com/archives/C01CK9T7HKR/p1701288000527449 putting it here. I don't feel like I'm the best person to answer, but I feel like the operational lineage we're trying to provide is the thing
created a project for Blog post ideas: https://github.com/orgs/OpenLineage/projects/5/views/1
Release update: we're due for an OpenLineage release and overdue for a Marquez release. As tomorrow, the first, is a Friday, we should wait until Monday at the earliest. I'm planning to open a vote for an OL release then, but the Marquez build is red, so I'm holding off on a Marquez release for the time being.
*Thread Reply:* I can address the red CI status, it's bc we're seeing issues publishing our snapshots
*Thread Reply:* I think we should release Marquez on Mon. as well
*Thread Reply:* I want to get this https://github.com/OpenLineage/OpenLineage/pull/2284 into OL release
it would be interesting to use this comparison as a learning opportunity (improve docs, etc.): https://blog.singleorigin.tech/race-to-the-finish-line-age/
or rather, use the format for comparing OL with other tools 😉
*Thread Reply:* It would be nice to have something like this (I would want it to be a little more even-handed, though). It will be interesting to see if they will ever update this now that there’s automated lineage from Airflow supported by OL
Review needed of the newsletter section on Airflow Provider progress @Jakub Dardziński @Maciej Obuchowski when you have a moment. It will ship by 5 PM ET today, fyi. Already shared it with you. Thanks!
*Thread Reply:* Thanks @Jakub Dardziński
They finally uploaded the OpenLineage Airflow Summit videos to the Airflow channel on YT: https://www.youtube.com/@ApacheAirflow/videos
On Monday I’m meeting with someone at Confluent about organizing a meetup in London in January. I’m thinking I’ll suggest Jan. 24 or 31 as mid-week days work better and folks need time to come back from the vacation. If you have thoughts on this, would you please let me know by 10:00 am ET on Monday? Also, standup will be happening before the meeting — perhaps we can discuss it then. @Harel Shein
*Thread Reply:* Confluent says January 31st will work for them for a London meetup, and they’ll be providing a speaker as well. Is it safe to firm this up with them?
*Thread Reply:* I'd say yes; eventually, if Maciej doesn't get a new passport by then, I can speak
*Thread Reply:* I already got the photos 😂
*Thread Reply:* you gotta share them
*Thread Reply:* Also apparently it's possible to get temporary passport at airport in 15 minutes
*Thread Reply:* How civilized...
*Thread Reply:* you can get it at the Warsaw airport just like a last-minute passport, and it costs barely anything (30 PLN, which is ~7-8 USD)
*Thread Reply:* yeah, many people are surprised how developed our public service may be
*Thread Reply:* tbh it's always random, can be good can be shit 🙂
*Thread Reply:* lately it's definitely been better than 10 years ago tho
https://blog.datahubproject.io/extracting-column-level-lineage-from-sql-779b8ce17567 https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html
*Thread Reply:* [6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information. 🙂
*Thread Reply:* I’m not sure about the methodology, but these numbers are pretty significant
*Thread Reply:* We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶
*Thread Reply:* "More doctors smoke Camels than any other cigarette" 😉 If you test on BigQuery, you will not get comparable results for Snowflake, for example.
Wondering if we can do anything about this. We could write a blog post on lineage extraction from Snowflake SQL queries. This is something we spent time on, and possibly we support dialect-specific queries that others don't.
*Thread Reply:* it all comes down to the question of whether we should start publishing comparisons
*Thread Reply:* We can also accept schema information in our SQL lineage parser. Actually, this would have been a good idea, I believe.
*Thread Reply:* for the `select *` use-case?
Release vote is here when you get a moment: https://openlineage.slack.com/archives/C01CK9T7HKR/p1701722066253149
Should we disable `openlineage-airflow` on Airflow 2.8 to force people to use the provider?
*Thread Reply:* it sounds like maybe something about this should be included in the 2.8 docs. The dev rel team is talking about the marketing around 2.8 right now…
*Thread Reply:* also, the release will evidently be out next Thursday
*Thread Reply:* I mean, `openlineage-airflow` is not part of Airflow
*Thread Reply:* We'd have provider for 2.8
*Thread Reply:* so maybe the airflow newsletter would be better
*Thread Reply:* is there anything about the provider that should be in the 2.8 marketing?
*Thread Reply:* I don't think so
*Thread Reply:* Kenten wants to mention that it will be turned off in the 2.8 docs, so please lmk if anything about this changes
https://github.com/OpenLineage/docs/pull/263
Changelog PR for 1.6.0: https://github.com/OpenLineage/OpenLineage/pull/2298
*Thread Reply:* that's weird that ruff-lint found issues, especially when it has the ruff version pinned
*Thread Reply:* `CHANGELOG.md:10: acccording ==> according`
this change is accurate though 🙂
*Thread Reply:* I tried to sneak in a fix in dev but the linter didn’t like it so I changed it back. All set now
*Thread Reply:* The release is in progress
*Thread Reply:* ah, gotcha
```
dev/get_changes.py:49:17: E722 Do not use bare `except`
dev/get_changes.py:49:17: S112 `try`-`except`-`continue` detected, consider logging the exception
```
for next time just add `except Exception:` instead of `except:` 🙂
*Thread Reply:* GTK, thank you
The `release-integration-flink` job failed with this error message:
```
Execution failed for task ':examples:stateful:compileJava'.
> Could not resolve all files for configuration ':examples:stateful:compileClasspath'.
   > Could not find io.**********************:**********************_java:1.6.0-SNAPSHOT.
     Required by:
         project :examples:stateful
```
*Thread Reply:*
```
No cache is found for key: v1-release-client-java--rOhZzScpK7x+jzwfqkQVwOVgqXO91M7VEEtzYHNvSmY=
Found a cache from build 155811 at v1-release-client-java-
```
is this standard behaviour?
*Thread Reply:* well, same happened for 1.5.0 and it worked
*Thread Reply:* we gotta wait for Maciej/Pawel :<
*Thread Reply:* Looks like Gradle version got bumped and gives some problems
*Thread Reply:* Think we can release by midday tomorrow?
*Thread Reply:* oh forgot about this totally
Feedback sought on a redesign of the ecosystem page that (hopefully) freshens and modernizes the page: https://github.com/OpenLineage/docs/pull/258
Changelog PR for 1.6.1: https://github.com/OpenLineage/OpenLineage/pull/2301
*Thread Reply:* @Maciej Obuchowski the flink job failed again
*Thread Reply:* well, at least it's a different error
*Thread Reply:* one more try? https://github.com/OpenLineage/OpenLineage/pull/2302 @Michael Robinson
*Thread Reply:* 1.6.2 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2304
*Thread Reply:* @Maciej Obuchowski 👆
*Thread Reply:* going out for a few hours, so next try would be tomorrow if it fails again...
*Thread Reply:* Thanks, Maciej. That worked, and 1.6.2 is out.
Starting a thread for collaboration on the community meeting next week
*Thread Reply:* Releases: 1.6.2
*Thread Reply:* 2023 recap/“best-of”?
*Thread Reply:* @Harel Shein any thoughts? Also, does anyone know if Julien will be back from vacation?
*Thread Reply:* We should probably try to do something with the Google proposal
*Thread Reply:* Not sure if it needs additional discussion, maybe just implementation?
*Thread Reply:* I can ask him, but it would probably be good if you could facilitate next week @Michael Robinson?
*Thread Reply:* I agree that we need to address those Google proposals, we should ask Jens if he’s up for presenting and discussing them first?
*Thread Reply:* maybe Pawel wants to present progress with https://github.com/OpenLineage/OpenLineage/issues/2162?
*Thread Reply:* Still waiting on a response from Jens
*Thread Reply:* I think Jens does not have a lot of time now
*Thread Reply:* Emailed him in case he didn’t see the message
*Thread Reply:* Jens confirmed
*Thread Reply:* He will have to join about 15 minutes late
*Thread Reply:* would love to come but I'm at friend's birthday at that time 😐
*Thread Reply:* I'd love to as well, but I have dinner plans 😕
*Thread Reply:* count me in if not too late
@Paweł Leszczyński mind giving this PR a quick look? https://github.com/MarquezProject/marquez/pull/2700 … it’s a dep on https://github.com/MarquezProject/marquez/pull/2698
*Thread Reply:* thanks @Paweł Leszczyński for the +1 ❤️
@Jakub Dardziński: In Marquez, metrics are exposed via the `/metrics` endpoint using Prometheus (most of the custom metrics defined are here). Oddly enough, the Prometheus roadmap states that they have yet to adopt OpenMetrics! But you can backfill the metrics into Prometheus. So, knowing this, I would move to using Metrics Core by Dropwizard and use an exporter to export metrics to Datadog using metrics-datadog. The one major benefit here is that we can define a framework around defining custom metrics internally within Marquez using core Dropwizard libraries, and then enable the reporter via configuration to emit metrics in `marquez.yml`. For example:
```yaml
metrics:
  frequency: 1 minute  # Default is 1 second.
  reporters:
    - type: datadog
      # ...
```
*Thread Reply:* I tested this actually and it works. The only thing is traces: I found it very poor to just have metrics around function names
*Thread Reply:* I totally agree, although I feel metrics and tracing are two separate things here
*Thread Reply:* I really appreciate your help and advice! 🙂
*Thread Reply:* Of course, happy to chime in here
*Thread Reply:* I’m just happy this is getting some much needed love 😄
*Thread Reply:* Also, it seems like Datadog uses OpenTelemetry:
> Datadog Distributed Tracing allows you to easily ingest traces via the Datadog libraries and agent or via OpenTelemetry
And it looks like OpenTelemetry has support for Dropwizard
*Thread Reply:* yep, that's why I liked otel idea
*Thread Reply:* Also, here are the docs for DD + OpenTelemetry … so enabling OpenTelemetry in Marquez would be doable
*Thread Reply:* and we can make all of this configurable via `marquez.yml`
*Thread Reply:* hit me up with any questions! (just know, there will be a delay)
*Thread Reply:* > and we can make all of this configurable via `marquez.yml`
it ain't that easy - we would need to build an extended jar with the OTEL agent, which I think is way too much work compared to the benefits. you can still configure via env vars or system properties
I’ve been looking into partitioning for psql, think there’s potential here for huge perf gains. Anyone have experience?
*Thread Reply:* partition ranges will give a boost by default
*Thread Reply:* Which tables do you want to partition? Event ones?
*Thread Reply:* • runs
• job_versions
• dataset_versions
• lineage_events
• and all the facets tables
@Paweł Leszczyński • PR 2682 approved with minor comments on stream versioning logic / suggestions ✅ • PR 2654 approved with minor comment (we’ll want to do a follow up analysis on the query perf improvements) ✅
*Thread Reply:* Thanks @Willy Lulciuc. I applied all the recent comments and merged 2654.
There is one discussion left in 2682, which I would like to resolve before merging. I added an extra comment on the implemented approach, and I am open to hearing whether this is the approach we can go with.
@Julien Le Dem @Maciej Obuchowski discussion is about when to create a new job version for a streaming job. No deep dive in the code is required to take part in it. https://github.com/MarquezProject/marquez/pull/2682#discussion_r1425108745
*Thread Reply:* awesome, left my final thoughts 👍
Maybe we should clarify the documentation on adding custom facets at the integration level? Wdyt? https://openlineage.slack.com/archives/C01CK9T7HKR/p1702446541936589?thread_ts=1702033180.635339&channel=C01CK9T7HKR&message_ts=1702446541.936589
Hey, I think it would help some people using the Airflow integration (with Airflow 2.6) if we released a patch version of the OL package with PR #2305 included. I am not sure what the release cycle is here, but maybe there is already an ETA for the next patch release? If so, please let me know 🙂 Thanks!
*Thread Reply:* you gotta ask for the release in #general, 3 votes from committers approve immediate release 🙂
*Thread Reply:* @Michael Robinson 3 votes are in 🙂
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1702474416084989
*Thread Reply:* Thanks for the ping. I replied in #general and will initiate the release as soon as possible.
seems that we don’t output the correct namespace as in the naming doc for Kafka. we output the kafka server/broker URL as namespace (in the Flink integration specifically) https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#kafka
*Thread Reply:* @Paweł Leszczyński, would you be able to add the `kafka://` prefix to the Kafka visitors in the flink integration tomorrow?
*Thread Reply:* I am happy to do this. Just to make sure: the docs are correct, and the Flink implementation is missing the `kafka://` prefix, right?
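For reference, my reading of the convention in the naming doc, with a made-up broker and topic:

```python
# The namespace carries the kafka:// scheme plus the bootstrap server;
# the dataset name is the topic. Both values here are hypothetical.
namespace = "kafka://prod-broker1.example.com:9092"
name = "orders"
```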
*Thread Reply:* Thanks @Paweł Leszczyński. made a couple of suggestions, but we can def merge without
*Thread Reply:* would love to discuss this first. If a user stores an Iceberg table in S3, then should it conform to S3 naming or Iceberg naming?
it's the S3 location which defines a dataset. Iceberg is a format for accessing data, but not an identifier as such.
*Thread Reply:* No rush, just something we noticed and that some people in the community are implementing their own patch for it.
my goal for next year is to have a programmatic way of using the naming convention
*Thread Reply:* nope but would be worth reaching out to them to see how we could collaborate? they’re part of the LFAI (sandbox): https://github.com/bitol-io/open-data-contract-standard
*Thread Reply:* background https://medium.com/profitoptics/data-contract-101-568a9adbf9a9
*Thread Reply:* We should still have a conversation:)
If you want to join the conversation on Ray.io integration: https://join.slack.com/t/ray-distributed/shared_invite/zt-2635sz8uo-VW076XU6bKMEiFPCJWr65Q
*Thread Reply:* is there any specific channel/conversation?
*Thread Reply:* Yeah, but it’s private. Added you. For everyone else, Ping me on slack when you join and I’ll add you.
A vote to release Marquez 0.43.0 is open. We need one more: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1702657403267769
*Thread Reply:* the changelog PR is RFR
AWS is making moves! https://github.com/aws-samples/aws-mwaa-openlineage
*Thread Reply:* the repo itself is pretty old, last updated 2mo ago and used OL package not provider (1.4.1)
*Thread Reply:* still it's nice they're doing this :)
*Thread Reply:* since they're using MWAA they won't be affected by the turn-off coming with Airflow 2.8 for a while. Otherwise that would be a good excuse to get in touch with them
*Thread Reply:* I think this repo was related to the blog which was authored a while back - https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/ No other moves from our end so far, at least from the MWAA team :)
*Thread Reply:* Hi All, I am one of the owners of this repo and working to update this to work with MWAA 2.8.1, with apache-airflow-providers-openlineage==1.4.0. I am facing an issue with my set-up. I am using Redshift SQL as a sample use-case for this and getting an error relating to the Default Extractor. Haven't really looked at this in much detail yet but wondering if you have thoughts? I just updated the env variables to use `AIRFLOW__OPENLINEAGE__TRANSPORT` and `AIRFLOW__OPENLINEAGE__NAMESPACE` and changed the operator from PostgresOperator to SQLExecuteQueryOperator.
```
[2024-03-07 03:52:55,496] Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.base.DefaultExtractor object at 0x7fc4aa1e3950> - section/key [openlineage/disabled_for_operators] not found in config task_type=SQLExecuteQueryOperator airflow_dag_id=rs_source_to_staging task_id=task_insert_event_data airflow_run_id=manual__2024-03-07T03:52:11.634313+00:00
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,499] Executing:
insert into event
SELECT eventid, venueid, catid, dateid, eventname, starttime::TIMESTAMP
FROM s3_datalake.event;
```
*Thread Reply:* @Paul Wilson Villena It looks like a small mistake in OL that I'll fix in the next version - we missed adding a fallback there, and getting the Airflow configuration raises an error when `disabled_for_operators` is not defined in the airflow.cfg file / the env variable. For now it should help to simply add the `[openlineage]` section (see the configuration reference: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/1.4.0/configurations-ref.html#id1) to `airflow.cfg` and set `disabled_for_operators=""`, or just export `AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""`
*Thread Reply:* Will be released in the next provider version: https://github.com/apache/airflow/pull/37994
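The gist of the fix, as a sketch (the real change is in the linked PR; `conf.get` with a `fallback` kwarg is standard Airflow config API):

```python
from airflow.configuration import conf

# With a fallback, a missing [openlineage] key no longer raises
# "section/key ... not found in config".
disabled = conf.get("openlineage", "disabled_for_operators", fallback="")
```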
*Thread Reply:* Hi @Kacper Muda it seems I need to also set this, otherwise the error `section/key [openlineage/config_path] not found in config` persists:
```python
os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""
```
*Thread Reply:* Yes, sorry for missing that. I fixed it in the code and forgot to mention it. If you were to not use `AIRFLOW__OPENLINEAGE__TRANSPORT`, you'd have to set it to an empty string as well, as it's missing the same fallback 🙂
*Thread Reply:* @Paul Wilson Villena FYI, `apache-airflow-providers-openlineage==1.7.0` has just been released, containing the fix to that problem 🙂
The release is finished. Slack post, etc., coming soon
have we thought of making the SQL parser pluggable?
*Thread Reply:* what do you mean by that?
*Thread Reply:* (this is coming from Apple) like, what if a user wanted to provide their own parser for SQL in place of the one shipped with our integrations
*Thread Reply:* for example, if/when we integrate with DataHub, can they use their parser instead of the one provided
*Thread Reply:* that would be difficult, we would need strong motivation for that 🫥
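If it were ever pursued, one plausible shape, purely hypothetical (nothing like this exists in the integrations today), would be resolving the parser through an entry point instead of a hard-coded import (requires Python 3.10+ for the `group=` kwarg):

```python
from importlib.metadata import entry_points
from typing import Protocol


class SqlParser(Protocol):
    def parse(self, sql: str, dialect: str | None = None) -> object: ...


def load_sql_parser(group: str = "openlineage.sql_parser") -> SqlParser:
    # The entry-point group name is made up for this sketch; a registered
    # third-party parser (e.g. DataHub's) would win over the built-in one.
    eps = list(entry_points(group=group))
    if not eps:
        raise LookupError("no pluggable SQL parser registered")
    return eps[0].load()
```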
This question fits with what we said we would try to document more, can someone help them out with it this week? https://openlineage.slack.com/archives/C063PLL312R/p1702683569726449
Airflow 2.8 has been released. Are we still “turning off” the external Airflow integration with this one? What do Airflow users need to know to avoid unpleasant surprises? Kenten is open to including a note in the 2.8 blog post.
*Thread Reply:* As a newcomer here, I believe it would be wise to avoid supporting Airflow 2.8+ in the `openlineage-airflow` package. This approach would encourage users to transition to the provider package. It's important to clearly communicate that ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, while `openlineage-airflow` will primarily be updated for bug fixes. I'll look into whether this strategy is already noted in the documentation. If not, I will propose a documentation update.
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2330 Please let me know if any changes are required; I was not sure how to properly implement it.
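For the general idea, a guard of this shape would do it, as a sketch (not necessarily what the PR implements):

```python
from packaging.version import Version

from airflow.version import version as AIRFLOW_VERSION

# Refuse to activate the legacy integration on Airflow 2.8+, pointing
# users at the provider package instead.
if Version(AIRFLOW_VERSION) >= Version("2.8.0"):
    raise RuntimeError(
        "openlineage-airflow does not support Airflow 2.8+; "
        "use apache-airflow-providers-openlineage instead."
    )
```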
This looks cool, might be useful for us? https://github.com/aklivity/zilla
*Thread Reply:* the license is a bit weird, but should be ok for us. it’s apache, unless you directly compete with the company that built it.
*Thread Reply:* tbh not sure how
*Thread Reply:* I think we should be focused on 1) being compatible with most popular solutions (kafka...) 2) being easy to integrate with (custom transports)
rather than forcing our opinionated way on how OpenLineage events should flow in customer architecture
Apologies for having to miss today’s committer sync — I’ll be picking up my daughter from school
WDYT about starting to add integration specific channels and adding a little welcome bot for people when they join?
*Thread Reply:* the `-dev` and `-users` split seems like overkill, but I also understand that we may want to split user questions from development
*Thread Reply:* maybe just shorten to `spark-integration`, `flink-integration`, etc. Or `integrations-spark`, etc.
*Thread Reply:* we probably should consider a `development` and a `welcome` channel
*Thread Reply:* yeah.. makes sense to me. let’s leave this thread open for a few days so more people can chime in and then I’ll make a proposal based on that.
*Thread Reply:* makes sense to me
*Thread Reply:* I think there is not enough discussion for it to make sense
*Thread Reply:* and empty channels do not invite to discussion
*Thread Reply:* Maybe worth it for spark questions alone? And then for equal coverage we need the others. It’s getting easy to overlook questions in general due to the increased volume and long code snippets, IMO.
*Thread Reply:* yeah I think the volume is still quite low
*Thread Reply:* something like Airflow's #troubleshooting channel easily has order of magnitude more messages
*Thread Reply:* and even then, I'd split between something like #troubleshooting and #development rather than between integrations
*Thread Reply:* not only because it's too granular, but also there's a development that isn't strictly integration related or touches multiple ones
Link to the vote to release the hot fix in Marquez: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1703101476368589
For the newsletter this time around, I’m thinking that a year-end review issue might be nice in mid-January when folks are back from vacation. And then a “double issue” at the end of January with the usual updates. We’ve still got a rather, um, “select” readership, so the stakes are low. If you have an opinion, please lmk.
*Thread Reply:* I’m for mid-January option
1.7.0 changelog PR needs a review: https://github.com/OpenLineage/OpenLineage/pull/2331
Notice for the release notes (WDYT?):
COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0.
Please use the OpenLineage Airflow Provider instead.
It includes a link to here: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html
https://eu.communityovercode.org/ is that a proper conference to talk about OL?
*Thread Reply:* Yes! At least for a broader audience
Happy new year all!
Am I the only one that sees `Free trial in progress` here in the OL slack?
*Thread Reply:* same for me, I think Slack initiated that
*Thread Reply:* we’re on the Slack Pro trial (we were on the free plan before)
*Thread Reply:* I think Slack initiated it
https://github.com/OpenLineage/OpenLineage/issues/2349 - this issue is really interesting. I am hoping to see follow-up from David.
So who wants to speak at our meetup with Confluent in London on Jan. 31st?
*Thread Reply:* do we have sponsorship to fly over?
*Thread Reply:* Not currently but I can look into it
*Thread Reply:* do we have other active community members based in the UK?
*Thread Reply:* I’ll ask around
*Thread Reply:* Abdallah Terrab at Decathlon has volunteered
*Thread Reply:* does it mean we're still looking for someone?
*Thread Reply:* I already said last month that I'll go, but I'm not sure if it's still needed
*Thread Reply:* and does Confluent have a talk there?
*Thread Reply:* Sorry about that, Maciej. I’ll ask Viraj if Astronomer would cover your ticket. There will be a Confluent speaker.
*Thread Reply:* if we need to choose between kafka summit and a meetup - I think we should go for kafka summit 🙂
*Thread Reply:* I think so too
*Thread Reply:* Viraj has requested approval for summit, and we can expect to hear from finance soon
*Thread Reply:* Also, question from him about the talk: what does “streaming” refer to in the title — Kafka only?
*Thread Reply:* Kafka, Spark, Flink
*Thread Reply:* If someone wants to do a joint talk let me know 😉
*Thread Reply:* @Willy Lulciuc will you be in UK then?
*Thread Reply:* I can be, if astronomer approves? but also realizing it’s Jan 31st, so a bit tight
*Thread Reply:* yeah, that sounds… unlikely 🙂
*Thread Reply:* also thinking about all the traveling ive been doing recently and things I need to work on. would be great to have some focus time
did anyone submit a talk to https://www.databricks.com/dataaisummit/call-for-presentations/?
*Thread Reply:* tagging @Julien Le Dem on this one. since the deadline is tomorrow.
*Thread Reply:* I don’t have my computer with me. ⛷️
*Thread Reply:* Does @Willy Lulciuc want to submit? Happy to be co-speaker (if you want. But not necessary)
*Thread Reply:* Willy is also on vacation, I’m happy to submit for the both of us. I’ll try to get something out today
*Thread Reply:* would be great to have someting on Databricks conference 🙂
*Thread Reply:* what I’m currently thinking. learnings from openlineage adoption in Airflow and Flink, and what can be learned / applied on Spark lineage.
This month’s TSC meeting is next Thursday. Anyone have any items to add to the agenda?
*Thread Reply:* @Kacper Muda would you want to talk about doc changes in Airflow provider maybe?
*Thread Reply:* no pressure if it's too late for you 🙂
*Thread Reply:* It's fine, I could probably mention something about it - the problem is that I have a standing commitment every Thursday from 5:30 to 7:30 PM (Polish time, GMT+1) which means I'm unable to participate. 😞
*Thread Reply:* @Kacper Muda would you be open to recording something? We could play it during the meeting. Something to consider if you’d like to participate but the time doesn’t work.
*Thread Reply:* Let me see how much time I'll have during the weekend and I'll come back to you 🙂
*Thread Reply:* Sorry, I got sick and won't be able to do it. Maybe i'll try to make it personally to the next meeting, then the docs should already be released 🙂
*Thread Reply:* @Julien Le Dem are there updates on open proposals that you would like to cover?
*Thread Reply:* @Paweł Leszczyński as you authored about half of the changes in 1.7.0, would you be willing to talk about the Flink fixes? No slides necessary
*Thread Reply:* @Michael Robinson no updates from me
*Thread Reply:* sry @Michael Robinson, I won't be able to join this time. My changes in 1.7.0 are rather small fixes. Perhaps someone else can present them shortly.
I saw some weird behavior with `openlineage-airflow` where it does not respect the transport config for the client, even when setting `OPENLINEAGE_CONFIG` to point to a config file.
the workaround is that if you set the `OPENLINEAGE_URL` env var, it will pick that up and read the config.
this bug doesn't seem to exist in the airflow provider since the loading method is completely different.
*Thread Reply:* would you mind creating an issue?
*Thread Reply:* will do. let me repro on the latest version of `openlineage-airflow` and see if I can repro on the provider
*Thread Reply:* config is definitely more complex than it needs to be...
*Thread Reply:* hmm.. so in 1.7.0, if you define `OPENLINEAGE_URL` then it completely ignores whatever is in the `OPENLINEAGE_CONFIG` yaml
*Thread Reply:* if you don't define `OPENLINEAGE_URL`, and you do define `OPENLINEAGE_CONFIG`: then openlineage is disabled 😂
*Thread Reply:* are you sure `OPENLINEAGE_CONFIG` points to something valid?
*Thread Reply:* ~look at the code flow for create:~ ~the default factory doesn’t supply config, so it tried to set http config from env vars. if that doesn’t work, it just returns the console transport~
*Thread Reply:* oh, no nvm. it’s a different flow.
*Thread Reply:* but yes, `OPENLINEAGE_CONFIG` works for sure
*Thread Reply:* on 0.21.1, it was working when `OPENLINEAGE_URL` was supplied
*Thread Reply:*
```python
transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
```
this `self.config` actually looks at
```python
@property
def config(self) -> dict[str, Any]:
    if self._config is None:
        self._config = load_config()
    return self._config
```
which then uses this:
```python
def load_config() -> dict[str, Any]:
    file = _find_yaml()
    if file:
        try:
            with open(file) as f:
                config: dict[str, Any] = yaml.safe_load(f)
                return config
        except Exception:  # noqa: BLE001, S110
            # Just move to read env vars
            pass
    return defaultdict(dict)


def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system

    # Check current working directory:
    try:
        cwd = os.getcwd()
        if "openlineage.yml" in os.listdir(cwd):
            return os.path.join(cwd, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        pass  # We can get different errors depending on system

    # Check $HOME/.openlineage dir
    try:
        path = os.path.expanduser("~/.openlineage")
        if "openlineage.yml" in os.listdir(path):
            return os.path.join(path, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        # We can get different errors depending on system
        pass
    return None
```
*Thread Reply:* oh I think I see
*Thread Reply:* so this isn't passed if you have a config but there is no `transport` field in this config:
```python
transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
```
*Thread Reply:* here's the config I'm using:
```yaml
transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    security.protocol: "SSL"

    # CA certificate file for verifying the broker's certificate.
    ssl.ca.location=ca-cert
    # Client's certificate
    ssl.certificate.location=client_?????_client.pem
    # Client's key
    ssl.key.location=client_?????_client.key
    # Key password, if any.
    ssl.key.password=abcdefgh

  topic: "SOF0002248-afaas-lineage-DEV-airflow-lineage"
  flush: True
```
*Thread Reply:* it should load, but fail when actually trying to emit to kafka
*Thread Reply:* but it should still init the transport
*Thread Reply:* I'm testing on this image:
```dockerfile
FROM quay.io/astronomer/astro-runtime:6.4.0

COPY openlineage.yml /usr/local/airflow/
ENV OPENLINEAGE_CONFIG=/usr/local/airflow/openlineage.yml
ENV AIRFLOW__CORE__LOGGING_LEVEL=DEBUG
```
*Thread Reply:* are you sure there are no permission errors?
*Thread Reply:*
```python
def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system
```
it checks stuff like `os.access(path, os.R_OK)`
*Thread Reply:* I'm sure, because if I uncomment `ENV OPENLINEAGE_URL=http://foo.bar/` on 0.21.1, it works
*Thread Reply:* ah, I can add a permissive chmod to the dockerfile to see if it helps
*Thread Reply:* but I’m also not seeing any log/exception in the task logs
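Worth noting, given the `load_config()` shown earlier swallows parse errors with `except Exception: pass`: an unparsable YAML file is indistinguishable from no config at all, and nothing gets logged. A quick way to check (path taken from the Dockerfile above):

```python
import yaml

# If this raises, load_config() would have silently ignored the file.
with open("/usr/local/airflow/openlineage.yml") as f:
    print(yaml.safe_load(f))
```

(The `key=value` lines in the pasted config are librdkafka property-file syntax, not YAML mappings, so `safe_load` may well choke on them.)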
*Thread Reply:* will look at this later if you won't find solution 🙂
*Thread Reply:* one more thing, can you try with just
```yaml
transport:
  type: console
```
*Thread Reply:* but yeah, there’s something not great about separation of concerns between client config and adapter config
*Thread Reply:* the adapter should not care, unless you're using `MARQUEZ_URL`... which is backwards compatibility from when it was still the marquez airflow integration
I'm starting to put together the year-in-review issue of the newsletter and wonder if anyone has thoughts on the "big stories" of 2023 in OpenLineage. So far I've got:
• Launched the Airflow Provider
• Added static (AKA design) lineage
• Welcomed new ecosystem partners (Google, Metaphor, Grai, Datahub)
• Started meeting up and held events with Metaphor, Google, Collibra, etc.
• Graduated from the LFAI
What am I missing? Wondering in particular about features. Is Iceberg support in Flink a "big" enough story? Custom transport types? SQL parser improvements?
Blog post content about contributing to OpenLineage-Spark and code internals. The content comes from the November meetup at Google, and I split it into two posts: https://docs.google.com/document/d/1Hu6clFckse1J_M1w2MMaTTJS0wUihtFsxbDQchtTVtA/edit?usp=sharing
Does anyone remember why `execution_date` was chosen as part of the `run_id` for an Airflow task, instead of, for example, `start_date`?
Due to this decision, we can encounter a duplicate `run_id` if we delete the DagRun from the database, because the `execution_date` remains the same. If I run a backfill job for yesterday, then delete it and run it again, I get the same ids. I'm trying to understand the rationale behind this choice so we can determine whether it's a bug or a feature. 😉
*Thread Reply:* `start_date` is unreliable AFAIK, there can be no start date sometimes
*Thread Reply:* this might be true for only some versions of Airflow
*Thread Reply:* Also here, where they define a combination of some deterministic attributes, `execution_date` is used and not `start_date`, so there might be something to it. That still leaves us with the behaviour I described.
*Thread Reply:* > Due to this decision, we can encounter a duplicate `run_id` if we delete the DagRun from the database, because the `execution_date` remains the same.
Hmm, so given that the OL runID uses the same params Airflow uses to generate the hash, this seems more like a limitation. A better question would be: if a user runs, deletes, then runs the same DAG again, is that an expected scenario we should handle? tl;dr yes, but Airflow hasn't felt it important enough to address.
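An illustrative sketch of the mechanics (the deterministic hashing is the point; this is not the integration's exact formula):

```python
import uuid


def run_id(dag_id: str, task_id: str, execution_date: str) -> str:
    # Derived purely from fields that survive deleting the DagRun, so a
    # delete-and-rerun of the same execution_date reproduces the same id.
    return str(uuid.uuid3(uuid.NAMESPACE_URL, f"{dag_id}.{task_id}.{execution_date}"))


assert run_id("dag", "task", "2024-01-01T00:00:00") == run_id("dag", "task", "2024-01-01T00:00:00")
```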
*Thread Reply:* not concurrent with Databricks conference this year? 😂
*Thread Reply:* nope, a week before so that everyone goes there and gets sick and can’t attend the databricks conf on the following week
*Thread Reply:* outstanding business move again
*Thread Reply:* @Willy Lulciuc wanna submit to this?
*Thread Reply:* yeah id love to: maybe something like “Detecting Snowflake table schema changes with OpenLineage events” + use cases + using lineage to detect impact?
*Thread Reply:* yeah! that sounds like a fun talk!
*Thread Reply:* idk if @Julien Le Dem will already be in 🇫🇷 that week? but I’d be happy to co-present if not.
*Thread Reply:* I don’t know when I’m flying out yet but it will be in the middle of that time frame.
*Thread Reply:* +1 on Harel co-presenting :)
*Thread Reply:* School last day is the 4th. I need to be in France (not jet lagged) before the 8th
*Thread Reply:* ok, @Harel Shein I’ll work on getting a rough draft ready before the deadline (added to my TODO of tasks)
hey, I'm not feeling well, will probably skip today meeting
Added comment to discussion about Spark parent job issue: https://github.com/OpenLineage/OpenLineage/issues/1672#issuecomment-1883524216 I think we have the consensus so I'll work on it.
*Thread Reply:* @Maciej Obuchowski should we give an update on that at the TSC meeting tomorrow?
Another issue: do you think we should somehow handle the partitioning in OpenLineage standard and integrations? I would think of a situation where we somehow know how dataset is partitioned - not think about how to automagically detect that. Some example situations:
*Thread Reply:* interesting. did you hear anyone with this usecase/requirement?
Hey, there's a Windows user getting this error when trying to run Marquez: `org-apache-tomcat-jdbc-pool-connectionpool-unable-to-create-initial-connection`. Is it a driver issue? I'll try to get more details and reproduce it, but if you know what this probably is related to, please lmk
*Thread Reply:* do we even support Windows? 😱
*Thread Reply:* Here's more info about the use case:
> Thanks Michael, this is really helpful. I am working on a project where I need to run Marquez and OpenLineage on top of Airflow DAGs which run dbt commands internally through BashOperator. I need to present to my org whether we will benefit from bringing in Marquez metadata lineage.
> so I was taking this approach of setting up Marquez first, then will explore how it integrates with Airflow using BashOperator
*Thread Reply:* We don’t support this operator, right? What kind of graph can they expect?
*Thread Reply:* They can use dbt integration directly maybe?
*Thread Reply:* We now have a former astronomer as engineering director at DataHub
*Thread Reply:* https://www.linkedin.com/in/samantha-clark-engineer/
taking on letting users run integration tests
*Thread Reply:* so there are two issues I think
*Thread Reply:* in Airflow workflows there are:
```yaml
filters:
  branches:
    ignore: /pull\/[0-9]+/
```
which are set only for PRs from forked repos
*Thread Reply:* in Spark there are tests that strictly require env vars (that contain credentials to various external systems like Databricks or BigQuery). if there are no such env vars, the tests fail, which is confusing for new committers
*Thread Reply:* the first behaviour is silent - which I think is bad because it's easy to skip integration tests; the build is green (but should it be? we don't know, the integration tests didn't run, and someone needs to know and remember that before approving and merging)
*Thread Reply:* the second is misleading because it hints there's something wrong in the code while there doesn't necessarily need to be. on the other hand, you shouldn't approve and merge a failing build, so you see there's some action required
*Thread Reply:* reg. action required: for now we've been running a manual step (using https://github.com/jklukas/git-push-fork-to-upstream-branch) which is a workaround, but it's not straightforward and requires manual work. it also doesn't solve the two issues I mentioned above
*Thread Reply:* what I propose is to simply add an approval step before integration tests: https://circleci.com/docs/workflows/#holding-a-workflow-for-a-manual-approval
it's a circleCI-only thing, so you need to log into circleCI, check if there's any pending task to approve, and then approve or not
*Thread Reply:* it doesn't allow for much configuration, but I think it would work. you also can't integrate it in any way with the GitHub UI (e.g. there's no option to click something in the PR's UI to approve)
*Thread Reply:* but that would let project maintainers manage when the code is safe to run, and it's still visible and (I think) readable for everyone
*Thread Reply:* the only thing I'm not sure about is who can approve:
> Anyone who has push access to the repository can click the **Approval** button to continue the workflow
but I'm not sure to which repo. if someone runs on a fork and has push access to the fork - can they approve? it wouldn't make sense..
*Thread Reply:* https://circleci.com/blog/access-control-cicd/
that's the best I could find from circleCI on this subject
*Thread Reply:* so I think the best solution would be to enable "Pass secrets to builds from forked pull requests" (requires careful review of the CI process)
*Thread Reply:* contexts give the possibility to let users run e.g. unit tests in CI without exposing credentials
*Thread Reply:* this approach makes sense to me, assuming the permission model is how you outlined it
*Thread Reply:* one thing to add and test: the approval step could have a condition to always run if it's not from a fork. not sure if that's possible
*Thread Reply:* that sounds like it should be doable within what's available in Circle. GHA can definitely do that
*Thread Reply:* GHA can do things that circleCI can’t 😂
*Thread Reply:* > approval steps could have a condition to always run if it's not from a fork. not sure if that's possible
ffs, it's not that easy to set up
*Thread Reply:* whenever I touch circleCI base changes I feel like a magician. yet, here goes the PR with the magic (just look at the side effect, it bothered me for so long 🪄) 🚨 🚨 🚨 https://github.com/OpenLineage/OpenLineage/pull/2374
*Thread Reply:* I'm assuming the reason for the speedup in `determine_changed_modules` is that we don't go install `yq` every time this runs?
What are your nominees/candidates for the most important releases of 2023? I’ll start (feel free to disagree with my choices, btw): • 1.0.0 • 1.7.0 (disabled the external Airflow integration for 2.8+) • 0.26.0 (Fluentd) • 0.19.2 (column lineage in SQL parser) • …
*Thread Reply:* • 1.7.0 (disabled the external Airflow integration for 2.8+) that doesn’t sound like one of the most important
*Thread Reply:* 1.0.0 had this: https://openlineage.io/docs/releases/1_0_0 which actually fixed the spec to match the JSON schema spec
@Gowthaman Chinnathambi has joined the channel
1.8 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2380
FYI: I'm trying to set up our meetings with the new LFAI tooling, you may get some emails. you can ignore for now.
@Krishnaraj Raol has joined the channel
@Harel Shein @Julien Le Dem @tati Python client releases on both Marquez and OpenLineage are failing because PyPI no longer supports password authentication. We need to configure the projects for Trusted Publishers or use an API token. I've looked and can't find OpenLineage credentials for PyPI, but if I had them we'd still need to set up 2FA in order to make the change. How should we proceed here? Should we sync to sort this out? Thanks (btw I reached out to Willy separately when this came up during a Marquez release attempt last week)
*Thread Reply:* ah! I can look into it now.
*Thread Reply:* I'm failing to find the credentials for PyPI anywhere.
*Thread Reply:* @Maciej Obuchowski any ideas? (git blame shows you wrote that part, and @Michael Collado did some circleCI setup at some point)
*Thread Reply:* I just sent a reset password email to whoever registered for the openlineage
user
*Thread Reply:* Thanks @Harel Shein. Can confirm I didn’t get it
*Thread Reply:* The password should be in the CircleCI context right?
*Thread Reply:* yes, it's there. I was trying to avoid writing a job that prints it out
*Thread Reply:* that will be the fallback if no one responds 🙂
*Thread Reply:* alright, I've setup 2FA and added a few more emails to the PyPI account as fallback.
*Thread Reply:* unfortunately, there's only one Trusted Publisher option for PyPI, which is GH Actions, so we'll have to use the API token route. PR incoming soon
*Thread Reply:* didn't need to make any changes. I updated the circle context and re-ran the PyPI release - we're back to 🟢
*Thread Reply:* ^ @Michael Robinson FYI
*Thread Reply:* thank you @Harel Shein. Releasing the jars now. Everything looks good
It looks like my laptop bit the dust today so might miss the sync
Do I have to create account to join the meeting?
*Thread Reply:* turns out you can just pass your mail
*Thread Reply:* same on my side
*Thread Reply:* i don't remember if i have one
I did something wrong, meeting link should we good now: https://zoom-lfx.platform.linuxfoundation.org/meeting/91671382513?password=424b74a1-43fa-4d0e-885f-c9b5417cf57b
*Thread Reply:* use same e-mail you got invited to
there's a data warehouse (https://www.firebolt.io/) and a streaming platform (https://memphis.dev/) written in Go. so I guess it's not futile to write a Go client? 🙂
Any potential issues with scheduling a meetup on Tuesday, March 19th in Boston that you know of? The Astronomer all-hands is the preceding week
PR to add 1.8 release notes to the docs needs a review: https://github.com/OpenLineage/docs/pull/274
*Thread Reply:* thanks @Jakub Dardziński
Feedback requested on this draft of the year-in-review issue of the newsletter: https://docs.google.com/document/d/1MJB9ughykq9O8roe2dlav6d8QbHZBV2A0bTkF4w0-jo/edit?usp=sharing. Did you give a talk that isn't in the talks section? Is there an important release that should be in the releases section but isn't? Other feedback? Please share.
Feedback requested on a new page for displaying the ecosystem survey results: https://github.com/OpenLineage/docs/pull/275. The image was created for us by Amp. @Julien Le Dem @Harel Shein @tati
*Thread Reply:* Looks great!
*Thread Reply:* Very cool indeed! I wonder if we should share the raw data as well?
*Thread Reply:* Maybe if you could share it here first @Michael Robinson ?
*Thread Reply:* Yes, planning to include a link to the raw data as well and will share here first
*Thread Reply:* @Harel Shein thanks for the suggestion. Lmk if there's a better way to do this, but here's a link to Google's visualizations: https://docs.google.com/forms/d/1j1SyJH0LoRNwNS1oJy0qfnDn_NPOrQw_fMb7qwouVfU/viewanalytics. And a .csv is attached. Would you use this link on the page or link to a spreadsheet instead?
*Thread Reply:* Going with linking to Google's charts for the raw data for now. LMK if you'd prefer another format, e.g. Google sheet
*Thread Reply:* was just looking at it, looks great @Michael Robinson!
*Thread Reply:* Excellent work, @Michael Robinson! 👏 👏 👏
This looks nice!
*Thread Reply:* @Jakub Dardziński @Maciej Obuchowski is the Airflow Provider stuff still current?
*Thread Reply:* yeah, looks current
@Laurent Paris has joined the channel
Hey, created issue around using Spark metrics to increase operational ease of using Spark integration, feel free to comment: https://github.com/OpenLineage/OpenLineage/issues/2402
@Lohit VijayaRenu has joined the channel
Flyte is offering a 20-25-minute speaking slot at their community meeting on March 6th at 9 am PT. They'd like it to be a general introduction to OpenLineage
*Thread Reply:* I can take it if no one else is interested. I’ll be doing a lot of intro to OL presentations in the next few weeks, so I’ll be very practiced by then :)
@Emili Parreno has joined the channel
@Paweł Leszczyński / @tati I'm expecting at least 5 pictures from the meetup today! 😄
*Thread Reply:* could we also get a signup sheet and headcount please? 😬
Hi, is there any reason not to perform a release today as scheduled? I know we released 1.8 only one week ago, but it's the first of the month and @Damien Hawes's PR #2390 to add support for Scala 2.12 and 2.13 in Spark, along with fixes in the Spark and Flink integrations, are unreleased. Would it make more sense to wait for Damien's PR #2395?
*Thread Reply:* This isn't ready.
*Thread Reply:* I'm working on enabling integration tests for the Scala 2.13 variants.
*Thread Reply:* It will take some time, probably Monday / Tuesday next week is my ETA.
*Thread Reply:* Thanks @Damien Hawes, no pressure. But if early next week is your estimate I think it makes sense to wait. So this is GTK
*Thread Reply:* marquez in the wild! 💯💯🚀. thanks for sharing!
*Thread Reply:* they're doing anomaly detection successfully
*Thread Reply:* eg, deprecated tables still being used or tables written in multiple locations
*Thread Reply:* this needs to become a blog post!
*Thread Reply:* Would love to get the slides if they are willing to share!
*Thread Reply:* They've said yes to a blog post. This presentation gets us closer to starting in earnest. I've asked for the slides. Too bad the Confluent organizer wasn't supportive of recording. Maybe next time
Congratulations to our new committer on the team @Damien Hawes!!
*Thread Reply:* hey @Damien Hawes, now you can push your branches into `origin` and the integration tests are automatically approved 🙂
*Thread Reply:* Thank you for the congratulations @Harel Shein. It is humbling to be nominated and accepted.
@Maciej Obuchowski - haha, thanks!
*Thread Reply:* Congratulations @Damien Hawes! Thank you for all your contributions
https://opensource.googleblog.com/2024/02/announcing-google-season-of-docs-2024.html maybe we should improve our docs? 🙂
Agenda items or discussion topics for Thursday's TSC? @Julien Le Dem @Harel Shein
*Thread Reply:* I can take 5 minutes explaining Spark job hierarchy
*Thread Reply:* Nothing specific on my end
*Thread Reply:* more updates on the Spark side of things? @Paweł Leszczyński / @Damien Hawes may want to talk about the recent additions? we could also discuss the proposals for circuit breakers / metrics?
Anyone have an opinion about creating an OpenLineage "company" rather than a group on LinkedIn? You can get metrics from LinkedIn's API if you have a company rather than a group.
*Thread Reply:* Airflow does it: https://www.linkedin.com/company/apache-airflow/
*Thread Reply:* Spark too https://www.linkedin.com/company/apachespark/
*Thread Reply:* I think that's a good idea
We have agreement from Astronomer to move ahead with the Orbit changes discussed today in the committers sync. So I'll start on the exports asap.
Please follow our new OpenLineage company page on LinkedIn. Evidently, the only way to join the company is to add it to your experience history.
FYI: deadline Feb 25th https://2024.berlinbuzzwords.de/call-for-papers/
*Thread Reply:* @Peter Hicks want to submit a column-level lineage talk?
*Thread Reply:* we'll probably submit something about OpenLineage/Streaming with @Paweł Leszczyński
*Thread Reply:* @Willy Lulciuc want to come to Berlin? 😄
*Thread Reply:* we were thinking with Maciej about some kind of Flink & Streaming around OpenLineage talk, as this can be interesting to the community. I'll prepare an abstract next week
*Thread Reply:* updated the talks project on github
*Thread Reply:* @Maciej Obuchowski i wish, i’d need a sponsor 😅
This open-source community management tool looks interesting as a supplement to Orbit: https://ossinsight.io/analyze/OpenLineage/OpenLineage#overview
*Thread Reply:* you can compare projects side-by-side, which is something Orbit doesn't offer
Metaplane added an Airflow provider to send lineage data to them. It's basically a new connection that extends BaseHook, and users need to proactively send callbacks; not sure why they took that approach. https://www.metaplane.dev/blog/airflow-integration
*Thread Reply:* If you're doing all that work, you could actually just use a dag_policy to add it to all the DAGs automatically (sketch at the end of this thread)
*Thread Reply:* https://docs.metaplane.dev/docs/airflow#dag-and-task-lineage I think that's a fairly... unsophisticated approach?
*Thread Reply:* However, if I was redoing Airflow integration from scratch, I'd seriously rethink using connections instead of OPENLINEAGE_URL
or configuring it the way we did
*Thread Reply:* The plugin could load up custom transport types and generate connection types based out of it
*Thread Reply:* Just curious why they did not use listeners... it's not like it's a new feature now, it has been there for like 5 minor releases
*Thread Reply:* we could still add support for Airflow connections with some sort of deprecation of OPENLINEAGE_URL and the current way
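to make the dag_policy idea above concrete - a minimal sketch, with the lineage callback purely hypothetical, assuming Airflow 2.6+ where callback lists are supported:
```
# Hypothetical sketch of the dag_policy idea: a cluster policy in
# airflow_local_settings.py that attaches a lineage callback to every task,
# so users don't wire anything up per DAG. emit_lineage is made up.
from airflow.models.dag import DAG


def emit_lineage(context):
    # placeholder: build and send an OpenLineage event for the finished task
    ...


def dag_policy(dag: DAG) -> None:
    # runs for every DAG the scheduler parses
    for task in dag.tasks:
        callbacks = task.on_success_callback or []
        if not isinstance(callbacks, list):
            callbacks = [callbacks]
        task.on_success_callback = callbacks + [emit_lineage]
```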
Hey, I was working on a PR updating the docs for Python, Java and Airflow (probably Spark is next), and it hit me that we still have those in two places: the README.md inside the package and the openlineage.io site. Both contain much the same information; sometimes the site has more (e.g. Airflow). Do you think it would be a good idea to just put a redirect to the site in the README.md files for the packages? Maybe add some brief description at the top and then redirect the user to the site? In the long term, maintaining both and keeping them in sync is not an optimal solution imho.
*Thread Reply:* I agree -- in addition to the maintenance burden there's the risk posed by out-of-date/conflicting info for users
*Thread Reply:* I've wanted to remove README.md files as user-facing docs for some time.
However, it might be worth keeping those (maybe under a different name) as purely internal development docs - not related to external APIs, but more like "use this incantation to compile the integration".
*Thread Reply:* I added the redirect here https://github.com/OpenLineage/OpenLineage/pull/2448
There was not much information about the compilation and other internal stuff, so I think those docs first need to be created, and then we can keep them inside the package files under some different name, as Maciej mentioned.
2023 OpenLineage Survey Analysis/takeaways What surprised you or struck you as notable in the 2023 survey data? What would you like to see added, changed or removed in the 2024 version? I need your help to ensure we get the most useful and actionable insights we can from this exercise. I created a doc for sharing opinions/comments, but I'd be happy to discuss it in any forum. Here's the doc, including some initial takeaways, as a starting point: https://docs.google.com/document/d/1aiKtKjcFU0AjS46cow6cbx8EvzV0P2KnGQ4rFD4qKLM/edit?usp=sharing.
@Damien Hawes can we share the Gradle plugins that you've implemented in Spark `buildSrc` with the Flink integration too? It would cut down on boilerplate, but not sure how we can do this (without copying code) 🙂
*Thread Reply:* You have to publish the plugin to the Gradle Plugin repository
*Thread Reply:* Seems like we could publish the plugin to the local dir first: https://docs.gradle.org/current/userguide/plugins.html#sec:custom_plugin_repositories
@Maciej Obuchowski in our OL spec, we require `_producer` and `_schemaURL`, but in our airflow provider we only send the `_producer`. Was this an intentional omission of `_schemaURL`?
ahh i think I was confused given it's marked as `_base_skip_redact` here, but looks like `_schemaURL` is being added… need to verify
*Thread Reply:* schemaURL is always sent, the thing is it's incorrect in many cases AFAIR
*Thread Reply:* yeah… we’re validating events and most (if not all) don’t have that field populated. does the airflow provider not set it?
*Thread Reply:* airflow provider uses facets from the `openlineage-python` package
*Thread Reply:* > thing is it’s incorrect in many cases AFAIR
more curious about this comment. why would this be the case?
*Thread Reply:* tbh I’m not sure. the URL for the base schema is correct only for https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json and the facets in it (which is not too many). and for the others it was either just not done or mistakenly repeated with the same pattern?
*Thread Reply:* dunno, looks more like a historical approach that wasn’t adjusted at some point
*Thread Reply:* and.. schemaURL is apparently irrelevant since no one has validated it and reported issues
*Thread Reply:* is there an open issue for this? I feel we should have some guidance here or not require it, but we should decide what we do here
*Thread Reply:* elephant in the room
*Thread Reply:* no, I don’t think there is one
*Thread Reply:* ok, I’ll open one. we’re ingesting events into kafka and were incorrectly applying validation to events based on the spec
*Thread Reply:* @Julien Le Dem ping to address the elephant above 😉 we’re talking about it in this thread
*Thread Reply:* I see. btw generating python facets from json schema would be the best solution, it was too complex so far ;_;
*Thread Reply:* ok, I’ll summarize our discussion in the issue
*Thread Reply:* thanks Willy!
*Thread Reply:* @Willy Lulciuc double-checking - given the example of SQLJobFacet, should the schema URL be set to:
• https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json
• https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json#/$defs/SQLJobFacet
• https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json#/$defs/SQLJobFacet?
or in other words - should it be the same as it is in the Java client? 🙂 which is the last one
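fwiw, a quick way to check what the Python client actually stamps on a facet - a minimal sketch, assuming the pre-codegen `openlineage.client.facet` module:
```
from openlineage.client.facet import SqlJobFacet

# BaseFacet sets _producer and _schemaURL in __attrs_post_init__, so
# inspecting an instance shows exactly what would be emitted.
facet = SqlJobFacet(query="SELECT 1")
print(facet._schemaURL)
```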
Abdallah (Decathlon) made a release request today. https://openlineage.slack.com/archives/C01CK9T7HKR/p1708514231690979
*Thread Reply:* I would ask that if we make a release today, we include the Scala 2.13 support for Spark (and merge the PR for the docs)
*Thread Reply:* I guess we have to decide: is splitting Iceberg off from the main code a reason to hold back the release?
*Thread Reply:* Thoughts @Maciej Obuchowski?
*Thread Reply:* I think we can do a release today/tomorrow, but having functionality removed makes this a much harder choice
*Thread Reply:* What functionality would be removed?
*Thread Reply:* Iceberg support? Or do you propose something else?
*Thread Reply:* Oh, I was under the impression that Iceberg support wasn't going to be removed. Instead the direct dependencies on Iceberg in the core code were being removed, and bundled into their own module, but at the end of the day, the project would still contain classes capable of dealing with Iceberg.
*Thread Reply:* I understood that without https://github.com/OpenLineage/OpenLineage/pull/2437/files there will be no support for Iceberg for 2.13
*Thread Reply:* so I guess we can go and then follow up with next release soon?
*Thread Reply:* There should still be Iceberg support for 2.13
*Thread Reply:* If there wasn't, that would hurt us, and by us, I mean the team I belong to @ Booking and my partner teams.
*Thread Reply:* Basically, if I understand Mattia's direction, we want to be able to say:
OK, OpenLineage has been tested against these versions of Iceberg and found to be working.
*Thread Reply:* Just to confirm, are we waiting for this one https://github.com/OpenLineage/OpenLineage/pull/2446?
*Thread Reply:* We're waiting for comments on this: https://openlineage.slack.com/archives/C01CK9T7HKR/p1708349868363669
*Thread Reply:* No-one has left comments
*Thread Reply:* Which means, at least in my opinion, no-one else has anything to say.
*Thread Reply:* +1. It's been over 48 hrs, so seems safe to go ahead
*Thread Reply:* @Maciej Obuchowski - I'm going to merge that PR, ye?
*Thread Reply:* (First it needs an approval)
*Thread Reply:* Pawel is OOO, I believe
*Thread Reply:* Aye, but I believe @Maciej Obuchowski can approve.
*Thread Reply:* :gh_approved:
*Thread Reply:* Working on the changelog now
*Thread Reply:* oops, forgot about the release vote. please +1
*Thread Reply:* it got merged 👀
*Thread Reply:* amazing feedback on a 10k line PR 😅
*Thread Reply:* maybe they have policy that feedback starts from 10k lines
*Thread Reply:* it wasn’t enough
*Thread Reply:* too big to review, LGTM
*Thread Reply:* sounds like an easy fix? we have time
*Thread Reply:* Easy fix? Yeah - not at all.
*Thread Reply:* Alright, putting the release on hold, then
*Thread Reply:* yeah - we could end up with Spark 2 dependency when using it in Spark 3 context, and that's not good
*Thread Reply:* oh, unless you mean compile time dependency on Spark 2
*Thread Reply:* then no, we need to have it; everything in the `lifecycle` package depends on it 🙂
The idea is that it contains code common to all Spark versions - code that Spark itself mostly does not change - and we have spark2/3/... directories for things that specifically diverge from the baseline
*Thread Reply:* I assume Paweł meant it should not have a Spark dependency, as in it should not depend on a particular Spark version
*Thread Reply:* Correct me if I'm wrong, but it sounds safe to proceed. So here's the changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2452
*Thread Reply:* It's safe
*Thread Reply:* @Michael Robinson let's wait with release till we solve the intermittent test failing issue https://openlineage.slack.com/archives/C065PQ4TL8K/p1708609977907359
Any idea/preference where to put very specific doc information? For example: if you're running Spark in this specific way, do this. Separate doc page seems like overkill, but I'm not sure where to put something like this where it would be discoverable
*Thread Reply:* Maybe the information could go on a new stub about "Special Cases" or something (not a great title but don't know the use case). That way the page isn't just about the exceptional case?
*Thread Reply:* Can we not trust search?
@Michael Robinson do we know when the next release will be going out? I’d sneak in some feedback this week on: • https://github.com/OpenLineage/OpenLineage/pull/2371 • and some of the circuit breaker work if it’s not too late @Paweł Leszczyński @Maciej Obuchowski 😉
*Thread Reply:* it almost happened today, so tomorrow? 🤞
*Thread Reply:* still provide feedback anyway @Willy Lulciuc!
*Thread Reply:* will do! I’ll get some feedback in tmr
Hey, i made a PR that updates the OL Airflow Provider documentation (removing outdated stuff, moving some from current OL docs, adding new info). It's not a small one, but i think it's worth the time. Any feedback is highly appreciated, let me know if something is missing, is not clear or simply wrong 🙂
@Damien Hawes there have been some random test failures after merging the last 2.13 PR, for example https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9437/workflows/4eda8d67-3bd1-4527-84fa-88c19e6774bd/jobs/179622
```
> Task :app:copyIntegrationTestFixtures
Too long with no output (exceeded 10m0s): context deadline exceeded
```
hanging on `copyIntegrationTestFixtures`?
*Thread Reply:* Ah, it happened also before latest PR https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/179357
*Thread Reply:* I wonder if the disk is full on that particular executor
*Thread Reply:* It's always failing at the `:app:copyIntegrationTestFixtures` step
*Thread Reply:* Because I'm not able to replicate this on my local
*Thread Reply:* yeah I can't replicate that too
*Thread Reply:* I'm rerunning with SSH on CI, will take a look at disk space
*Thread Reply:* OK. I was literally about to edit the CI to run df -H
*Thread Reply:* circleci@ip-10-0-52-168:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 146G 13G 133G 9% /
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 1.6G 836K 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/nvme0n1p15 105M 6.1M 99M 6% /boot/efi
*Thread Reply:* And if you run `df -H /home/circleci/openlineage/integration/spark`
*Thread Reply:*
```
circleci@ip-10-0-52-168:~$ df -H /home/circleci/openlineage/integration/spark
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       156G   15G  142G  10% /
```
*Thread Reply:* I guess disk usage is increasing, but very slowly?
```
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273256 137954120  10% /
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273436 137953940  10% /
```
*Thread Reply:* the max memory seems very small?
*Thread Reply:* let's try it? https://github.com/OpenLineage/OpenLineage/pull/2454
*Thread Reply:* ⬆️ ⬆️ it died without filling the disk
*Thread Reply:* Seems to be running again
*Thread Reply:* This one will hang 😐 https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9441/workflows/4037b224-d6f6-4213-afd5-5d7884007e53/jobs/179711
*Thread Reply:* I'm going to push a change to your branch
*Thread Reply:* To see if we can skip the copy
*Thread Reply:* even before, I reran it with SSH and it managed to copy the dependencies after all
*Thread Reply:* it took a lot of time tho
*Thread Reply:* Could you tell which dependencies it was copying?
*Thread Reply:* Like the fixture dependency
*Thread Reply:* or the container dependencies?
*Thread Reply:* not really, I just looked at df numbers
*Thread Reply:*
```
circleci@ip-10-0-113-50:~$ df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        156G   15G  141G  10% /
devtmpfs         8.3G     0  8.3G   0% /dev
tmpfs            8.3G     0  8.3G   0% /dev/shm
tmpfs            1.7G  857k  1.7G   1% /run
tmpfs            5.3M     0  5.3M   0% /run/lock
tmpfs            8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15  110M  6.4M  104M   6% /boot/efi
circleci@ip-10-0-113-50:~$ df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        156G   22G  135G  14% /
devtmpfs         8.3G     0  8.3G   0% /dev
tmpfs            8.3G     0  8.3G   0% /dev/shm
tmpfs            1.7G  1.3M  1.7G   1% /run
tmpfs            5.3M     0  5.3M   0% /run/lock
tmpfs            8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15  110M  6.4M  104M   6% /boot/efi
```
*Thread Reply:* I wonder if the "copyIntegrationTestFixtures" was a red herring
*Thread Reply:* And it's actually the "copyDependencies" step
*Thread Reply:* Because the fixtures JAR is tiny
*Thread Reply:* It should take seconds, at most.
*Thread Reply:* your PR failed on your favorite step, spotless 😂
*Thread Reply:* One of these days, I am going to make a pre-commit or something
*Thread Reply:* we have pre-commit config but it's focused on Python parts https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2969c8f/.pre-commit-config.yaml#L1
*Thread Reply:* spotless is such a low hanging fruit tho...
*Thread Reply:* I did it for one of my local repos
*Thread Reply:* But I didn't get it quite right
*Thread Reply:* Perhaps I should run "spotlessCheck" on commit
*Thread Reply:* At least that way, I know my commit will not be committed if the check fails
*Thread Reply:* https://github.com/jguttman94/pre-commit-gradle
*Thread Reply:* 😞 https://github.com/pre-commit/pre-commit/issues/1110#issuecomment-518939116
*Thread Reply:* Or I should just use intellij to commit with the reformat code option
*Thread Reply:* This is still getting stuck
*Thread Reply:* I wonder if it is the download of the archives
*Thread Reply:* Those archives are "big"
*Thread Reply:* looks like `:app:copySparkBinariesSpark350Scala212` - that's right after it
*Thread Reply:* 300mb should not take that much anyway
*Thread Reply:* It's the downloading from the Apache archive
*Thread Reply:* https://archive.apache.org
*Thread Reply:* It can be really slow at times
*Thread Reply:* > The main archive of all public software releases of the Apache Software Foundation. This is simply a copy of the main distribution directory with the only difference that nothing will be ever removed over here. If you are looking for current software releases, please visit one of our numerous mirrors. Do note that heavy use of this service will result in immediate throttling of your download speeds to either 12 or 6 mbps for the remainder of the day, depending on severity. Continuous abuse (to the tune of more than 40 GB downloaded per week) will cause an automatic ban, so please tune your services to this fact.
🤔
*Thread Reply:* Another reason why we need a container registry
*Thread Reply:* then we'll get rate limited by it 🙂
*Thread Reply:* The problem is the mirrors don't contain all of the versions of Spark
*Thread Reply:* I think they go back to 3.3.4
*Thread Reply:* I think we could start using quay.io for a task:
> quay.io does not restrict anonymous pulls against its repositories (either public or private) and only rate limits in the most severe circumstances to maintain service levels (e.g. tens of requests per second from the same IP address).
*Thread Reply:* Aye, but who do we speak to in order to provision that?
*Thread Reply:* Maybe, just maybe, we can use circle ci's cache mechanism >.>
*Thread Reply:* I wonder, @Maciej Obuchowski - the GCP project that exists, can we not use the container registry in that one?
*Thread Reply:* quay.io is free, GCR is paid and not that cheap https://cloud.google.com/artifact-registry/pricing
*Thread Reply:* I see this: https://quay.io/plans/
> Public repositories are always free.
🙂
*Thread Reply:* I'm setting up an `OpenLineage` organization that we could push the images to
*Thread Reply:* Though, I don't know an "organization email" to use
*Thread Reply:* well, I was faster, I got the `openlineage` name 🙂
*Thread Reply:* Added one of mine, it's possible to change it later
*Thread Reply:* it should be possible now to log in using
```
docker login -u="${QUAY_ACCOUNT_ID}" -p="${QUAY_ACCOUNT_TOKEN}" quay.io
```
from a CI task which is marked `integration-tests`
*Thread Reply:* OK. But I'll need to build those images first, and push them.
*Thread Reply:* and push to https://quay.io/repository/openlineage/spark
*Thread Reply:* as in, the user `QUAY_ACCOUNT_ID` has permission to write there
*Thread Reply:* Can you add me to the org, so I can push the images?
*Thread Reply:* (I still have the binaries downloaded on my local)
*Thread Reply:* using email damien.hawes@booking.com?
*Thread Reply:* check the mail 🙂
*Thread Reply:* I can see spark-3.2.4-scala-2.13
*Thread Reply:* Next one is almost there
*Thread Reply:* These are some chunky images.
*Thread Reply:* Earlier I was thinking of pushing it on CI: checking if the Spark tag exists, then creating the image and pushing it if it does not
*Thread Reply:* but we need to do this only once
*Thread Reply:* and if we want to add support for another Spark/Scala version, we still need to do some work manually
*Thread Reply:* so I guess this does not matter?
*Thread Reply:* The process is fairly trivial.
*Thread Reply:* but it would still be good to have documentation on how to create the image, so you're not bothered by questions next time 🙂
*Thread Reply:* And we can make an `openlineage-spark-docker` directory
*Thread Reply:* Place the gradle in there
*Thread Reply:* It should be a project that changes very rarely
*Thread Reply:* Good find on the quay.io btw
*Thread Reply:* OK. All images have been pushed.
*Thread Reply:* yep, I can see all of them
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2455 should be enought to check if it works?
*Thread Reply:* > It should be a project that changes very rarely
we should be building those images on minor Spark releases too
*Thread Reply:* Yes - that's what the spline folks did
*Thread Reply:* but they never supported 2.13
*Thread Reply:*
```
WARN tc.quay.io/openlineage/spark:spark-3.3.4-scala-2.12 - The architecture 'arm64' for image 'quay.io/openlineage/spark:spark-3.3.4-scala-2.12' (ID sha256:dbdc0c8a3e1b182004c3c850c2ecb767b76cc14e55e3e994a34356630e689e86) does not match the Docker server architecture 'amd64'. This will cause the container to execute much more slowly due to emulation and may lead to timeout failures.
```
*Thread Reply:* you have only arm containers locally?
*Thread Reply:* I'm pushing amd64 containers at the moment
*Thread Reply:* yeah it adds another dimension to the problem
*Thread Reply:* They'll probably take like 15 minutes or so
*Thread Reply:* it would be best if this was multi-arch build I think
*Thread Reply:* anyway it's not a job for now, rerunning the tests and I'm finishing for today
*Thread Reply:* OK - I have extracted the logic for building the images. It now also performs a multi-platform build, targeting linux/amd64 and linux/arm64. This should be enough for the CI/CD pipeline, folks developing with Linux and folks developing with Apple ARM chips.
The PR is here: https://github.com/OpenLineage/OpenLineage/pull/2456
The images with multiple manifests are here:
https://quay.io/repository/openlineage/spark?tab=tags
*Thread Reply:* To explain the situation for people not following the issue:
We had a problem with CI where downloading a 300MB archive from archive.apache.org took over 10 minutes, probably because we were rate limited. That failed our integration tests and blocked the release.
We used those archives to create docker images that were used for integration tests - compiled Spark of a particular version with a particular version of Scala.
The solution to that problem was manually prebuilding the images and pushing them to a free quay.io repository. This is not a problem, since bumping the version of Spark that we test on also requires manual action, and because @Damien Hawes provided a Gradle task to complete the work.
I've created an `openlineage` organization on quay.io where we can push the images - and any other images we could want, for example jupyter already configured with the Spark integration to allow people easier experimentation with OpenLineage.
If no one has any philosophical problems with that solution, I would like to see a few committer volunteers added as admins to the quay.io organization - to increase the bus factor. @Julien Le Dem @Paweł Leszczyński @Damien Hawes - do you want to be added there?
*Thread Reply:* @Maciej Obuchowski - sure.
*Thread Reply:* yes! Thank you Maciej. As long as there's a clear doc and an easy one-liner to update those, this sounds good to me. I think we need to pay extra attention to limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo). Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
*Thread Reply:* > Could we add image building and storing in quay.io as part of our CI when the needed image is not present there?
We discussed this and decided it's already required to do manual work to support an additional Spark version, so automating this won't give us much
> I would love to have some info in the docs what has to be done to support Spark 3.6 once it gets released. Especially, how can one publish a 3.6 image to quay.io?
There is a readme here: https://github.com/OpenLineage/OpenLineage/pull/2456/files#diff-44ca475a04d6a92886f82dd27b47d30c8e57f518aa3dbc467feef43ec1c57638
*Thread Reply:* thanks Maciej
*Thread Reply:* > I think we need to pay extra attention on limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo).
@Julien Le Dem Yes, only authorized users (committers) can upload images. CI won’t write images, just read them; they would be pushed by committers before execution
> Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
Docker verifies the SHA of downloaded images. Do you mean some additional mechanism to avoid potential problems with a compromised committer?
*Thread Reply:* The sha could be saved in the repo and compared so that it can not be changed independently by someone who would have gained access to the credentials.
*Thread Reply:* @Julien Le Dem that would make a lot of sense if the same commit that changes the images could not change the SHA 🙂
*Thread Reply:* unless we've made those SHAs part of something external, for example CircleCI config
*Thread Reply:* but TBH I think it's low risk, CircleCI would limit us fast if somebody would for example put crypto miner there
*Thread Reply:* to me the risk is more to introduce vulnerabilities/backdoors in the OpenLineage released artifact through pushing a cached image that modifies the result of the build.
*Thread Reply:* The idea of saving the image signature in the repo is that you can not use a new image in the build without creating a new commit and traceability.
CFP closes on April 30 https://events.linuxfoundation.org/open-source-summit-europe/
*Thread Reply:* Only a few days after Airflow Summit September 10-12, 2024
🤔
New communication channel: https://medium.com/@openlineageproject
*Thread Reply:* More to come...
I might be the only one running into this issue (for now). I plan on opening a PR for a fix this weekend, but if someone wants to pick it up you're more than welcome to: https://github.com/OpenLineage/OpenLineage/issues/2458
This is a GCP-specific question, but does anyone know the answer? https://openlineage.slack.com/archives/C01CK9T7HKR/p1708010167626709?thread_ts=1707920807.530409&cid=C01CK9T7HKR
FYI: The release of 1.9.0 did not go through
Issue: https://github.com/OpenLineage/OpenLineage/issues/2467 PR: https://github.com/OpenLineage/OpenLineage/pull/2468
*Thread Reply:* The biggest problem with a release, as always, is that you can't test it in any other way than running it 😞
It also seems like, despite the fact that the spark step completed, there was a silent failure or something. I don't see the artefacts on central.
*Thread Reply:* @Michael Robinson needs to manually promote them
*Thread Reply:* @Damien Hawes looking into this today
*Thread Reply:* @Michael Robinson you can rerun release now
*Thread Reply:* @Damien Hawes the release is out. comms to follow shortly
Feedback and input requested for this month's newsletter. I've added sections for the Flink and Spark integrations. Please lmk what you think about the "highlights" I've chosen for these and for Airflow if you have a moment between now and EOD Wednesday. Thanks. https://docs.google.com/document/d/15caPR4q7dOPs6co2x0q5hYSX65ZhHHdhKrFP7dVSRPI/edit?usp=sharing
*Thread Reply:* Would it be okay to highlight 2 new committers? 🙂
*Thread Reply:* 👆why I ask for input! thank you @Jakub Dardziński
Snowflake taking a play from the Databricks playbook? https://www.snowflake.com/en/data-cloud/horizon/
gotta skip today meeting. I hope to see you all next week!
The meetup I mentioned about OpenLineage/OpenTelemetry: https://x.com/J_/status/1565162740246671360 I speak in English but the other two speakers speak in Hebrew
*Thread Reply:* the slides from my part: https://docs.google.com/presentation/d/1BLM2ocs2S64NZLzNaZz5rkrS9lHRvtr9jUIetHdiMbA/edit#slide=id.g11e446d5059_0_1055
*Thread Reply:* thanks for sharing that, that otel to ol comparison is going to be very useful for me today :)
*Thread Reply:* LGTM 🙂
Hey, I created a new Airflow AIP. It proposes instrumenting Airflow Hooks and Object Storage to collect dataset updates automatically, to allow gathering lineage from PythonOperator and custom operators. Feel free to comment on Confluence https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-62+Getting+Lineage+from+Hook+Instrumentation or on the Airflow mailing list: https://lists.apache.org/thread/5chxcp0zjcx66d3vs4qlrm8kl6l4s3m2
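the core idea, as a rough sketch - all names below are hypothetical illustrations, not the actual Airflow API; the real shape is whatever the AIP discussion settles on:
```
# Hypothetical sketch of the AIP's idea, not real Airflow code: hooks report
# the datasets they touch, and a per-process collector aggregates them so a
# lineage backend can read them after task execution.
class HookLineageCollector:
    def __init__(self):
        self.inputs, self.outputs = [], []

    def add_input_dataset(self, uri):
        self.inputs.append(uri)

    def add_output_dataset(self, uri):
        self.outputs.append(uri)


_collector = HookLineageCollector()


def get_hook_lineage_collector():
    # one collector per worker process; instrumented hooks call this
    # from inside their execute path
    return _collector
```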
Hey, does anyone want to add anything here (PR that adds AWS MSK IAM transport)? It looks like it's ready to be merged.
did we miss a step in publishing 1.9.1? going here gives me the 1.8 release: https://search.maven.org/remote_content?g=io.openlineage&a=openlineage-spark&v=LATEST
*Thread Reply:* oh, this might be related to having 2 scala versions now, because I can see the 1.9.1 artifacts
*Thread Reply:* we may need to fix the docs then https://openlineage.io/docs/integrations/spark/quickstart/quickstart_databricks
*Thread Reply:* another place 🙂
*Thread Reply:* https://github.com/OpenLineage/docs/pull/299
Hi, here's a tentative agenda for next week's TSC (on Wednesday at 9:30 PT):
*Thread Reply:* I thought @Paweł Leszczyński wanted to present?
*Thread Reply:* What was the topic? Protobuf or built-in lineage maybe? Or the many docs improvements lately?
*Thread Reply:* I think so? https://github.com/OpenLineage/OpenLineage/pull/2272
*Thread Reply:* Imagine there are lots of folks who would be interested in a presentation on that
*Thread Reply:* There are two things worth presenting: circuit breaker +/or built-in lineage (once it gets merged).
*Thread Reply:* updating the agenda
is there a reason why facet objects have a `_schemaURL` property but `BaseEvent` has `schemaURL`?
*Thread Reply:* yeah, we use `_` to avoid naming conflicts in a facet
*Thread Reply:* same goes for producer
*Thread Reply:* Facets have user defined fields. So all base fields are prefixed
*Thread Reply:* it should be made more clear… recently ran into the issue when validating OL events
*Thread Reply:* it might be another missing point but we set `_producer` in `BaseFacet`:
```
def __attrs_post_init__(self) -> None:
    self._producer = PRODUCER
```
but we don’t do that for `producer` in `BaseEvent`
*Thread Reply:* is this supposed to be like that?
*Thread Reply:* I’m kinda lost 🙂
*Thread Reply:* We should set producer in `BaseEvent` as well
*Thread Reply:* The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
*Thread Reply:* > The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
right, it doesn’t require adding `_`, it just helps in making the difference
and also this reason too:
> Facets have user defined fields. So all base fields are prefixed
> Base events do not
*Thread Reply:* Since users can create custom facets with whatever fields, we just tell them that the `_` prefix is reserved.
*Thread Reply:* So the underscore prefix is a mechanism specific to facets
*Thread Reply:* last question:
we don’t want to block users from setting their own `_producer` field? it seems the only way now is to use the `openlineage.client.facet.set_producer` method to override the default, you can’t just do `RunEvent(…, _producer='my_own')`
*Thread Reply:* The idea is the producer identifies the code that generates the metadata. So you set it once and all the facets you generate have the same
*Thread Reply:* mhm, probably you don’t need to use several producers (at least) per Python module
*Thread Reply:* In airflow each provider should have its own for the facets they produce
*Thread Reply:* just searched for `set_producer` in current docs - no results 😨
*Thread Reply:* a number of things will get to the right track after I’m done with generating code 🙂
*Thread Reply:* Thanks for looking into that. If you can fix the doc by adding a paragraph about that, that would be helpful
*Thread Reply:* I can create an issue at least 😂
*Thread Reply:* there you go: https://github.com/OpenLineage/docs/issues/300 if I missed something please comment
I feel like our getting started with openlineage page is mostly a getting started with Marquez page. but I'm also not sure what should be there otherwise.
*Thread Reply:* https://openlineage.io/docs/guides/spark ?
*Thread Reply:* Unfortunately it's probably not that "quick" given the setup required..
*Thread Reply:* Maybe better? https://openlineage.io/docs/integrations/spark/quickstart/quickstart_local
*Thread Reply:* yeah, that's where I was struggling as well. should our quickstart be platform specific? that also feels strange.
Quick question, for the `spark.openlineage.facets.disabled` property, why do we need to include `[;]` in the value? Why can't we use `,` to act as the delimiter? Why do we need `[` and `]` to enclose the string?
*Thread Reply:* There was some concrete reason AFAIK right @Paweł Leszczyński?
*Thread Reply:* We have logic that converts Spark conf entries to OpenLineageYaml without a need to understand their content. I think `[]` was added for this reason: to know that the Spark conf entry has to be translated into an array.
Initially disabled facets were just separated by `;`. Why not a comma? I don't remember if there was any problem with this.
https://github.com/OpenLineage/OpenLineage/pull/1271/files -> this PR introduced it
https://github.com/OpenLineage/OpenLineage/blob/1.9.1/integration/spark/app/src/main/java/io/openlineage/spark/agent/ArgumentParser.java#L152 -> this code checks if a spark conf value is of array type
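for anyone landing here later, a sketch of what the array syntax looks like in practice, e.g. via PySpark - the disabled facet names are just examples, and the artifact coordinates assume a pre-1.10 release:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-example")
    # assumption: pre-1.10 coordinates without the Scala suffix
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    # [a;b] tells the ArgumentParser the entry is an array; ; separates elements
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)
```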
Hi team, do we have any proposal or previous discussion of Trino OpenLineage integration?
*Thread Reply:* There is an old third-party integration: https://github.com/takezoe/trino-openlineage
It has the right idea to use EventListener, but I can't vouch that it works
*Thread Reply:* Thanks. We are investigating the integration in our org. It will be a good starting point 🙂
*Thread Reply:* I think the ideal solution would be to use EventListener. So far we only have very basic integration in Airflow's TrinoOperator
*Thread Reply:* The only thing I haven't really checked is what the real possibilities are for EventListener in terms of catalog details discovery, e.g. what's the database connection for the catalog.
*Thread Reply:* Thanks for calling this out. We will evaluate and post some observations in the thread.
*Thread Reply:* Thanks Peter. Hey Maciej/Jakub, could you please share the process to follow for contributing a Trino OpenLineage integration? (Design doc and issue?)
There was an issue for the Trino integration, but it was closed recently. https://github.com/OpenLineage/OpenLineage/issues/164
*Thread Reply:* It would be great to see a design doc and maybe some POC if possible. I've reopened the issue for you.
If you get agreement around the design, I don't think there are more formal steps needed, but maybe @Julien Le Dem has another idea
*Thread Reply:* Trino has their plugins directory btw: https://github.com/trinodb/trino/tree/master/plugin including event listeners like: https://github.com/trinodb/trino/tree/master/plugin/trino-mysql-event-listener
*Thread Reply:* Thanks Maciej and Jakub. Yes, the integration will be done with Trino's event listener framework, which has details around the query, source and destination datasets, etc.
> It would be great to see design doc and maybe some POC if possible. I've reopened the issue for you.
Thanks for re-opening the issue. We will add the design doc and POC to the issue.
*Thread Reply:* I agree with @Maciej Obuchowski, a quick design doc followed by a POC would be great. The integration could either live in OpenLineage or Trino but that can be discussed after the POC.
*Thread Reply:* (obviously, adding it to the trino repo would require approval from the trino community)
*Thread Reply:* Gentlemen, we are also actively looking into this topic with the same repo from @takezoe as our base. I have submitted a PR to revive this project - it does work, and the POC is there in the form of a docker-compose.yaml deployment 🙂 some obvious things are missing for now (like kafka output instead of the api) but I think it's a good starting point and it's compatible with the latest Trino and OL
*Thread Reply:* Thanks for laying the foundation for the implementation. Based on it, I feel @Alok would still participate and contribute to it. How about creating a design doc and listing all of the possible TBDs, as @Julien Le Dem suggested?
*Thread Reply:* Adding @takezoe to this thread. Thanks for your work on a Trino integration and welcome!
*Thread Reply:* throwing the CFP for the Trino conference here in case any one of the contributors want to present there https://sessionize.com/trino-fest-2024
*Thread Reply:* I'm also very happy to help with an idea for an abstract
*Thread Reply:* Hey Harel, just FYI we are already engaged with the Trino community to have a talk around the Trino OpenLineage integration and have submitted an abstract for review.
*Thread Reply:* once you release the integration, please add a reference about it to OpenLineage docs! https://github.com/OpenLineage/docs
*Thread Reply:* I think it's ready for review https://github.com/trinodb/trino/pull/21265 just with API sink integration, additional features can be added at @Alok's convenience as next PRs
Hey, there’s a discrepancy between the docs and the actual behavior (the `disabled` option)
*Thread Reply:* I believe we should not extract or emit any OpenLineage events if this option is used
*Thread Reply:* I'm for option 2, don't send any event from task
*Thread Reply:* @Jakub Dardziński do you see any use case for not extracting metadata extraction but still emitting events?
*Thread Reply:* The use case AFAIK was an old SnowflakeOperator bug; we wanted to disable the collection there, since it zombified the task. The events being emitted still gave information about the status of the task as well as non-dataset related metadata
*Thread Reply:* but I think it's less relevant now
*Thread Reply:* ^ this and you might want to have information about task execution because OL is a backend for some task-tracking system
*Thread Reply:* Hm, I believe users don't expect us to spend time processing/extracting OL events if this configuration is used. It's the documented behaviour
*Thread Reply:* the question is if we should change docs or behaviour
*Thread Reply:* I believe the latter
Hi, here’s the agenda:
*Thread Reply:* Looks like a great agenda! Left a couple of comments
*Thread Reply:* @Michael Robinson will you be able to facilitate or do you need help?
*Thread Reply:* I'm also missing from the committer list, but can't comment on slides 🙂
*Thread Reply:* Sorry about that @Kacper Muda. Gave you access just now
*Thread Reply:* We probably need to add you to lists posted elsewhere... I'll check
https://github.com/open-metadata/OpenMetadata/pull/15317 👀
*Thread Reply:* this is awesome
*Thread Reply:* it looks like they use temporary deployments to test...
*Thread Reply:* yeah the GitHub history is wild
Hi, I'm at the conference hotel and my earbuds won't pair with my new mac for some reason. Does the agenda look good? Want to send out the reminders soon. I'll add the OpenMetadata news!
*Thread Reply:* I think we can also add the Datahub PR?
*Thread Reply:* @Paweł Leszczyński prefers to present only the circuit breakers
*Thread Reply:* https://github.com/datahub-project/datahub/pull/9870/files
It's been a while since we've updated the twitter profile. Current description: "A standard api for collecting Data lineage and Metadata at runtime." What would you think of using our website's tagline: "An open framework for data lineage collection and analysis." Other ideas?
can someone grant me write access to our forked `sqlparser-rs` repo?
*Thread Reply:* I should probably add the committer group to it
*Thread Reply:* I have made the committer group maintainer on this repo
https://github.com/OpenLineage/OpenLineage/pull/2514 small but mighty 😉
Regarding the approved release, based on the additions it seems to me like we should make it a minor release (so 1.10.0). Any objections? Changes are here: https://github.com/OpenLineage/OpenLineage/compare/1.9.1...HEAD
We encountered a case of a START event exceeding 2MB in Airflow. This was traced back to an operator with unusually long arguments and attributes. Further investigation revealed that our Airflow events contain redundant data across different facets, leading to unnecessary bloating of event sizes (those long attributes and args were attached three times to a single event). I proposed removing some redundant facets and refining the operator's attribute inclusion logic within AirflowRunFacet. I am not sure how breaking this change is, but some systems might depend on the current setup. Suggesting an immediate removal might not be the best approach, and I'd like to know your thoughts. (A similar problem exists within the Airflow provider.) CC @Maciej Obuchowski @Willy Lulciuc @Jakub Dardziński
https://github.com/OpenLineage/OpenLineage/pull/2509
As mentioned during yesterday's TSC, we can't get insight into DataHub's integration from the PR description in their repo. And it's a very big PR. Does anyone have any intel? PR is here: https://github.com/datahub-project/datahub/pull/9870
Changelog PR for 1.10 is RFR: https://github.com/OpenLineage/OpenLineage/pull/2516
@Julien Le Dem @Paweł Leszczyński Release is failing in the Java client job due to (I think) the version of spotless:
```
Could not resolve com.diffplug.spotless:spotless-plugin-gradle:6.21.0.
Required by:
    project : > com.diffplug.spotless:com.diffplug.spotless.gradle.plugin:6.21.0
No matching variant of com.diffplug.spotless:spotless-plugin-gradle:6.21.0 was found. The consumer was configured to find a library for use during runtime, compatible with Java 8, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '8.4'
```
*Thread Reply:* @Michael Robinson https://github.com/OpenLineage/OpenLineage/pull/2517
fix to broken main: https://github.com/OpenLineage/OpenLineage/pull/2518
*Thread Reply:* Thanks, just tried again
*Thread Reply:* ? it needs approval and merge 😛
*Thread Reply:* Oh oops disregard
There's an issue with the Flink job on CI:
```
* What went wrong:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find io.**********************:**********************_sql_java:1.10.1.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - https://packages.confluent.io/maven/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - file:/home/circleci/.m2/repository/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
     Required by:
         project : > project :shared
         project : > project :flink115
         project : > project :flink117
         project : > project :flink118
```
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2521
*Thread Reply:* @Jakub Dardziński still awake? 🙂
*Thread Reply:* it’s just approval bot
*Thread Reply:* created issue on how to avoid those in the future https://github.com/OpenLineage/OpenLineage/issues/2522
*Thread Reply:* https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/188526 I lack emojis on this server to fully express my emotions
*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1710454645059659 you might have missed that
*Thread Reply:* merge -> rebase -> problem gone
*Thread Reply:* PR to update the changelog is RFR @Jakub Dardziński @Maciej Obuchowski: https://github.com/OpenLineage/OpenLineage/pull/2526
https://github.com/OpenLineage/OpenLineage/pull/2520 It’s a long-awaited PR - feel free to comment!
OpenLineage is trending upward on OSSRank. Please vote!
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ParentRunFacet.json#L20
here the format is `uuid`
however, if you follow the logic for the parent id in the current dbt integration, you might discover that the parent run facet gets assigned the value of the DAG's run_id (which is not a uuid)
@Julien Le Dem, what has higher priority? I think lots of people are using the `dbt-ol` wrapper with the current `lineage_parent_id` macro
*Thread Reply:* It is a uuid because it should be the id of an OL run
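one illustrative way to reconcile a non-uuid run_id with the spec - not what the integration does today, just a sketch of a deterministic mapping:
```
import uuid

# uuid5 is deterministic: the same run_id always yields the same UUID,
# so retries/re-emits stay consistent. The namespace value is arbitrary.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")


def run_id_to_uuid(run_id: str) -> str:
    return str(uuid.uuid5(NAMESPACE, run_id))


print(run_id_to_uuid("scheduled__2024-03-01T00:00:00+00:00"))
```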
where can I find who has write access to OL repo?
*Thread Reply:* Settings > Collaborators and teams
*Thread Reply:* thanks Michael, seems like I don’t have enough permissions to see that
Sorry, I have a dr appointment today and won’t join the meeting
*Thread Reply:* I gotta skip too. Maciej and Pawel are at the Kafka Summit
*Thread Reply:* I hope you’re fine!
Should we cancel the sync today?
looking at XTable today, any thoughts on how we can collaborate with them?
*Thread Reply:* @Julien Le Dem @Willy Lulciuc this reminds me of some ideas we had a few years ago.. :)
*Thread Reply:* hmm.. ok. maybe not that relevant for us, at first I thought this was an abstraction for read/write on top of Iceberg/Hudi/Delta.. but I think this is more of a data sync appliance. would still be relevant for linking together synced datasets (but I don't think it's that important now)
*Thread Reply:* From the introduction https://www.confluent.io/blog/introducing-tableflow/, it looks like they are using Flink for both data ingestion and compaction. It means we should at least consider supporting a Hudi source and sink for Flink lineage 🙂
Eyes on this PR to add OpenMetadata to the Ecosystem page would be appreciated: https://github.com/OpenLineage/docs/pull/303. TIA! @Mariusz Górski
I really want to improve this page in the docs, anyone want to work with me on that?
*Thread Reply:* perhaps also make this part of the PR process, so when we add support for something, we remember to update the docs
*Thread Reply:* I free up next week and would love to chat… obviously, time permitting but the page needs some love ❤️
*Thread Reply:* I can verify the information once you have some PR 🙂
RFR: a PR to add DataHub to the Ecosystem page https://github.com/OpenLineage/docs/pull/304
*Thread Reply:* The description comes from the very brief README in DataHub's GH repo and a glance at the code. No other documentation or resources appear to be available.
*Thread Reply:* Do we have any update on this?
*Thread Reply:* sorry, I will add on the following week
Dagster is launching column-lineage support for dbt using the sqlglot parser https://github.com/dagster-io/dagster/pull/20407
*Thread Reply:* I kinda like their approach of using post-hooks to enable column-level lineage: a custom macro collects information about columns, logs it, and they parse the log after the execution
*Thread Reply:* it doesn’t force the `dbt docs generate` step that some might not want to use
*Thread Reply:* but at the same time it reuses the dbt adapter to make additional calls to retrieve missing metadata
@Paweł Leszczyński interesting project I came across over the weekend: https://github.com/HamaWhiteGG/flink-sql-lineage
*Thread Reply:* Wow, this is something we would love to have (Flink SQL support). It's great to know that people around the globe are working on the same thing and heading in the same direction. Great find @Willy Lulciuc. Thanks for sharing!
*Thread Reply:* At Kafka Summit I talked with Timo Walther from the Flink SQL team and he proposed an alternative approach.
Flink SQL has a stable (across releases) `CompiledPlan` JSON text representation that could be parsed, and it has all the necessary info - as this is used for serializing the actual execution plan both ways.
*Thread Reply:* As Flink SQL converts to transformations before execution, technically speaking our existing solution is already able to create lineage info for Flink SQL apps (not including column lineage and table schemas, which can be inferred within the Flink table environment). I will create a Flink SQL job for e2e testing purposes.
*Thread Reply:* I am also working on the Flink side for table lineage. Hopefully, new lineage features can be released in Flink 1.20.
Sessions for this year's Data+AI Summit have been published. A search didn't turn up anything related to lineage, but did you know Julien and Willy's talk at last year's summit has received 4k+ views? 👀
*Thread Reply:* seems like our talk was not accepted, but I can see 9 sessions on unity catalog 😕
finally merged 🙂
pawel-big-lebowski commented on Nov 21, 2023
whoa
I’ll miss the sync today (on the way to data council)
*Thread Reply:* have fun at the conference!
OK @Maciej Obuchowski - 1 job has many stages; 1 stage has many tasks. Transitively, this means that 1 job has many tasks.
*Thread Reply:* batch or streaming one? 🙂
*Thread Reply:* Doesn't matter. It's the same concept.
Also @Paweł Leszczyński, seems Spark metrics has this:
```
local-1711474020860.driver.LiveListenerBus.listenerProcessingTime.io.openlineage.spark.agent.OpenLineageSparkListener
         count = 12
     mean rate = 1.19 calls/second
 1-minute rate = 1.03 calls/second
 5-minute rate = 1.01 calls/second
15-minute rate = 1.00 calls/second
           min = 0.00 milliseconds
           max = 1985.48 milliseconds
          mean = 226.81 milliseconds
        stddev = 549.12 milliseconds
        median = 4.93 milliseconds
          75% <= 53.64 milliseconds
          95% <= 1985.48 milliseconds
          98% <= 1985.48 milliseconds
          99% <= 1985.48 milliseconds
        99.9% <= 1985.48 milliseconds
```
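side note: those numbers come from Spark's built-in Dropwizard metrics; a sketch of surfacing them without a metrics.properties file, via inline conf - the sink keys are standard Spark metrics config, the session setup itself is an assumption:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metrics-example")
    # route driver metrics (incl. listener processing times) to stdout
    .config("spark.metrics.conf.driver.sink.console.class",
            "org.apache.spark.metrics.sink.ConsoleSink")
    .config("spark.metrics.conf.driver.sink.console.period", "10")
    .config("spark.metrics.conf.driver.sink.console.unit", "seconds")
    .getOrCreate()
)
```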
Do you think Bipan's team could potentially benefit significantly from upgrading to the latest version of openlineage-spark? https://openlineage.slack.com/archives/C01CK9T7HKR/p1711483070147019
*Thread Reply:* @Paweł Leszczyński wdyt?
*Thread Reply:* I think the issue here is that marquez is not able to properly visualize parent run events that Maciej has added recently for a Spark application
*Thread Reply:* So if they downgraded would they have a graph closer to what they want?
*Thread Reply:* I don't see parent run events there?
I'm exploring ways to improve the demo gif in the Marquez README. An improved and up-to-date demo gif could also be used elsewhere -- in the Marquez landing pages, for example, and the OL docs. Along with other improvements to the landing pages, I created a new gif that's up to date and higher-resolution, but it's large (~20 MB).
• We could put it on YouTube and link to it, but that would downgrade the user experience in other ways.
• We could host it somewhere else, but that would mean adding another tool to the stack and, depending on file size limits, could cost money. (I can't imagine it would cost much, but I haven't really looked into this option yet. Regardless of cost, it seems to have the same drawbacks as YT from a UX perspective.)
• We could have GitHub host it in another repo (for free) in the Marquez or OL orgs.
  ◦ It could go in the OL docs because it's likely we'll want to use it in the docs anyway, but even if we never serve it, wouldn't this create issues for local development at a minimum? I opened a PR to do this, which a PR with other improvements is waiting on, but I'm not sure about this approach.
  ◦ It could go in the unused Marquez website repo, but there's a good chance we'll forget it's there and remove or archive the repo without moving it first.
  ◦ In another repo, or even a new one for stuff like this?
Anyone have an opinion or know of a better option?
*Thread Reply:* maybe make it an HTML5 video?
*Thread Reply:* https://wp-rocket.me/blog/replacing-animated-gifs-with-html5-video-for-faster-page-speed/
@Julien Le Dem @Harel Shein how did Data Council panel and talk go?
*Thread Reply:* Was just composing the message below :)
Some great discussions here at data council, the panel was really great and we can definitely feel energy around OpenLineage continuing to build up! 🚀 Thanks @Julien Le Dem for organizing and shoutout to @Ernie Ostic @Sheeri Cabral (Collibra) @Eric Veleker for taking the time and coming down here and keeping pushing more and building the community! ❤️
*Thread Reply:* @Harel Shein did anyone take pictures?
*Thread Reply:* there should be plenty of pictures from the conference organizers, we'll ask for some
*Thread Reply:* Did a search and didn't see anything
*Thread Reply:* Speaker dinner the night before: https://www.linkedin.com/posts/datacouncil-aidatacouncil-ugcPost-7178852429705224193-De46?utmsource=share&utmmedium=memberios
*Thread Reply:* haha. Julien and Ernie look great while I'm explaining how to land an airplane 🛬
*Thread Reply:* The photo gallery is there
*Thread Reply:* awesome! just in time for the newsletter 🙂
*Thread Reply:* Thank you for thinking of us. Onwards and upwards.
I just found that the naming conventions for hive/iceberg/hudi are not listed in the doc https://openlineage.io/docs/spec/naming/. Shall we further standardize them? Any suggestions?
*Thread Reply:* Yes. This also came up in a conversation with one of the maintainers of dbt-core, we can also pick up on a proposal to extend the naming conventions markdown to something a bit more scalable.
*Thread Reply:* What do you think about this proposal? https://github.com/OpenLineage/OpenLineage/pull/1702
*Thread Reply:* Thanks for sharing the info. Will take a deeper look later today.
*Thread Reply:* I think this is a similar topic to resource naming in ODD; it might be worth taking a look for inspiration: https://github.com/opendatadiscovery/oddrn-generator
*Thread Reply:* the thing is we need to have a language-agnostic way of defining those naming conventions and be able to generate code for them, similar to the facets spec
*Thread Reply:* it could also be an idea to embed a micro REST API in each client, so the naming conventions would be managed there and each client (python/java) could run it as a subprocess 🤔
*Thread Reply:* we can also just write it in Rust, @Maciej Obuchowski 😁
*Thread Reply:* no real changes/additions, but starting to organize the doc for now: https://github.com/OpenLineage/OpenLineage/pull/2554
@Maciej Obuchowski we also heard some good things about the sqlglot parser. have you looked at it recently?
*Thread Reply:* I love the fact that our parser is in a type-safe language :)
*Thread Reply:* does it matter after all when it comes to parsing SQL? it might be worth running some comparisons, but it may turn out that sqlglot misses most of the Snowflake dialect that we currently support
*Thread Reply:* We'd miss out on Java-side parsing as well
*Thread Reply:* very importantly this ^
OpenLineage 1.11.0 release vote is now open: https://openlineage.slack.com/archives/C01CK9T7HKR/p1711980285409389
Sorry, I’ll be late to the sync
forgot to mention, but we have the TSC meeting coming up next week. we should start sourcing topics
*Thread Reply:* • 1.10 and 1.11 releases
• Data Council, Kafka Summit, & Boston meetup shout outs and quick recaps
• Datadog poc update or demo?
*Thread Reply:* Discussion item about Trino integration next steps?
*Thread Reply:* Accenture+Confluent roundtable reminder for sure
*Thread Reply:* job to job dependencies discussion item? https://openlineage.slack.com/archives/C065PQ4TL8K/p1712153842519719
*Thread Reply:* I think it's too early for a Datadog update tbh, but I like the job to job discussion. We can also bring up the naming library discussion that we talked about yesterday
*Thread Reply:* Shared a slide deck with you today. (If anyone else would like access, please lmk!)
*Thread Reply:* Friendly reminder: this month's tsc is tomorrow
one more thing, if we want we could also apply for a free Datadog account for OpenLineage and Marquez: https://www.datadoghq.com/partner/open-source/
*Thread Reply:* would be nice for tests
is there any notion of process dependencies in openlineage? i.e. if I have two airflow tasks that depend on each other, with no dataset in between, can I express that in the openlineage spec?
*Thread Reply:* AFAIK no, it doesn't aim to reflect that cc @Julien Le Dem
*Thread Reply:* It is not in the core spec but this could be represented as a job facet. It is probably in the airflow facet right now but we could add a more generic job dependency facet
*Thread Reply:* we do represent hierarchy though - with ParentRunFacet
*Thread Reply:* if we were to add some dependency facet, what would we want to model?
*Thread Reply:* do we also want to model something like Airflow's trigger rules? https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules
*Thread Reply:* I don't think this is about hierarchy though, right? If I understand @Julian LaNeve correctly, I think it's more #2
*Thread Reply:* yeah it's less about hierarchy - definitely more about #2.
assume we have a DAG that looks like this:
Task A -> Task B -> Task C
today, OL can capture the full set of dependencies if we do:
A -> (dataset 1) -> B -> (ds 2) -> C
but it's not always the case that you have datasets between everything. my question was moreso around "how can I use OL to capture the relationship between jobs if there are no datasets in between"
*Thread Reply:* I had opened an issue to track this a while ago but we did not get too far in the discussion: https://github.com/OpenLineage/OpenLineage/issues/552
*Thread Reply:* oh nice - unsurprisingly you were 2 years ahead of me 😆
*Thread Reply:* You can track the dependency both at the job level and at the run level.
At the job level you would do something along the lines of:
```
job: { facets: {
  job_dependencies: {
    predecessors: [
      { namespace: ..., name: ... }, ...
    ],
    successors: [
      { namespace: ..., name: ... }, ...
    ]
  }
}}
```
*Thread Reply:* At the run level you could track the actual task run dependencies:
```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ... ],
    successors: [ ... ],
  }
}}
```
*Thread Reply:* I think the current airflow run facet contains that information in an airflow specific representation: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/plugins/facets.py
*Thread Reply:* I think we should have the discussion in the ticket so that it does not get lost in the slack history
*Thread Reply:* ```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ... ],
    successors: [ ... ],
  }
}}
```
I like this format, but would use the full run/job identifier, as in ParentRunFacet
*Thread Reply:* For the trigger rules I wonder if this is too specific to airflow.
*Thread Reply:* But if there’s a generic way to capture this, it makes sense
Don't forget to register for this! https://events.confluent.io/roundtable-data-lineage/Accenture
This attempt at a SQLAlchemy integration was basically working, if not perfectly, the last time I played with it: https://github.com/OpenLineage/OpenLineage/pull/2088. What more do I need to do to get it to the point where it can be merged as an "experimental"/"we warned you" integration? I mean, other than make sure it's still working and clean it up? 🙂
https://docs.getdbt.com/docs/collaborate/column-level-lineage#sql-parsing
*Thread Reply:* seems like it’s only for dbt cloud
*Thread Reply:* > Column-level lineage relies on SQL parsing.
Was thinking about doing the same thing at some point
*Thread Reply:* Basically with dbt we know schemas, so we also can resolve wildcards as well
*Thread Reply:* but that requires adding capability for providing known schema into sqlparser
*Thread Reply:* that's not very hard to add afaik 🙂
*Thread Reply:* not exactly into sqlparser too
*Thread Reply:* just our parser
*Thread Reply:* yeah, our parser
*Thread Reply:* still someone has to add it :D
*Thread Reply:* some rust enthusiast probably
*Thread Reply:* but also: dbt provides schema info only if you generate catalog.json with generate docs command
*Thread Reply:* Right now we have the dbt-ol wrapper anyway, so we can run another dbt docs command on behalf of the user too
*Thread Reply:* not sure if running commands on behalf of user is good idea, but denoting in docs that running it increases accuracy of column-level lineage is probably a good idea
*Thread Reply:* once we build it
*Thread Reply:* That depends, what are the side effects of running dbt docs?
*Thread Reply:* the other option is similar to dagster's approach - run post-hook macro that prints schema to logs and read the logs with dbt-ol wrapper
*Thread Reply:* which again won't work in dbt cloud - there catalog.json seems like the only option
*Thread Reply:* > That depends, what are the side effects of running dbt docs?
refreshing someone's documentation? 🙂
*Thread Reply:* it would be configurable imho, if someone doesn’t want column level lineage at the price of an additional step, it’s their choice
*Thread Reply:* yup, agreed. I'm sure we can also run dbt docs to a temp directory that we'll delete right after
*Thread Reply:* That's an increase of 17560.5%
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/releases/tag/1.11.3 that’s a lot of notes 😮
*Thread Reply:* the way spark.jars.packages io.openlineage:openlineage_spark:{version} works, every spark job downloads the jar when it runs.
*Thread Reply:* so that's a cool way to track that.
*Thread Reply:* Just learned about this tool that claims to turn downloads, etc., into data that's more usable for insights into users (as opposed to, say, spark jobs): https://about.scarf.sh/
*Thread Reply:* Hello, my jobs are not downloading the JAR, is there some specific setup needed to enable it?
Marquez committers: there's a committer vote open 👀
*Thread Reply:* We still need a few more votes if you can spare a moment to vote over there...
did anyone submit a CFP here? https://sessionize.com/open-source-summit-europe-2024/ it's a linux foundation conference too
*Thread Reply:* looks like a nice conference
*Thread Reply:* too far for me, but might be a train ride for you?
*Thread Reply:* yeah, I might submit something 🙂
*Thread Reply:* and I think there are actually direct trains to Vienna from Warsaw
Hmm @Maciej Obuchowski @Paweł Leszczyński - I see we released 1.11.3, but I don't see the artifacts in central. Are the artifacts blocked?
*Thread Reply:* after last release, it took me some 24h to see openlineage-flink artifact published
*Thread Reply:* I recall something about the artifacts had to be manually published from the staging area.
*Thread Reply:* @Maciej Obuchowski - can you check if the release is stuck in staging?
*Thread Reply:* I recall last time it failed because there wasn't a javadoc associated with it
*Thread Reply:* Nevermind @Paweł Leszczyński @Maciej Obuchowski - it seems like the search indexes haven't been updated.
*Thread Reply:* @Michael Robinson has to manually promote them but it's not instantaneous I believe
I'm seeing some really strange behavior with OL Spark, I'm going to give some data to help out, but these are still breadcrumbs unfortunately. 🧵
*Thread Reply:* the driver for this job is running for more than 5 hours, but the job actually finished after 20 minutes
*Thread Reply:* most of the cpu time in those 5 hours is spent in openlineage methods
*Thread Reply:* it's also not reproducible 😕
*Thread Reply:* DatasetIdentifier.equals?
*Thread Reply:* can you check what calls it?
*Thread Reply:* unfortunately, some of the stack frames are truncated by JVM
*Thread Reply:* maybe this has something to do with SymLink and the lombok implementation of .equals() ?
*Thread Reply:* and then some sort of circular dependency
*Thread Reply:* one possible place, looks like n^2 algorithm: https://github.com/OpenLineage/OpenLineage/blob/4ba93747e862e333267b46a57f02a09264[…]rk3/agent/lifecycle/plan/column/JdbcColumnLineageCollector.java
*Thread Reply:* but is this a JDBC job?
*Thread Reply:* ok, we don't use lang3 Pair a lot - it has to be in ColumnLevelLineageBuilder 🙂
*Thread Reply:* yes.. I'm staring at that class for a while now
*Thread Reply:* what's the rough size of the logical plan of the job?
*Thread Reply:* I'm trying to understand whether we're looking at some infinite loop
*Thread Reply:* or just something done very inefficiently
*Thread Reply:* like every input being added in this manner:
```
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  inputs.computeIfAbsent(exprId, k -> new LinkedList<>());

  Pair<DatasetIdentifier, String> input = Pair.of(datasetIdentifier, attributeName);

  if (!inputs.get(exprId).contains(input)) {
    inputs.get(exprId).add(input);
  }
}
```
it's a candidate: it has to traverse the list returned from inputs for every CLL dependency field added
*Thread Reply:* it looks like we're building a size-N list in N^2 time:
```
inputs.stream()
    .filter(i -> i instanceof InputDatasetFieldWithIdentifier)
    .map(i -> (InputDatasetFieldWithIdentifier) i)
    .forEach(
        i ->
            context
                .getBuilder()
                .addInput(
                    ExprId.apply(i.exprId().exprId()),
                    new DatasetIdentifier(
                        i.datasetIdentifier().getNamespace(), i.datasetIdentifier().getName()),
                    i.field()));
```
🙂
*Thread Reply:* ah, this isn't even used now since it's for new extension-based spark collection
*Thread Reply:* @Paweł Leszczyński this is most likely a future bug ⬆️
*Thread Reply:* I think we're still doing it now anyway:
```
private static void extractInternalInputs(
    LogicalPlan node,
    ColumnLevelLineageBuilder builder,
    List<DatasetIdentifier> datasetIdentifiers) {

  datasetIdentifiers.stream()
      .forEach(
          di -> {
            ScalaConversionUtils.fromSeq(node.output()).stream()
                .filter(attr -> attr instanceof AttributeReference)
                .map(attr -> (AttributeReference) attr)
                .collect(Collectors.toList())
                .forEach(attr -> builder.addInput(attr.exprId(), di, attr.name()));
          });
}
```
*Thread Reply:* and that's linked list - must be pretty slow jumping all those pointers
*Thread Reply:* maybe it's that simple 🙂 https://github.com/OpenLineage/OpenLineage/commit/306778769ae10fa190f3fd0eff7a6482fc50f57f
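A rough sketch of the kind of change being discussed here - swapping the LinkedList for a LinkedHashSet so the duplicate check is O(1) instead of an O(n) scan. The names mirror the snippets above; this is an illustration, not the actual patch:
```
// assumption: Map<ExprId, Set<Pair<DatasetIdentifier, String>>> inputs;
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  // LinkedHashSet keeps insertion order like the LinkedList did, but Set.add
  // is a no-op on duplicates, so the explicit contains() scan goes away
  inputs
      .computeIfAbsent(exprId, k -> new LinkedHashSet<>())
      .add(Pair.of(datasetIdentifier, attributeName));
}
```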
*Thread Reply:* There are some more funny places in the CLL code, like iterating over the list of schema fields and calling a function with the name of each field:
```
schema.getFields().stream()
    .map(field -> Pair.of(field, getInputsUsedFor(field.getName())))
```
then immediately iterating over it a second time to get the field back from its name:
```
List<Pair<DatasetIdentifier, String>> getInputsUsedFor(String outputName) {
  Optional<OpenLineage.SchemaDatasetFacetFields> outputField =
      schema.getFields().stream()
          .filter(field -> field.getName().equalsIgnoreCase(outputName))
          .findAny();
```
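A hedged sketch of how that double scan could be avoided - keep the field the outer stream already has in hand instead of re-resolving it by name. getInputsUsedFor(field) here is a hypothetical overload, not existing code:
```
// one pass: no equalsIgnoreCase lookup of the field we already hold
schema.getFields().stream()
    .map(field -> Pair.of(field, getInputsUsedFor(field)))
```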
*Thread Reply:* I think the time spent by the driver (5 hours) just on these methods smells like an infinite loop?
*Thread Reply:* like, as inefficient as it may be, this is a lot of time
*Thread Reply:* did it finish eventually?
*Thread Reply:* yes... but.. I wonder if something killed it somewhere?
*Thread Reply:* I mean, it can be something like 10000^3 loop 🙂
*Thread Reply:* I couldn't find anything in the logs to indicate
*Thread Reply:* and it has to do those pair comparisons
*Thread Reply:* would be easier if we could see the general size of a plan of this job - if it's something really small then I'm probably wrong
*Thread Reply:* but if there are 1000s of columns... anything can happen 🙂
*Thread Reply:* yeah.. trying to find out. I don't have that facet enabled there, and I can't find the ol events in the logs (it's writing to console, and I think they got dropped)
*Thread Reply:* DevNullTransport 🙂
*Thread Reply:* I think this might be potentially really slow too https://github.com/OpenLineage/OpenLineage/blob/50afacdf731f810354be0880c5f1fd05a1[…]park/agent/lifecycle/plan/column/ColumnLevelLineageBuilder.java
*Thread Reply:* generally speaking, we have a similar problem here to the one we had with the Airflow integration
*Thread Reply:* we are not holding up the job per se, but... we are holding up the spark application
*Thread Reply:* do we have a way to be defensive about that somehow, shutdown hook from spark to our thread or something
*Thread Reply:* there's no magic
*Thread Reply:* circuit breaker with timeout does not work?
*Thread Reply:* it would, but we don't turn that on by default
*Thread Reply:* also, if we do, what should be our default values?
*Thread Reply:* what would not hurt you if you enabled it, 30 seconds?
*Thread Reply:* I guess we should aim much lower with the runtime
*Thread Reply:* yeah, and make sure we emit metrics / logs when that happens
*Thread Reply:* wait, our circuit breaker right now only supports cpu & memory
*Thread Reply:* we would need to add a timeout one, right?
*Thread Reply:* we've talked about it but it's not implemented yet https://github.com/OpenLineage/OpenLineage/blob/3dad978a3a76ea9bb709334f1526086f95[…]o/openlineage/client/circuitBreaker/ExecutorCircuitBreaker.java
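For the record, a purely illustrative sketch of what a timeout variant could look like - the class and method names are made up, not the actual ExecutorCircuitBreaker API:
```
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.*;

class TimeoutCircuitBreakerSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // run the lineage callback under a time budget; give up and return empty
  // (ideally also emitting a metric / warn log) if it blows the budget
  <T> Optional<T> callWithTimeout(Callable<T> callback, Duration timeout) {
    Future<T> future = executor.submit(callback);
    try {
      return Optional.of(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the hanging lineage work
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty();
    }
  }
}
```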
*Thread Reply:* and BTW, no abnormal CPU or memory usage?
*Thread Reply:* I mean, it's using 100% of one core 🙂
*Thread Reply:* it's similar to what aniruth experienced. there's something that, for some types of logical plans, causes recursion-like behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
*Thread Reply:* > If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
if the event would not take 1GB 🙂
*Thread Reply:* > it's similar to what aniruth experienced. there's something that, for some types of logical plans, causes recursion-like behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
what about my thesis that something is just extremely slow in the column-level lineage code?
Good news. @Paweł Leszczyński - the memory leak fixes worked. Our streaming pipelines have run through the weekend without a single OOM crash.
*Thread Reply:* @Damien Hawes Would you please point me to the PR that fixes the issue?
*Thread Reply:* This was the issue: https://github.com/OpenLineage/OpenLineage/issues/2561
There were two PRs:
*Thread Reply:* @Peter Huang ^
*Thread Reply:* @Damien Hawes any other feedback for OL with streaming pipelines you have so far?
*Thread Reply:* It generates a TON of data
*Thread Reply:* There are some optimisations that could be made: the
job start -> stage submitted -> task started -> task ended -> stage complete -> job end
cycle fires more frequently.
*Thread Reply:* This has an impact on any backend using it, as the run id keeps changing. This means the parent suddenly has thousands of jobs as children.
*Thread Reply:* Our biggest pipeline generates a new event cycle every 2 minutes.
*Thread Reply:* "Too much data" is exactly what I thought 🙂 The obvious potential issue with caching is the same issue we just fixed... potential memory leaks, and cache invalidation
*Thread Reply:* > the run id keeps changing
In this case, that's a bug. We'd still need some wrapping event for whole streaming job though, probably other than application start
*Thread Reply:* on the other topic, did those problems stop? https://github.com/OpenLineage/OpenLineage/issues/2513 with https://github.com/OpenLineage/OpenLineage/pull/2535/files
*Thread Reply:* So far, we haven't seen anything.
*Thread Reply:* >> the run id keeps changing
> In this case, that's a bug. We'd still need some wrapping event for whole streaming job though, probably other than application start
That could be quite the deviation though, because in our case, the dataset that is being written to keeps changing, as it's partitioned by date and hour.
when talking about the naming scheme for datasets, would everyone here agree that we generally use {scheme}://{authority}/{unique_name}? where generally authority == namespace
*Thread Reply:* I think so, and if we don’t then we should
*Thread Reply:* ~which brings me to the question why construct dataset name as such~ nvm
*Thread Reply:* please feel free to chime in here too https://github.com/dbt-labs/dbt-core/issues/8725
*Thread Reply:* > where generally authority == namespace
{scheme}://{authority} is the namespace
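To make the convention concrete, a tiny illustrative snippet - the host and table values are made up:
```
String scheme = "postgres";
String authority = "db.example.com:5432";        // host:port
String uniqueName = "mydb.myschema.mytable";     // database.schema.table

String namespace = scheme + "://" + authority;   // the dataset namespace
String fullName = namespace + "/" + uniqueName;  // {scheme}://{authority}/{unique_name}
```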
Hey! I’m new to the world of data :) Would love to know the advantages of using open lineage over open metadata, Thanks!
*Thread Reply:* fun fact is that you can use both: https://github.com/open-metadata/OpenMetadata/pull/15317
*Thread Reply:* also, this would be better in #general 🙂
*Thread Reply:* an even funnier fact is that OL doesn't aim to compete with other tools, but rather to let them integrate with it
*Thread Reply:* Great! So you’d recommend using open metadata as the main platform for metadata collection and integrate it with OL?
*Thread Reply:* I don't have recommendation 🙂 but feel very invited to try out OpenLineage, it's a really great product!
*Thread Reply:* I would say it's not an apples to apples comparison because OpenLineage is a lineage metadata specification and OpenMetadata is a data catalog with a lineage solution. The fact that OpenMetadata and DataHub have recently merged OpenLineage integrations tells you pretty much everything you need to know about where the data world is headed in terms of lineage. 😉 With OpenLineage, you're not bound to one data catalog's set of connectors/extractors, and that's the point of an open and shared spec that's also extensible. I recommend reading the docs and exploring the GitHub repo for more information about the spec and the object model. As Maciej said, you'll probably get more responses to this kind of question in #general . In any case, welcome and have fun exploring!
Apologies in advance for having to leave today's TSC 30 minutes early! Conflict with another meeting
@Paweł Leszczyński - regarding https://github.com/OpenLineage/OpenLineage/issues/2594. I may have a solution, however, I am looking for advice on where to place this code. I need to go quite deep into Spark internals, basically, I need to drop down to the input partition level. I have this POC PR open: https://github.com/OpenLineage/OpenLineage/pull/2600
*Thread Reply:* this code is really weird
*Thread Reply:* create LogicalRDD from the dataframe, then create a dataset backed by this RDD?
*Thread Reply:* And break logical lineage in the process, yes
*Thread Reply:* originDataset is thrown away, right?
*Thread Reply:* okay, I see your solution - definitely something that needs time to spend looking into
*Thread Reply:* regarding proxy, you can also try the forceAccess version: MethodUtils.invokeMethod(Object object, boolean forceAccess, String methodName)
*Thread Reply:* No need. They are public methods
*Thread Reply:* As it's a Scala case class
*Thread Reply:* @Paweł Leszczyński - any thoughts on this?
*Thread Reply:* @Paweł Leszczyński - I guess it could fit into there, but this is more about matching on a LogicalRDD in the first place. We already have LogicalRDD input dataset builders, but they're specific to HadoopFS based things.
I'm thinking about cracking that open and making it a bit more generic, by allowing it to accept strategies for different types of LogicalRDDs, and delegating to a strategy per logical RDD type.
This is a potential rabbit hole though.
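Sketching that strategy idea in a few lines - the interface and names are hypothetical, not existing OL classes:
```
// assumption: a List<LogicalRddDatasetStrategy> strategies field exists on the builder.
// Each strategy knows how to extract input datasets from one flavour of LogicalRDD.
interface LogicalRddDatasetStrategy {
  boolean isDefinedAt(LogicalRDD logicalRdd);
  List<OpenLineage.InputDataset> extract(LogicalRDD logicalRdd);
}

// the generic builder just delegates to the first strategy that matches
List<OpenLineage.InputDataset> apply(LogicalRDD logicalRdd) {
  return strategies.stream()
      .filter(s -> s.isDefinedAt(logicalRdd))
      .findFirst()
      .map(s -> s.extract(logicalRdd))
      .orElse(Collections.emptyList());
}
```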
*Thread Reply:* I am also looking into LogicalRDDVisitor
*Thread Reply:* to check if this should fit that visitor
*Thread Reply:* Again, it could. The thing is that visitor looks for files:
```
@Override
public List<D> apply(LogicalPlan x) {
  LogicalRDD logicalRdd = (LogicalRDD) x;
  List<RDD<?>> fileRdds = Rdds.findFileLikeRdds(logicalRdd.rdd());
  return findInputDatasets(fileRdds, logicalRdd.schema());
}
```
Though, that isn't to say we can't change it.
*Thread Reply:* It looks really hacky but perhaps this is the only way to go with this. Before going this way, I would check if there is no Spark action before that creates this LogicalRDD. If so, we could glue the two logical plans, which would be better in my opinion.
*Thread Reply:* The approach I was thinking of was: LogicalRddInputDatasetBuilder
*Thread Reply:* Because LogicalRDDs are always leaves.
*Thread Reply:* but something needs to create the rdd first
*Thread Reply:* Aye - the thing is, when it's foreachBatch, we don't know what it is, in this case.
*Thread Reply:* We can't see that it came from "foreachBatch"
*Thread Reply:* Isn't it a Spark action that is run within another Spark action? (sry if asking silly questions)
*Thread Reply:* I guess you could see it that way, the problem is spark breaks the logical lineage
*Thread Reply:* but retains the physical lineage
Hi - I'm trying to build a custom extractor and running into an issue:
Broken plugin: [airflow.providers.openlineage.plugins.openlineage] 'NoneType' object has no attribute 'get_operator_classnames'
I can't tell if I'm placing my extractor in the correct location. I've placed it in plugins/custom_extractor/CustomExtractor.py
I've set this env variable AIRFLOW__OPENLINEAGE__EXTRACTORS: 'custom_extractor.CustomExtractor.PythonOLExtractor'
Any ideas here? I can't figure it out from the documentation
*Thread Reply:* I think the issue is that you have some circular dependency somewhere, that breaks the import - that's usually how it's manifested. Other than that it would help if you could share the code
*Thread Reply:* can you check if you have the same issue if you do not import anything from Airflow or OpenLineage in your custom extractor PythonOLExtractor?
*Thread Reply:* @Maciej Obuchowski I believe when I get rid of these imports, things start to work. Can't confirm because then I have a break where Dataset is not defined, etc.
from openlineage.airflow.extractors import TaskMetadata
from openlineage.airflow.extractors.base import BaseExtractor
from openlineage.client.run import Dataset
*Thread Reply:* Do I have to get rid of these imports and manually build the dataset
*Thread Reply:* @Tom Linton try to move the imports locally, into extract_on_complete - and leave them behind typing.TYPE_CHECKING at the top level
*Thread Reply:* something like that
```
from typing import Optional, List, TYPE_CHECKING

if TYPE_CHECKING:
    from openlineage.airflow.extractors import TaskMetadata
    from openlineage.airflow.extractors.base import BaseExtractor
    from openlineage.client.run import Dataset
    from customop.CustomPythonOperator import InsAndOuts

def create_dataset(datasets: List[InsAndOuts]) -> List[Dataset]:
    from openlineage.client.run import Dataset
    return [Dataset(namespace=item.connection,
                    name="{}.{}.{}".format(item.db, item.schema, item.table),
                    facets={}
                    ) for item in datasets]

class PythonOLExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['CustomPythonOperator']

    def extract(self) -> Optional[TaskMetadata]:
        pass

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        from openlineage.airflow.extractors import TaskMetadata
        task = task_instance.task
        inputs = self.operator.get('input_data')
        output = self.operator.get('output_data')
        return TaskMetadata(
            name=task.task_id,
            inputs=create_dataset(inputs),
            outputs=create_dataset(output)
        )
```
*Thread Reply:* can you paste current version or is that it? ⬆️
*Thread Reply:* Current version of my code was just a copy and paste of your message
*Thread Reply:* ah, okay, that was quick example in the notepad
*Thread Reply:* ```
def create_dataset(datasets):
    from openlineage.client.run import Dataset
    return [Dataset(namespace=item.connection,
                    name="{}.{}.{}".format(item.db, item.schema, item.table),
                    facets={}
                    ) for item in datasets]

class PythonOLExtractor:
    def __init__(self, operator):
        super().__init__()
        self.operator = operator

    @classmethod
    def get_operator_classnames(cls):
        return ['CustomPythonOperator']

    def extract(self):
        pass

    def extract_on_complete(self, task_instance):
        from openlineage.airflow.extractors import TaskMetadata
        task = task_instance.task
        inputs = self.operator.get('input_data')
        output = self.operator.get('output_data')
        return TaskMetadata(
            name=task.task_id,
            inputs=create_dataset(inputs),
            outputs=create_dataset(output)
        )
```
maybe try this 🙂
*Thread Reply:* Errors gone, but the operation is failing due to no module names openlineage.airflow
*Thread Reply:* do I need to pip install openlineage?
*Thread Reply:* What version of airflow you're on? The answer depends
*Thread Reply:* Then you need to import from apache.airflow.providers.openlineage
*Thread Reply:* Not openlineage.airflow - and install apache-airflow-providers-openlineage
*Thread Reply:* I'm dying a slow death here 🫠
Now TaskMetadata is not part of airflow.providers.openlineage
*Thread Reply:* yeah, we renamed it OperatorLineage 🙂
https://github.com/apache/airflow/blob/9c4e333f5b7cc6f950f6791500ecd4bad41ba2f9/airflow/providers/openlineage/extractors/base.py#L34
> I'm dying a slow death here 🫠
good example why not to give advice from the phone, you rarely give one without bugs
*Thread Reply:* from airflow.providers.openlineage.extractors import OperatorLineage
*Thread Reply:* I really appreciate all the help you've given me!
*Thread Reply:* Hello, did you manage to get it working? I tried all the things above but still getting Broken plugin: [airflow.providers.openlineage.plugins.openlineage] 'NoneType' object has no attribute 'get_operator_classnames'
😞
*Thread Reply:* ```from typing import List, TYPE_CHECKING
if TYPE_CHECKING: from openlineage.airflow.extractors.base import BaseExtractor
class DBTExtractor(BaseExtractor): @classmethod def getoperatorclassnames(cls) -> List[str]: return ["DbtRunKubernetesOperator"]
def _execute_extraction(self):
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset
input_dataset = Dataset(
namespace="Extractors",
name=f"{self.operator}_Extractors_in",
)
output_dataset = Dataset(
namespace="Extractors",
name=f"{self.operator}_Extractors_out",
)
return OperatorLineage(
inputs=[input_dataset],
outputs=[output_dataset],
)```
*Thread Reply:* If you use the BaseExtractor at top level, you can gate the import behind TYPE_CHECKING
*Thread Reply:* hello, as you can see, already am 🙂
*Thread Reply:* Ah, I wrote the opposite of what I meant. I mean you can't do that, because you're actually using this in the class definition 🙂
@Damien Hawes are you using intellij? If yes, have you solved the issue with intellij not recognizing dependencies between projects after each Gradle refresh? I can add them manually, but it gets dropped when intellij picks dependencies from Gradle again. I think it's because intellij does not respect the particular configuration we're using for dependencies now
*Thread Reply:* I filed an issue with JetBrains a long time ago.
*Thread Reply:* Ultimately, it's IntelliJ's module system not playing nicely with Gradle
*Thread Reply:* It's an annoying thing, that's for sure.
*Thread Reply:* I see, so just like my other issue: https://youtrack.jetbrains.com/issue/IDEA-140707 🙂
If you're using 1.12.0, how are things going? Curious if you're seeing improvements... (cc @Maciej Obuchowski)
*Thread Reply:* they keep making thin wrappers around OpenLineage
*Thread Reply:* ```
import static io.openlineage.spark.agent.util.ScalaConversionUtils.asJavaOptional;

import io.openlineage.client.Environment;
import io.openlineage.client.OpenLineage;
import io.openlineage.spark.agent.ArgumentParser;
import io.openlineage.spark.agent.EventEmitter;
import io.openlineage.spark.agent.JobMetricsHolder;
import io.openlineage.spark.agent.Versions;
import io.openlineage.spark.agent.lifecycle.ContextFactory;
import io.openlineage.spark.agent.lifecycle.ExecutionContext;
import io.openlineage.spark.agent.util.ScalaConversionUtils;
```
*Thread Reply:* ```
/**
 * This code has been referenced from
 * https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter.git
 */
```
*Thread Reply:* I think the repo needs a NOTICE file
*Thread Reply:* They are literally the second datahub 🙂 Why contribute to an open-source project if you can fork it
@Julien Le Dem @Michael Robinson @tati should we keep some initial agenda for Tuesday meetings, maybe just lightweight as a slack thread?
We can try here: for the next Tuesday 23rd April
*Thread Reply:* First point, discuss moving docs to the monorepo - the gains would be:
• having docs in the same PR as the feature/bugfix
• @Michael Robinson would not need to copy the changelog
• we can have doc versioning: the release doc version would be tied to the tag for the particular release
*Thread Reply:* For Marquez we did this for the landing pages and docs, and while we haven't taken advantage of the ability to implement versioning yet, there haven't been any problems that I'm aware of.
*Thread Reply:* Another point for tomorrow: • discussing maintaining milestones and assignees for OL tickets to increase visibility of what people work on or plan to work on. We mentioned it last week, but @Harel Shein and @tati were missing or not able to talk.
*Thread Reply:* View vs table: • When reading from a view or materialized view, should we reference the view or the underlying table (tables?) that the view is based on • https://github.com/trinodb/trino/pull/21265#discussion_r1576169364
@Jakub Dardziński thanks for massive update to generated.schema in Python, I've started to beta test it but found one issue: https://github.com/OpenLineage/OpenLineage/issues/2629
*Thread Reply:* I don’t have a strong opinion on this. If it would enable bi-directional operations on serialization then probably we should do this. The only reason _schemaURL is not a ClassVar is to include it in __attrs_attrs__
*Thread Reply:* btw, what do you use to deserialize JSONs back to classes? cattrs or something else?
*Thread Reply:* cc @Maciej Obuchowski
*Thread Reply:* we can use cattrs converters - cattr.Converter().structure(_data, RunEvent) - if we can do sth about type hints that look like str | None that currently break this approach
https://github.com/OpenLineage/OpenLineage/pull/2632 fixes the main build
Hi, Do we officially support datahub as a consumer for OpenLineage? Do we have any docs on the integration?
Datahub added OL compatible API endpoint just a few weeks ago, in v0.13.1 https://datahubproject.io/docs/releases/#whats-changed
*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1708517956104609 🙂
Hi channel! I have posted this today on #general but I think this may be a more appropriate place: https://openlineage.slack.com/archives/C01CK9T7HKR/p1713866218368999
Basically the issue that we're facing is that we may run the same dbt project from a multistep Airflow DAG - but with different model selectors or configurations.
With the current dbt-ol logic, the dbt wrapper job name always ends up in the format dbt-run-<dbt_project_name>. That's a problem if you share the same dbt project across multiple tasks in the same DAG - maybe using different models, which can point to different dependencies, so you may want those things to be captured in the graph.
My proposal (/workaround) would be to enrich job_name in dbt-ol using the format dbt-run-$OPENLINEAGE_PARENT_ID-<dbt_project_name>, if OPENLINEAGE_PARENT_ID is set. So different Airflow tasks that launch the same project won't have name clashes on the job name.
The other idea that crossed my mind (which is probably more general-purpose) would be to have a generic OPENLINEAGE_NAME_TEMPLATE variable, which would allow us to specify something in the format e.g. dbt-run-${OPENLINEAGE_PARENT_ID}.${JOB_NAME}, without trying to come up with a one-size-fits-all solution. What are your thoughts?
@Maciej Obuchowski after our sync, can you help me merge this in? I might not have the right permissions on that repo
*Thread Reply:* I don't too 😞
Merging is blocked
The base branch does not allow updates. Learn more about protected branches.
*Thread Reply:* only the main OL repo
*Thread Reply:* @Julien Le Dem you can make me maintainer of this repo as well 🙂
*Thread Reply:* The committers group is maintainer as well
*Thread Reply:* hmm.. I can't change base branch protections or override them
*Thread Reply:* only admins can bypass branch protections. Not maintainers
*Thread Reply:* maybe it's just that a pr is not the right way to do this?
*Thread Reply:* I used the fork syncing feature and that created this PR
*Thread Reply:* ah, I see. I'm looking at that doc
*Thread Reply:* What are the 18 commits ahead?
*Thread Reply:* I see, I think only Pawel merged PRs before
*Thread Reply:* those are the commits to add a notice to the readme and merge update PRs
*Thread Reply:* The way branch protection is set up, I think you need an admin to merge every time, because it's not just the branch from upstream.
*Thread Reply:* I think, the way we handle the fork, we should uncheck that read-only check box
*Thread Reply:* yes, let's uncheck this 🙂
The 1.13.0 changelog PR is RFR if anyone has a moment: https://github.com/OpenLineage/OpenLineage/pull/2638
Hi folks, was there any discussion regarding the beam integration since this issue that I missed? I'm working on beam/dataflow lineage for my team and would appreciate some help brainstorming the idea.
*Thread Reply:* I don't think so, but as a former Beam user - I'd be down to help!
*Thread Reply:* @Dominik Dębowczyk might have some experience with Beam as well
Hello, I am trying to use the OL integration with Airflow and Marquez. I am trying to graph workflow lineage within Marquez by using the ParentRunFacet to obtain upstream and downstream task dependencies however, I am not clear on how to send this information to Marquez so that the task hierarchy is visualized in the UI.
*Thread Reply:* Hey Catherine, please post the question in Marquez Slack, there are probably more people that could help you 🙂
Hi, today we've been discussing with the team about the dbt job name disambiguity / templating options (context: https://openlineage.slack.com/archives/C01CK9T7HKR/p1713866218368999).
The job name display / templating path would probably fix the same-dbt-project-different-model issue, but (unlike the job name prefix/suffix intermediate solutions) it also resonates quite well with another issue that we're currently trying to solve (i.e. job name explosion / intelligibility), which @Maciej Obuchowski briefly mentioned on that thread.
Shall we organize a meeting to discuss these options, as it may probably be a quicker way to reach consensus? Or shall I bring it up to tomorrow's committer sync general discussion?
(cc @Arnab Bhattacharyya)
I have a conflict with the committer sync tomorrow, but wanted to raise something that someone asked me about today. What do folks here think about emitting OL events as OpenTelemetry traces? If so, where would you implement that? Motivation is that there are a bunch of tools built around opentelemetry that can leverage this integration and get some immediate value.
*Thread Reply:* I’m not sure about it myself, so I wanted to get more opinions as well
*Thread Reply:* I have an item, too, if I can hijack this thread 🙂. I'd like to propose automating the selection process for the newsletter's contributor of the month and shifting the focus to a PR rather than a contributor, perhaps using Airflow's PR of the month script.
*Thread Reply:* feedback here: we would love to be able to use lineage as otel traces so we can mix lineage with tracing. this can help in a pipeline that starts with spark/flink and continues into output processing in a different pipeline which is more kafka and backend based
*Thread Reply:* Hey @Iftach Schonbaum if you have a specific use case in mind and are willing to work on this, happy to collaborate!
Hi, as discussed during today's committer sync, we're sunsetting the Contributor of the Month feature of the newsletter for the time being and adding an automated PR of the Month feature in its place. Airflow uses a script for this use case that IMO does a pretty good job selecting candidates (we could also create our own script, of course, or fork Airflow's and tweak it if someone wants to take that on). Applied to OpenLineage, the script identifies the following as the top five PRs for the month: Top 5 out of 68 PRs:
Next month I'll give folks more time to vote, but if you have time by 3 PM ET tomorrow would you please vote in the thread? Absent votes we'll just go with #1 unless there are objections. 🙂
*Thread Reply:* how about reactions like when release voting?
*Thread Reply:* sure, works for me
*Thread Reply:* PR #2520: python/client: generate Python facets from JSON schemas. Score: 199.211
*Thread Reply:* PR #2510: sqlparser: update code to conform to upstream sqlparser-rs changes. Score: 46.9
*Thread Reply:* PR #2548: [SPEC] Allow nested struct fields in SchemaDatasetFacet. Score: 40.5
*Thread Reply:* PR #2572: [CLIENT] Fix missing pkg_resources module on Python 3.12. Score: 35.0
*Thread Reply:* PR #2578: [Airflow] Fixed format returned by airflow.macros.lineage_parent_id. Score: 33.6
*Thread Reply:* I’m voting for 2548. @dolfinus went beyond his main language and contributed to the Java code generator (which wasn’t easy)
*Thread Reply:* It's a tie! Thanks for voting
Hi all, starting a thread to collect agenda items for this month's TSC meeting, which is next Wednesday at 9:30 am PT. Please reply here with your items.
*Thread Reply:* Release: 1.13.1
*Thread Reply:* Events: Open Standards for Data Lineage panel
*Thread Reply:* issue templates, label updates in GH (Kacper)
*Thread Reply:* protobuf update (Pawel)
Hi folks, I've put together a couple of ideas for the job name override problem (specifically taking the Airflow+dbt case as an example but it can have wider applications): https://paste.manganiello.tech/?d28179d0c3f3cb07#9TWwtxmsMZ2o7c3EgpxEKQLgQGbSb4QLNdjK9krYG56w. Let me know your thoughts and whether I should publish it as a Github discussion or through other means (e.g. Google doc) (cc @Arnab Bhattacharyya)
I've got a question about the difference between io.openlineage.client and io.openlineage.server - the server classes lack the structure of the client classes. For example, if we look at RunFacets, the client definition contains:
ProcessingEngineRunFacet
NominalTimeRunFacet
etc.
The server version doesn't. This has an implication when you deserialise the events as the server version. Specifically, all information in the facets gets shunted into the additionalProperties field. So I'm just curious as to what is the purpose of the server classes?
*Thread Reply:* the server classes are meant for receiving OL events. Typically a server will accept any facet, including future versions of standard facets. So it deserializes the facets in a generic way. As opposed to the client, where the purpose is to generate facets against the current version of the schema.
*Thread Reply:* Aye - I understand that. It's more around the implication. The server models lose a lot of type information. Compare and contrast the following, using this snippet of code:
```
var serverEvent =
    readResource(
        "test-events.ndjson", io.openlineage.server.OpenLineage.RunEvent.class).get(0);

var clientEvent =
    readResource(
        "test-events.ndjson",
        io.openlineage.client.OpenLineage.RunEvent.class).get(0);
```
The first screenshot shows the structure of the server event and the second screenshot shows the structure of the client event.
*Thread Reply:* Important thing to note: in the server event, all facets are located in the additionalProperties of the facets object, and all properties of any individual facet are located within its own additionalProperties property.
*Thread Reply:* Whilst the client model retains that typing information.
*Thread Reply:* This does imply that a (Java) server deserialising these models and using the facet information will have to perform a lot of casts.
*Thread Reply:* Yes, this is true. I've been thinking of a better way to do this. Possibly something like this:
```
OutputDatasetFacet outputStatistics =
    readServer.getOutputs().get(0).getOutputFacets().getAdditionalProperties().get("outputStatistics");
OutputStatisticsOutputDatasetFacet translated =
    mapper.convertValue(mapper.valueToTree(outputStatistics), OutputStatisticsOutputDatasetFacet.class);
```
*Thread Reply:* If this is a satisfying solution, we could add it to the server model. Something like: > getOutputFacets().getFacet("outputStatistics", OutputStatisticsOutputDatasetFacet.class)
*Thread Reply:* The server model has not been used much so far and can be improved for sure.
*Thread Reply:* another option is to have all the same get{name}Facet methods as the client model but deserialize from the Map when it happens. It might help if we make the additional properties values JsonNode and not just Object.
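A minimal sketch of that accessor idea, assuming Jackson for the re-binding - the method name and placement are illustrative, not an existing API:
```
// pull a facet out of the generic additionalProperties map and re-bind it
// to a concrete client facet class via Jackson
<T> Optional<T> getFacet(Map<String, Object> additionalProperties, String name, Class<T> type) {
  ObjectMapper mapper = new ObjectMapper();
  return Optional.ofNullable(additionalProperties.get(name))
      .map(raw -> mapper.convertValue(raw, type));
}
```
Usage would then read like the proposal above: getFacet(facets.getAdditionalProperties(), "outputStatistics", OutputStatisticsOutputDatasetFacet.class).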
I've opened a PR for the dbt job name duplication issue: https://github.com/OpenLineage/OpenLineage/pull/2658. It doesn't implement the general-purpose name templating solutions I've discussed in my proposal; instead it just adds the dbt profile and model(s) to the name if available. It also adds a new environment variable to control the new behaviour (OPENLINEAGE_DBT_USE_EXTENDED_JOB_NAME) in order to minimize the risk of breaking back-compatibility - i.e. the same workflow run before and after the upgrade may generate different job names. Wondering if this is actually required or if we're ok with having the new syntax rolled out to everyone.
This is my proposal on how we should update labels on GitHub so that they are more useful. I don't need help with the operation itself, but would like some feedback before I start 🙂. I will leave it here for a while; if you think it's a good idea, leave a ➕, else - leave a comment in the doc.
https://docs.google.com/spreadsheets/d/1WYt0IHc2N5pWhMSfN7vlFaXgssV_pdAlL0UxAo5yblE/edit?usp=sharing
I have a conflict for today's meeting and won't be able to join
I have a question about the environment-properties facet. It contains different variables for different environment types, e.g. spark.databricks.clusterUsageTags.clusterName for Databricks, or spark.cluster.name for GCP, but it also can contain OS environment variables like SOME_ENV. Shouldn't this be split into different facets, e.g. spark_databricksInfo, spark_gcpInfo, spark_envVars?
We need to get a release out today, according to our policy. @Michael Robinson will you be able to do it?
*Thread Reply:* FYI: https://github.com/OpenLineage/OpenLineage/pull/2683
*Thread Reply:* I would not guess it, if whole Snowflake syntax was added
Hey all, 2 things from a catch up with @Jens Pfau & @Natalia Gorchakova today (feel free to add more details if I'm missing anything):
*Thread Reply:* 1. The current proposal is to put it under OpenLineage/spec/registry in the same repo and use a CODEOWNERS file to delegate access. Would that work?
*Thread Reply:* 1. I think that works too, as long as we have the codeowners file.
*Thread Reply:* 2. I remember this discussion now. Do we want to capture this as a facet? It does have semantics close to the column lineage facet, only at the table level.
*Thread Reply:* we can create TableDependenciesDatasetFacet? It could also exist, if we want to have duplication with the CLL facet
*Thread Reply:* and we don't always attach the CLL facet, but if we go with an additional facet we'd need to always attach TableDependenciesDatasetFacet - we don't want OL consumers to guess whether they need to use one facet or the other depending on one's existence
*Thread Reply:* I think OpenLineage/spec/registry works. @Natalia Gorchakova to chime in if you disagree.
*Thread Reply:* If we add something like a TableDependenciesDatasetFacet, the consumer would need to reconcile information of inputs and outputs with the content of this facet. Might not be ideal?
But I don't have a better idea either 😞
*Thread Reply:* OpenLineage/spec/registry it is! I'll start a PR with the folder structure, but we need to get the proposal merged in first.
a few comments to address there still
@Julien Le Dem if you have some time?
*Thread Reply:* @Harel Shein should we also define the name under which the facet should be reported? i.e. https://github.com/OpenLineage/OpenLineage/blob/e88aaa147edfebd7d1f399b554b846d951[…]ntegration/spark/shared/facets/spark/v1/logical-plan-facet.json is reported under name spark.logicalPlan
*Thread Reply:* for example, we can: • for each facet definition define the name that should be used to report it • force the naming convention (at least the facet name should start from top level name)
*Thread Reply:* @Harel Shein @Julien Le Dem I tried to map the proposed facet registry schema to GCP needs. And that's what I got:
• name "gcp" with GcpJobFacet / GcpRunFacet as consumer
• name "gcp_dataproc" with GcpDataprocJobFacet / GcpDataprocRunFacet as producer
• name "gcp_SYSTEM" with GcpSYSTEMJobFacet / GcpSYSTEMRunFacet as producer
Is that the way you had in mind?
*Thread Reply:* so, something like this for the registry.json?
```
{
  "producer": {
    "root_doc_url": "https://google_dataproc_docs",
    "produced_facets": [
      "ol:gcp_dataproc:GcpDataprocJobFacet.json",
      "ol:gcp_dataproc:GcpDataprocRunFacet.json",
      ....
    ]
  },
  "consumer": {
    "root_doc_url": "https://google_dataplex_docs",
    "consumed_facets": [
      "ol:gcp_dataplex:GcpJobFacet.json",
      "ol:gcp_dataplex:GcpRunFacet.json",
      ....
    ]
  }
}
```
*Thread Reply:* Sorry for the late reply. This was the week I was travelling. That sounds good to me. I commented on the last piece of this in your PR @Natalia Gorchakova
Should we drop Airflow 2.1? It's not supported by any cloud provider anymore.
*Thread Reply:* Maybe we should go for both 2.1 and 2.2 at the same time, since both lack the listener API and have very limited OL functionality?
I can't attend today's committer sync because of a conflict with a team offsite. See you next week!
👀 Next week's Marquez community meeting will feature a demo of a new data quality feature in the UI. Get the meeting link here: http://bit.ly/MarquezMeet
OpenLineage should be the top reply here 😉 https://www.reddit.com/r/dataengineering/comments/1cvmerf/data_lineage_tools/
*Thread Reply:* I put my 🆙 let's be at the top! 🙂
*Thread Reply:* • How do we deprecate non-standard facets ◦ When we have 1-1 replacement ◦ When we don't?
*Thread Reply:* Related:
Does somebody know if these talks/blog posts are available somewhere else so that we can repair the redirection? https://github.com/OpenLineage/docs/pull/325/files I think it would be good to leave them there even without the URLs, just to show that there are people from OL actively participating in events. WDYT?
*Thread Reply:* The cross-platform one: https://youtu.be/rO3BPqUtWrI?si=Yu7oenQY2RhkERhx
*Thread Reply:* We could link to the 2021 Berlin Buzzwords slides if there's not a video: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/Data%20Pipelines%20Observability%20with%20OpenLineage.pdf
*Thread Reply:* Agree with your idea to keep items even if we can't link out to recordings, etc. When we redesign this, we could make it less obvious whether or not there are links...
Does anyone remember a discussion of adding a contributors doc to the repo? I fear that I volunteered to work on this but only vaguely remember the details. So I'm thinking of revising this to list organizations and companies rather than individuals. If you remember the discussion or have opinions about this, would you please let me know?
*Thread Reply:* I opened a PR to rewrite the doc as a table for acknowledging orgs.
Is anyone available to give this a review? It changes around 100 files across the project (to update the copyright year): https://github.com/OpenLineage/OpenLineage/pull/2712
As discussed multiple times, I think the time has come to drop Airflow 2.1 and 2.2 support https://github.com/OpenLineage/OpenLineage/pull/2710. Feel free to leave any thoughts in the PR 🙂
Friendly reminder: this month's Marquez meeting is tomorrow https://openlineage.slack.com/archives/C065PQ4TL8K/p1715899030556759
Anyone know of any reason we can't merge release commits via PR rather than pushing to main, which evidently is no longer permitted? I can't think of a reason why it wouldn't work but would prefer to avoid a mess if possible. @Julien Le Dem @Maciej Obuchowski @tati @Harel Shein
*Thread Reply:* hmm, as the release dev, you should be able to apply commits to main (or at least within the window of the release)
*Thread Reply:* yeah, I wonder if some settings have been changed?
*Thread Reply:* this might be related to what we discussed about doing for the registry?
*Thread Reply:* a few weeks ago we change that setting in the weekly meeting. I believe Maciej missed that one. I put it back. Maciej and Willy are admins and can tweak the branch protection rules as needed.
*Thread Reply:* The mvn release plugin does push to main as it updates the current version to {next release}-snapshot
To clarify my post above: releases are currently blocked because it seems that a new branch protection is keeping me from pushing to main as our current release process requires. To unblock releases, it seems to me we need either:
• a new role for the release dev with the necessary privileges for bypassing the protection
• a new process (e.g., a series of PRs for the commits generated by the script)
I'm not suggesting we remove the protection. But I do think we need an agreed-upon fix to either the process or the permissions in GH. In my haste to get the release out I opened a PR (since closed). Merging that particular PR wouldn't have unblocked the release because it had both the dev version snapshot commit and the release version commit. Regardless, IMO, a hasty workaround isn't the correct way to fix this.
I have a conflict at 9am and will miss the second half of the meeting tomorrow.
Agenda for today's committer meeting: • Improvements for the release process
*Thread Reply:* we can also follow up on the registry
*Thread Reply:* sorry I missed the sync today, did you get to talk about the registry?
*Thread Reply:* @Harel Shein yes, briefly 😉
*Thread Reply:* Sorry, I missed the committer meeting. Is there a meeting calendar I can subscribe to?
*Thread Reply:* added you to the distro, lmk if you don't see it
It's time to vote for this month's PR of the Month! Applied to OpenLineage, Airflow's script for this use case identifies the following as the top five PRs for the month:
Top 5 out of 53 PRs:
> * PR #2228: [PROPOSAL #2161] Add a Registry of Producers and Consumers in OpenLineage.
> * PR #2658: Support a less ambiguous logic to generate job names.
> * PR #2677: Spark: Add facets to Spark Application events.
> * PR #2719: Spark: Add jobType facet to Spark application events.
> * PR #2693: python: use v2 Python facets.
Would you please vote in the thread by 3 PM ET on Friday? Absent votes I'll go with #1.
*Thread Reply:* Same with my vote. Registry is cool.
*Thread Reply:* plus one for the registry
@Mariusz Górski congrats on getting this PR https://github.com/trinodb/trino/pull/21265 merged!
*Thread Reply:* Oooooh. That'd be a great blog post on the OL blog!
*Thread Reply:* We wrapped up early today, FYI
*Thread Reply:* 😞 time management for this call has been challenging with my new role. does this time of day still make sense for everyone?
*Thread Reply:* would you prefer earlier or later hours?
*Thread Reply:* doesn't matter to me tbh, but I realize it's a late meeting in the day for y'all.
*Thread Reply:* Maybe we can move it earlier in the day for the next few weeks to make it easier for EU folks?
Hi all, here's this month's thread for the TSC meeting agenda (the meeting is next Wednesday). Please reply with any agenda items.
*Thread Reply:* A slide deck draft for the meeting: https://docs.google.com/presentation/d/1T04oYaZAhmxTzzJ7WVup118kw29IJDSs/edit?usp=sharing&ouid=116057523906319252244&rtpof=true&sd=true @Harel Shein
Hey all, I'll be facilitating the monthly TSC meeting tomorrow, would love to get agenda items for the above ^
*Thread Reply:* • ~can we merge this~ https://github.com/OpenLineage/OpenLineage/pull/2740 ?
*Thread Reply:* Wednesday, right?
*Thread Reply:* I can talk about what's new in Airflow integration
@Paweł Leszczyński @Maciej Obuchowski - can you guys give your opinion on this one: https://github.com/OpenLineage/OpenLineage/pull/2650
(Also, how do I get access to approve the run of integration tests from Circle CI?)
*Thread Reply:* I think that you simply have to log in to CircleCI with your GH account, it's synced with OL organisation on GH.
*Thread Reply:* yeah, anyone with push access to repo should be able to approve
Do we have topics for today's committer meeting? I don't have anything now tbh
*Thread Reply:* mostly to finalize the slides for wednesday
*Thread Reply:* all the data catalogs out there just got some competition
*Thread Reply:* coincidentally, databricks is submitting a project for approval at the LFAI meeting today...
*Thread Reply:* integration with OL might become easier
*Thread Reply:* should we create first issue to accept OpenLineage when they actually open the github repo? 🙂
*Thread Reply:* it should happen today
*Thread Reply:* https://github.com/unitycatalog/unitycatalog/issues
*Thread Reply:* https://www.unitycatalog.io/
*Thread Reply:* It supports the Iceberg catalog API too. Another competitor to the Polaris catalog?
If anyone is interested in how to get OpenLineage from AWS Glue using the Spark integration, I can share a link to my company’s beta instructions so you can see how we are handling it.
Also, if anyone knows of a good place for me to figure out which Spark functions are handled, that would be amazing.
*Thread Reply:* Definitely worth taking a look 🙂
> which Spark functions are handled
What do you mean exactly? Handling of the different logical plan types is spread throughout the code
*Thread Reply:* Ah, I assumed it was handled per function… it’s handled by logical plan type. Customers always want to know whether their jobs will be supported and whether they’ll get lineage from them, and it helps to have some kind of guideline (even if it’s “if the Spark engine turns it into an X type of logical plan”).
*Thread Reply:* It also helps us figure out what we might develop for a PR - would an unsupported lineage feature be part of an existing logical plan type or a new one, etc
*Thread Reply:* > Customers always want to know whether their jobs will be supported and whether they’ll get lineage from them, and it helps to have some kind of guideline (even if it’s “if the Spark engine turns it into an X type of logical plan”)
Sometimes the problem is that it works for version X of a connector (or, even worse, only for some version of connector X in combination with Spark version Y)
> It also helps us figure out what we might develop for a PR - would an unsupported lineage feature be part of an existing logical plan type or a new one, etc
Yeah - Spark interfaces are a mess, and connectors do the same thing in multiple different ways. For example, Hive adds its own LogicalPlan nodes, some connectors use DataSourceV1 interfaces (RelationProvider etc.) and some DataSourceV2 ones, which have known LogicalPlan nodes but implement different interfaces underneath.
Overall, the best solution would be implementation of the Java interfaces by the connector itself, which we're currently iterating on based on feedback from connector authors: https://github.com/OpenLineage/OpenLineage/pull/2675
*Thread Reply:* @Sheeri Cabral (Collibra) I'm curious - have you managed to configure OL in Glue from the job itself, or just using the `Job Details` page in the AWS console?
*Thread Reply:* I would be interested too 👀
*Thread Reply:* We have worked with customers to put it in the job itself, as per https://openlineage.io/docs/integrations/spark/configuration/usage/
*Thread Reply:* (the “directly in your application” part, and so far we’ve only tested on Python)
*Thread Reply:* Here’s an example…
```
import sys
from pytz import timezone
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, when

spark = SparkSession\
    .builder\
    .appName('openlineage')\
    .config("spark.driver.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")\
    .config("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")\
    .config('spark.sql.legacy.parquet.int96RebaseModeInRead', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')\
    .getOrCreate()

# The S3 URIs were elided in the original message; these are placeholders.
file_path = "s3://<bucket>/<input-path>"
static_value_fields_path = "s3://<bucket>/<static-value-fields-path>"

file_df = spark.read.format("parquet").load(file_path)
static_value_fields_df = spark.read.format("parquet").load(static_value_fields_path)

# Left join on regulator, then add a literal column and a derived column
join_df = file_df.alias("a")\
    .join(static_value_fields_df.alias("b"), col("a.regulator") == col("b.regulator"), "left")\
    .withColumn("test_data", lit(10))\
    .withColumn("derived_column", when(col('b.foo').isNull(), lit('No foo found')).otherwise(col('b.bar')))

join_a_df = join_df.select("a.*", "test_data", "derived_column")

print("first dataframe")
join_a_df.write.format('parquet').mode('overwrite').save("s3://<bucket>/<output-path>")  # placeholder output path
print("end of the job")
```
*Thread Reply:* (You do need to attach the jar to the job, which is on the job details page I think?)
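*Thread Reply:* For context, here is a minimal sketch of what the “directly in your application” configuration from the docs above could look like in that same builder. The listener class and `spark.openlineage.*` keys come from the OpenLineage Spark integration docs; the transport URL and namespace values are placeholders, and this assumes the OpenLineage Spark agent jar is already attached to the job:
```
from pyspark.sql import SparkSession

# Sketch: OpenLineage settings added to the SparkSession builder.
# The URL and namespace below are placeholder values, not our actual setup.
spark = SparkSession\
    .builder\
    .appName('openlineage')\
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')\
    .config('spark.openlineage.transport.type', 'http')\
    .config('spark.openlineage.transport.url', 'http://localhost:5050')\
    .config('spark.openlineage.namespace', 'glue-jobs')\
    .getOrCreate()
```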
Delta Lake is also part of the LF AI & Data as of a few weeks ago. It will be interesting to discuss integrating the new OL interfaces for Spark datasources in there. It should be easier now that both are part of the same foundation.
*Thread Reply:* @Dominik Dębowczyk is working on the next version of the OL interfaces - we missed a pretty huge "dependency hell" issue with the current ones
Any topic for today's meeting?
~What do you think about making it a requirement that non-committer PRs be approved by two committers? We're thinking of a situation where a committer creates their own PR from a second account and approves it themselves.~ Let's discard this proposal.
*Thread Reply:* Sounds great. Imo any change improving security is very welcome, and this one should not disrupt development in any way, as we have committers who are active daily and will be able to approve a PR without additional delay.
*Thread Reply:* Hmmm… I’m not sure I follow what we’re trying to protect against here. If any committer did that, we would revert the PR and revoke that committer’s privileges immediately, as it does not follow our guidelines.
*Thread Reply:* This approach makes sense if we assume we will detect any unwanted merged change right away. It may be problematic if we overlook a change that introduces a backdoor (we'd have to overlook it again when releasing) or one that uses CI for heavy computations (that would be a problem right away).
*Thread Reply:* I have looked and I don't think any relevant OSS project does what I suggested, so I guess it's not a big deal.
Should we do a release soon? It would be nice to do it before CircleCI removes the Mac executors, to potentially buy us more time before the next release.
*Thread Reply:* > before CircleCI removes the Mac executors
Too late :(
*Thread Reply:* Not really, they just disabled it for today
*Thread Reply:* The problem is they said they would enable the ARM ones in May, before removing the Intel ones
*Thread Reply:* And they haven't done it yet
who's joining the committer sync today? I'm guessing we will take at most 30 minutes since Poland needs to start losing to France at 6pm ⚽
*Thread Reply:* picking up my daughter from camp but I'll try to listen in from the car. good to know I need to be rooting for France
*Thread Reply:* lol
*Thread Reply:* I think both Maciej and Jakub are OOO
*Thread Reply:* Yes, Maciej and Jakub are OOO, I'm also not joining today
*Thread Reply:* Poland did NOT lose to France today (1:1).
The greatest achievement of the Polish team at the Euros happened during Euro 2004. Although the Polish team did not qualify, a few weeks before the Euro we won a friendly match against Greece, which went on to win the whole of Euro 2004. Beating a champion is logically equivalent to being a champion (a kind of transitive relation).
Today there was a draw with the French team. So, although the Polish team is already out, we're in a really good position to become the logically equivalent champion of Euro 2024. Feeling so proud.
*Thread Reply:* It probably already happened, but can I get an invitation to future meetings?
hey folks, was just chatting with @Ibby Khajanchi, who's working on an openlineage integration at bloomberg. they're using great expectations and noticed our OL <> GX integration doesn't support the latest version, so he was asking about contributing!
*Thread Reply:* I'd be more than happy to give a review to that contribution 🙂
*Thread Reply:* Thanks Julian and Jakub. Looking forward to tackling this issue 😀 😀
*Thread Reply:* I left a comment about a decision to make regarding this PR's direction: https://github.com/OpenLineage/OpenLineage/pull/2134
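*Thread Reply:* For anyone following along, this is roughly how the existing (legacy) integration gets wired into a GX checkpoint, which is the part that needs porting to the newer GX API. This sketch assumes the pre-1.0 `action_list` mechanism, and the kwarg names are from memory of the legacy action, so treat them as assumptions:
```
# Hypothetical checkpoint action_list entry for the legacy OL <> GX integration.
# Class/module names follow the openlineage-integration-common package; the
# kwargs (openlineage_host, job_name, etc.) are illustrative assumptions.
action_list = [
    {
        "name": "openlineage",
        "action": {
            "class_name": "OpenLineageValidationAction",
            "module_name": "openlineage.common.provider.great_expectations",
            "openlineage_host": "http://localhost:5000",  # placeholder backend
            "job_name": "ge_validation",                  # placeholder job name
        },
    },
]
```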
Could someone take a look at the PRs in the docs repo? https://github.com/OpenLineage/docs/pulls
Hi, all, it's time to vote on this month's PR of the month! Here are the candidates:
> Top 5 out of 34 PRs:
> * PR #2758: Spark/transport type extraction.
> * PR #2782: Spark, Flink: Fix S3 dataset names.
> * PR #2756: remodeled transformation type.
> * PR #2767: Spark: fallback to spark.sql.warehouse.dir as table namespace.
> * PR #2643: spark: add GCP run and job facets.
Could you vote in the thread by 12 PM ET on Monday? The newsletter is scheduled to go out that afternoon. Absent votes, I'll go with the top scorer, #2758. 🙂
*Thread Reply:* My vote goes for 2758, great job there.
I think #2756 and #2758 are closely related (one is the spec change, the other is the Spark implementation), but since we have to choose one PR, I think the implementation deserves it.
Any discussion topics for today?
Hi, feedback requested on some slides I'm working on for the next Airflow town hall (on 7/10). I'm planning to transition to a quick demo after the last slide. @Maciej Obuchowski
*Thread Reply:* Nice, updated version of "that" slide 🙂
*Thread Reply:* Maybe add a brief mention that the work isn't done yet and the team is working on AIP-62, which would further increase the amount of metadata we can extract from Airflow DAGs?
*Thread Reply:* btw, @Michael Robinson isn't it today? It's today in my calendar
*Thread Reply:* Thanks! No, not today. It's one week from today
*Thread Reply:* only realized my calendar was buggy when I got onto the call with 10 other confused people 🙂
*Thread Reply:* Nice slides @Michael Robinson! I commented on that slide but otherwise this looks good
Hi all, your agenda items requested for the TSC meeting next week 🙂
Hi. Can I get opinions at the bottom of this PR: https://github.com/OpenLineage/OpenLineage/pull/2134. It'll direct how I take this PR. Thank you 😄
Hi folks, is there any interest in adding Protobuf support to the Java & Python clients?
*Thread Reply:* Sure - especially if you want to contribute and help maintain it 🙂
*Thread Reply:* I'll be happy to help and review a PR if needed 🙂
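*Thread Reply:* To make the idea concrete, here's a rough sketch of how a Protobuf transport might plug into the Python client's custom-transport extension point. Everything below is an assumption, not an existing API: there is no OpenLineage .proto schema today, so `run_event_pb2` is a hypothetical module you'd generate with protoc, the attribute and field names are illustrative, and delivery is stubbed out.
```
from openlineage.client.transport import Config, Transport


class ProtobufConfig(Config):
    """Hypothetical config; a real one would carry an endpoint, auth, etc."""


class ProtobufTransport(Transport):
    """Sketch of a transport emitting protobuf-serialized OL run events."""

    kind = "protobuf"               # attribute names assumed from the client's
    config_class = ProtobufConfig   # custom-transport pattern; verify before use

    def __init__(self, config: ProtobufConfig) -> None:
        self.config = config

    def emit(self, event) -> None:
        import run_event_pb2  # hypothetical protoc-generated module
        proto = run_event_pb2.RunEvent()
        proto.event_time = event.eventTime  # illustrative field mapping
        proto.producer = event.producer
        payload = proto.SerializeToString()  # compact wire-format bytes
        self._send(payload)

    def _send(self, payload: bytes) -> None:
        ...  # e.g. HTTP POST with Content-Type: application/x-protobuf
```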
Topics for tomorrow: • singular release of unshaded Spark interfaces: https://github.com/OpenLineage/OpenLineage/pull/2809
• also, agenda for this week's TSC? @Michael Robinson
*Thread Reply:* @Michael Robinson can we discuss the certification process proposal?
CC @Sheeri Cabral (Collibra)
*Thread Reply:* Absolutely!
*Thread Reply:* The doc to review: https://docs.google.com/document/d/1h_PI0HLX7ECVll068EmExZHF5xVYqVNGsNj7DCPiB5Y
Topic for today: Java 17 for Spark 4.0. More details in the docs -> https://docs.google.com/document/d/1tofVBMxDAKsbPV3Rh64SoBUGIPihgAHC1SVJj-WrR9g/edit?usp=sharing
*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - do I understand the problem correctly:
> When running tests, Gradle doesn't use the JAR, but instead uses the compiled (test) classes?
*Thread Reply:* yes, this is what I meant
*Thread Reply:* Oh. This might be a use case for the `test-fixtures` plugin then
*Thread Reply:* 1. Apply the `test-fixtures` plugin
2. Move the shared test code into `src/${testFixturesSourceSetName}/java` and `src/${testFixturesSourceSetName}/resources` (out of the `test` source set)
*Thread Reply:* The `test-fixtures` plugin creates a JAR that can then be declared as a dependency in another Gradle module:
`implementation(project(":foo", configuration: "test-fixtures"))`
I don't know the exact strings, but it should be something like that.
*Thread Reply:* Though, whether this solves any problems that you're currently experiencing, I'm not so sure.
*Thread Reply:* When you apply the `test-fixtures` plugin, the dependency chain looks like this:
`test` -- depends on --> `test-fixtures`
`test` -- depends on --> `main`
`test-fixtures` -- depends on --> `main`
*Thread Reply:* This means the `test-fixtures` source set can see the classes inside the `main` source set
*Thread Reply:* Just like how the `test` source set sees them now
Hi all, here's a slide deck for tomorrow's TSC meeting: https://docs.google.com/presentation/d/1lFbIFDApGzJVX6vRZSnlCUCow-ssQtpM/edit?usp=sharing&ouid=116057523906319252244&rtpof=true&sd=true
*Thread Reply:* Have opinions about what should be highlighted in 1.17 or announced about the project? Please comment in the deck! Have a last-minute idea for a discussion item, etc.? DM me!
*Thread Reply:* (I have nothing to add, I just want to say I looked at the deck and it looks great and thanks for doing this work!)