This is the official start of the OpenLineage initiative. Thank you all for joining. First item is to provide feedback on the doc: https://docs.google.com/document/d/1qL_mkd9lFfe_FMoLTyPIn80-fpvZUAdEIfrabn8bfLE/edit
Thanks all for joining. In addition to the google doc, I have opened a pull request with an initial OpenAPI spec: https://github.com/OpenLineage/OpenLineage/pull/1 The goal is to specify the initial model (just plain lineage) that will be extended with various facets. It is not intended to be restricted to HTTP: those same PUT calls, which return no output, can be translated to any async protocol
For reference, the slides of the kickoff meeting: https://docs.google.com/presentation/d/1bOnm4J7y1JRJBJtSImm-3vvXzvqqkL-UsCShAuub5oU/edit?usp=sharing
Am I the only weirdo that would prefer a Google Group mailing list to Slack for communicating?
*Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through
*Thread Reply:* And I think it is also better for having thoughtful design discussions
*Thread Reply:* I'm happy to create a google group if that would help.
*Thread Reply:* Here it is: https://groups.google.com/g/openlineage
*Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points
*Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending github issues update to that list?
*Thread Reply:* I don't really know how to do that
*Thread Reply:* @Julien Le Dem How about using Github Discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from the repository settings. One positive side I see is that it will be really easy to follow, and there will be one separate place to go and look for the discussions and ideas being discussed.
*Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions
*Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement
@Victor Shafran has joined the channel
👋 Hi everyone!
@Zhamak Dehghani has joined the channel
I've opened a github issue to propose OpenAPI as the way to define the lineage metadata: https://github.com/OpenLineage/OpenLineage/issues/2 I have also started a thread on the OpenLineage group: https://groups.google.com/g/openlineage/c/2i7ogPl1IP4 Discussion should happen there: ^
@Evgeny Shulman has joined the channel
FYI I have updated the PR with a simple generator: https://github.com/OpenLineage/OpenLineage/pull/1
@Daniel Henneberger has joined the channel
Please send me your github ids if you wish to be added to the github repo
@Fabrice Etanchaud has joined the channel
As mentioned on the mailing list, the initial spec is ready for a final review. Thanks to all who gave feedback so far.
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1
The next step will be to define individual facets
I have opened a PR to update the ReadMe: https://openlineage.slack.com/archives/C01EB6DCLHX/p1607835827000100
👍
Iâm planning to merge https://github.com/OpenLineage/OpenLineage/pull/1 soon. That will be the base that we can iterate on and will enable starting the discussion on individual facets
Thank you all for the feedback. I have made an update to the initial spec addressing the final comments
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1
The contributing guide is available here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md Here is an example proposal for adding a new facet: https://github.com/OpenLineage/OpenLineage/issues/9
Welcome to the newly joined members 👋 👋
Hello! Airflow PMC member here. Super interested in this effort
I'm joining this slack now, but I'm basically done for the year, so will investigate proposals etc next year
Hey all 👋 Super curious what people's thoughts are on the best way for data quality tools (e.g. Great Expectations) to integrate with OpenLineage. Probably a Dataset-level facet of some sort (from the 25 minutes of deep spec knowledge I have 😄), but curious if that's something being worked on? @Abe Gong
*Thread Reply:* There's some subtlety here.
*Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations
*Thread Reply:* There isn't as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)
*Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)
*Thread Reply:* This is also an important conceptual unit, since it's the level of analysis where Expectations and data docs would typically attach.
*Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic
*Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet of a dataset profile (or dataset update profile). There are other aspects to clarify as well as @Abe Gong is explaining above.
Also interesting to see where this would hook into Dagster, because one of the many great features of Dagster IMO is that it lets you do stuff like this (albeit without a formal spec). An OpenLineageMaterialization could be interesting
*Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.
Congrats @Julien Le Dem @Willy Lulciuc and team on launching OpenLineage!
*Thread Reply:* Thanks, @Harikiran Nayak! It's amazing to see such interest in the community on defining a standard for lineage metadata collection.
*Thread Reply:* Yep! It's a validation that the problem is real!
Hey folks! Worked on a variety of lineage problems across domains. Super excited about this initiative!
*Thread Reply:* What are your current use cases for lineage?
(for review) Proposal issue template: https://github.com/OpenLineage/OpenLineage/pull/11
for people interested, <#C01EB6DCLHX|github-notifications> has the github integration that will notify of new PRs …
👋 Hello! I'm currently working on lineage systems @ Datadog. Super excited to learn more about this effort
*Thread Reply:* Would you mind sharing your main use cases for collecting lineage?
Hi! I've also been working on a similar topic for some time. Really looking forward to having these ideas standardized 🙂
I would be interested to see how to extend this to dashboards/visualizations, if that still falls within the scope of this project.
*Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness)? Is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what's happening when freshness or quality suffer
*Thread Reply:* 100%. On a granular scale, the difference between a visualization and dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool, Redash would be an example in this case.
*Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.
*Thread Reply:* It could be. It's interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize, which is pretty much equivalent to a job. But you then build certain visualizations off of that "job", and dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.
That's the hard part of this: how do you model a visualization/dashboard across all the possible ways it can be created, since that differs depending on how the tool you use abstracts away creating a visualization.
👋 Hi everyone!
*Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success
A reference implementation of the OpenLineage initial spec is in progress in Marquez: https://github.com/MarquezProject/marquez/pull/880
*Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.
When: Thursday, January 7th at 10AM PST Where: https://us02web.zoom.us/j/89344845719?pwd=Y09RZkxMZHc2U3pOTGZ6SnVMUUVoQT09
*Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec
👋 Hi everyone! I'm one of the co-founders at data.world and looking forward to hanging out here
👋 Hi everyone! I was looking for the roadmap and don't see one. Does it exist?
*Thread Reply:* There's no explicit roadmap so far. With the initial spec defined and the reference implementation in place, the next steps are to define more facets (for example, data shape, dataset size, etc), provide clients to facilitate integrations (java, python, …), and implement more integrations (Spark is in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of facets is to enable numerous independent parallel efforts
*Thread Reply:* Is there something you are interested about in particular?
I have opened a proposal to move the spec to JSON Schema; this will make it more focused and decoupled from HTTP: https://github.com/OpenLineage/OpenLineage/issues/15
Here is a PR with the corresponding change: https://github.com/OpenLineage/OpenLineage/pull/17
Really excited to see this project! I am curious what's the current state and the roadmap of it?
*Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md The process to contribute to the model is described here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md In particular, now we'd want to contribute more facets and integrations. Marquez has a reference implementation: https://github.com/MarquezProject/marquez/pull/880 On the roadmap:
• define more facets: data profile, etc
• more integrations
• java/python client
You can see current discussions here: https://github.com/OpenLineage/OpenLineage/issues
For people curious about following github activity you can subscribe to: <#C01EB6DCLHX|github-notifications>
*Thread Reply:* It is not in #general, as it can be a bit noisy
Random-ish question: why are producer and schemaURL nested under the nominalTime facet in the spec for postRunStateUpdate? It seems like the producer of the metadata isn't related to the time of the lineage event?
*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900
producer and schemaURL are defined in the BaseFacet type, and therefore all facets (including nominalTime) have them.
• The producer is an identifier for the code that produced the metadata. The idea is that different facets in the same event can be produced by different libraries. For example, in a Spark integration, Iceberg could emit its own facet in addition to other facets. The producer identifies what produced what.
• The _schemaURL is the identifier of the version of the schema for a given facet. Similarly, an event could contain a mixture of core facets from the spec as well as custom facets. This makes explicit what the definition for this facet is.
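To make that concrete, here's roughly what a single facet carrying both fields could look like (a sketch only; the URLs below are placeholders, not pinned to a real release):
```
# Illustrative shape only -- the producer and schemaURL values are placeholders.
nominal_time_run_facet = {
    "nominalTime": {
        "_producer": "https://github.com/my-org/my-scheduler-integration",         # what code emitted this facet
        "_schemaURL": "https://example.com/spec/facets/NominalTimeRunFacet.json",  # which schema definition it follows
        "nominalStartTime": "2021-02-08T08:00:00Z",
    }
}
```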
As discussed previously, I have separated a JSON Schema spec for the OpenLineage events from the OpenAPI spec defining an HTTP endpoint: https://github.com/OpenLineage/OpenLineage/pull/17
*Thread Reply:* Feel free to comment, this is ready to merge
*Thread Reply:* Thanks, Julien. The new spec format looks great 👍
And the corresponding code generator to start the java (and other languages) client: https://github.com/OpenLineage/OpenLineage/pull/18
those are merged; we now have a JSON Schema, an OpenAPI spec that extends it, and a generated Java model
Following up on a previous discussion: this proposal and the accompanying PR add the notion of InputFacets and OutputFacets: https://github.com/OpenLineage/OpenLineage/issues/20 In summary, we are collecting metadata about jobs and datasets. At the Job level, when it's fairly static metadata (not changing every run, like the current code version of the job), it goes in a JobFacet. When it is dynamic and changes every run (like the schedule time of the run), it goes in a RunFacet. This proposal adds the same notion at the Dataset level: when it is static and doesn't change every run (like the dataset schema), it goes in a DatasetFacet. When it is dynamic and changes every run (like the input time interval of the dataset being read, or the statistics of the dataset being written), it goes in an InputFacet or an OutputFacet. This enables Job and Dataset versioning logic, to keep track of what changes in the definition of something vs runtime changes
*Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 Thank you.
*Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23
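For illustration, a rough sketch of the distinction being proposed (the facet names below are hypothetical placeholders; the exact shape is what issue #20 and the PR define):
```
# Static facets describe the dataset itself; inputFacets/outputFacets describe
# what this particular run read or wrote. All names here are illustrative.
dataset_entries = {
    "inputs": [{
        "namespace": "snowflake://abc1234",
        "name": "public.source_table",
        "facets": {"schema": {"fields": [{"name": "id", "type": "BIGINT"}]}},
        "inputFacets": {"timeInterval": {"start": "2021-06-01T00:00:00Z",
                                         "end": "2021-06-02T00:00:00Z"}},
    }],
    "outputs": [{
        "namespace": "s3://my-bucket",
        "name": "rollups/daily",
        "facets": {"schema": {"fields": [{"name": "total", "type": "DOUBLE"}]}},
        "outputFacets": {"outputStatistics": {"rowCount": 1500, "size": 1048576}},
    }],
}
```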
Hi, I am really interested in this project and Marquez, but I am a bit unclear about the differences and relationship between the two projects. As I understand it, OpenLineage provides an API specification for tools running jobs (e.g. Spark, Airflow) to send out an event to update the run state of a job; then, for example, Marquez can be the destination for those events and show the data lineage from those run state updates. When you say there is a reference implementation of the OpenLineage spec in Marquez, do you mean there is a /lineage endpoint implemented in the Marquez API https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/OpenLineageResource.java? Then my question is: what is the next step after Marquez has this API? How does Marquez use that endpoint to integrate with Airflow, for example? I did not find usage of that endpoint in the Marquez project. The marquez-airflow library, which integrates Airflow with Marquez, seems to only use the other Marquez APIs to build the data lineage. Or did I misunderstand something? Thank you very much!
*Thread Reply:* Okay, I found that the Spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like Airflow?
*Thread Reply:* Just restating some of my answers from the Marquez Slack for the benefit of folks here.
• OpenLineage defines the schema to collect metadata
• Marquez has a /lineage endpoint implementing the OpenLineage spec to receive this metadata, implemented by the OpenLineageResource you pointed out
• In the future other projects will also have OpenLineage endpoints to receive this metadata
• The Marquez Spark integration produces OpenLineage events: https://github.com/MarquezProject/marquez/tree/main/integrations/spark
• The Marquez Airflow integration still uses the original Marquez API but will be migrated to OpenLineage.
• All new integrations will use OpenLineage metadata
*Thread Reply:* thank you! very clear answer 🙂
Hi Everyone. Just got started with the Marquez REST API and a little bit into the Open Lineage aspects. Very easy to use. Great work on the curl examples for getting started. I'm working with Postman and am happy to share a collection I have once I finish testing. A question about tags --- are there plans for a "post new tag" call in the API? ...or maybe I missed it. Thx. --ernie
*Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300
OpenLineage doesn't have a Tag facet yet (but tags are defined in the Marquez API). Feel free to open a proposal on the github repo. https://github.com/OpenLineage/OpenLineage/issues/new/choose
Hey everyone. What's the story for stream processing (like Flink jobs) for OpenLineage?
It does not fit cleanly with the runEvent model, which says:
It is required to issue 1 START event and 1 of [ COMPLETE, ABORT, FAIL ] event per run.
since unbounded stream jobs usually do not complete.
I'd imagine a few "workarounds" that work for some cases. For example, imagine a job calculating hourly aggregations of transactions and dumping them into parquet files for further analysis. The job could issue an OTHER event type, adding an additional output dataset every hour. Another option would be to create a new "run" every hour, just indicating the added data.
*Thread Reply:* Ha, I signed up just to ask this precise question!
*Thread Reply:* I'm still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?
*Thread Reply:* A run event can be emitted when it starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send events periodically with state RUNNING to inform the system that the job is healthy.
Similar to @Maciej Obuchowski question about Flink / Streaming jobs - what about Streaming sources (eg: a Kafka topic)? It does fit into the dataset model, more or less. But, has anyone used this yet for a set of streaming sources? Particularly with schema changes over time?
Hi @Maciej Obuchowski and @Adam Bellemare, streaming jobs are meant to be covered by the spec but I agree there are a few details to iron out.
In particular, streaming jobs still have runs. Even if they run continuously, they do not run forever: you want to track that a job was started at a point in time with a given version of the code, then stopped and started again after being upgraded, for example.
I agree with @Maciej Obuchowski that we would also send OTHER events to keep track of progress.
For example one could track checkpointing this way.
For a Kafka topic you could have streaming dataset specific facets or even Kafka specific facets (ex: list of offsets we stopped reading at, schema id, etc )
*Thread Reply:* That's a good idea.
Now I'm wondering - let's say we want to track at which offset a checkpoint ended processing. That would mean we want to expose checkpoint id, time, and offset. I suppose we don't want to overwrite previous checkpoint info, so we want to have some collection of data in this facet.
Something like appendable facets would be nice, to just add new checkpoint info to the collection, instead of having to push all the checkpoint info every time we just want to add a new data point.
*Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals
*Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose
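To make the idea concrete, a hypothetical input facet for a streaming run could look something like the sketch below (names and fields are made up for discussion; nothing like this exists in the spec yet):
```
# Hypothetical facet recording where a streaming run's last checkpoint stopped reading.
kafka_offsets_facet = {
    "kafkaOffsets": {
        "topic": "transactions",
        "checkpointId": "chk-42",
        "checkpointTime": "2021-05-01T12:00:00Z",
        "offsets": [  # per-partition offsets at the checkpoint
            {"partition": 0, "offset": 1042},
            {"partition": 1, "offset": 998},
        ],
    }
}
```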
Hi everyone, I just heard about OpenLineage and would like to learn more about it. The talks in the repo explain the purpose and general ideas nicely, but I have a couple of questions. Are there any working implementations to produce/consume the spec? Also, are there any discussions/guides on standard information, naming conventions, etc. in the facets?
• The Spark integration using OpenLineage: https://github.com/MarquezProject/marquez/tree/main/integrations/spark
• in particular:
  ◦ A simple OpenLineage client (we're working on adding this to the OpenLineage repo): https://github.com/MarquezProject/marquez/tree/b758751b6c0ba6d2f0da1ba7ec636b73317[…]450/integrations/spark/src/main/java/marquez/spark/agent/client
  ◦ emitting events:
    ▪ https://github.com/MarquezProject/marquez/blob/b758751b6c0ba6d2f0da1ba7ec636b73317[…]ava/marquez/spark/agent/lifecycle/SparkSQLExecutionContext.java
    ▪ https://github.com/MarquezProject/marquez/blob/b758751b6c0ba6d2f0da1ba7ec636b73317[…]ava/marquez/spark/agent/lifecycle/SparkSQLExecutionContext.java
• The Marquez OpenLineage endpoint: https://github.com/MarquezProject/marquez/blob/893beddcb7dbc4d4b7b994f003ce461a478[…]bf466/api/src/main/java/marquez/service/OpenLineageService.java
Marquez has a reference implementation of an OpenLineage endpoint. The Spark integration emits OpenLineage events.
Thank you @Julien Le Dem!!! Will take a close look
Q related to People/Teams/Stakeholders/Owners with regards to Jobs and Datasets (didn't find anything in search):
Let's say I have a dataset, and there are a number of other downstream jobs that ingest from it. In the case that the dataset is mutated in some way (or deleted, archived, etc), how would I go about notifying the stakeholders of that set about the changes? Just to be clear, I'm not concerned about the mechanics of doing this, just that there is someone that needs to be notified, who has self-registered on this set. Similarly, I want to manage the datasets I am concerned about, where I can grab a list of all the datasets I tagged myself on.
This seems to suggest that we could do with additional entities outside of Dataset, Run, Job. However, at the same time, I can see how this can lead to an explosion of other entities. Any thoughts on this particular domain? I think I could achieve something similar with aspects, but this would require that I update the aspect on each entity if I want to wholesale update the user contact, say their email address.
Has anyone else run into something like this? Have you any advice? Or is this something that may be upcoming in the spec?
*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags, as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to "should there be some sort of generic entity that can enable relationships between itself and Datasets, Jobs, and Runs as part of an integration element?"
*Thread Reply:* That's a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as an aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it's available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.
*Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/…? What else are you thinking about?
*Thread Reply:* Let me pull in my colleagues
*Thread Reply:* 👋 Hi Julien. I'm Olessia, I'm working on the metadata collection implementation with Adam. Some thoughts on this:
*Thread Reply:* To start off, we're thinking that there often isn't a single owner, but rather a set of Stakeholders that evolve over time. So we'd like to be able to attach multiple entries, possibly of different types, to a Dataset. We're also thinking that a dataset should have at least one owner. So a few things I'd like to confirm/discuss options:
Curious to hear your thoughts on all of this!
*Thread Reply:* > To start off, we're thinking that there often isn't a single owner, but rather a set of Stakeholders that evolve over time. So we'd like to be able to attach multiple entries, possibly of different types, to a Dataset. We're also thinking that a dataset should have at least one owner. So a few things I'd like to confirm/discuss options:
> -> If I were to stay true to the spec as it's defined atm I wouldn't be able to add a required facet. True/false?
Correct. The spec defines what facets look like (and how you can make your own custom facets) but it does not make statements about whether facets are required. However, you can have your own validation and make certain things required, if you wish, on the client side.
> - According to the readme, "...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely". If we were to store multiple stakeholders, we'd have a field "stakeholders" and its value would be a list?
Yes, I would indeed consider such a facet on the dataset with the stakeholders.
> This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I
> a) add individuals to the list
You would provide the new list of stakeholders. OpenLineage standardizes lineage collection and defines a format for expressing metadata. Marquez will keep track of how the metadata has evolved over time.
> b) track changes to the list over time. Let me know what I'm missing, because based on what you said above tracking facet changes over time is possible.
Each event is an observation at a point in time. In a sense they are each immutable. There's a "current" version but also all the previous ones stored in Marquez. Marquez stores each version of a dataset it received through OpenLineage and exposes an API to see how that evolved over time.
> - Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
The scheduler emits the information that it knows about. For example: "I started this job and it's reading from this dataset and writing to this other dataset." It may or may not be in the domain of the scheduler to know the list of stakeholders. If not, then you could emit different types of events to add a stakeholder facet to a dataset. We may want to refine the spec for that. Actually I would be curious to hear what you think should be the source of truth for stakeholders. It is not the intent to force everything to come from the scheduler.
> - I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond whatâs described in the spec is required. In this context, we could add a required Stakeholder facets on a Dataset, and potentially even additional end points to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?
*Thread Reply:* Marquez existed before OpenLineage. In particular, the /run endpoint to create and update runs will be deprecated as the OpenLineage /lineage endpoint replaces it. At the moment we are mapping OpenLineage metadata to Marquez. Soon Marquez will have all the facets exposed in the Marquez API. (See: https://github.com/MarquezProject/marquez/pull/894/files) We could make Marquez configurable or pluggable for validation purposes. There is already a notion of a LineageListener, for example. Although Marquez collects the metadata, I feel like this validation would be better handled upstream or with some other mechanism. The question is: when do you create a dataset vs when do you become a stakeholder? What are the various stakeholders, and what is the responsibility of the minimum one stakeholder? I would probably make it required, when deploying the job, that the stakeholder is defined. This would apply to the output dataset and would be collected in Marquez.
In general, you are very welcome to make suggestions on additional endpoints for Marquez, and I'm happy to discuss this further as those ideas are progressing.
> Curious to hear your thoughts on all of this! Thanks for taking the time!
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200
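For illustration, a custom dataset facet carrying multiple stakeholders could look roughly like this (hypothetical facet name and fields, not part of the spec; each event carries the full current list, and Marquez keeps the history of previous versions):
```
# Hypothetical custom facet -- field names and URLs are placeholders.
stakeholders_facet = {
    "stakeholders": {
        "_producer": "https://github.com/my-org/metadata-publisher",
        "_schemaURL": "https://example.com/facets/StakeholdersDatasetFacet.json",
        "stakeholders": [
            {"type": "owner", "id": "team-data-platform"},
            {"type": "consumer", "id": "user:olessia", "contact": "olessia@example.com"},
        ],
    }
}
```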
Thanks for the Python client submission @Maciej Obuchowski https://github.com/OpenLineage/OpenLineage/pull/34
I also have added a spec to define a standard naming policy. Please review: https://github.com/OpenLineage/OpenLineage/pull/31/files
We now have a python client! Thanks @Maciej Obuchowski
Question, what do you folks see as the canonical mechanism for receiving OpenLineage events? Do you see an agent like statsd? Or do you see this as purely an API spec that services could implement? Do you see producers of lineage data writing code to send formatted OpenLineage payloads to arbitrary servers that implement receipt of these events? Curious what the long-term vision is here related to how an ecosystem of producers and consumers of payloads would interact?
*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using OpenLineage events to sync between systems)
*Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting openlineage
*Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.
*Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.
*Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that's persisted in postgres.
*Thread Reply:* sorry, I was away last week. Yes that sounds right.
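To make the webhook-style flow concrete, here is a minimal sketch of a producer POSTing a single event to a local Marquez instance (default port and the /api/v1/lineage endpoint assumed; job and dataset names are placeholders):
```
import json
import uuid
from datetime import datetime, timezone

import requests

# Minimal sketch: emit one START event to a local Marquez backend.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "my_job"},
    "inputs": [{"namespace": "snowflake://abc1234", "name": "public.source_table"}],
    "outputs": [{"namespace": "s3://my-bucket", "name": "reports/daily"}],
    "producer": "https://github.com/my-org/my-pipeline",
}

response = requests.post(
    "http://localhost:5000/api/v1/lineage",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
```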
Hi everyone, I just started discovering OpenLineage and Marquez; it looks great and the quick-start tutorial is very helpful! One question though: I pushed some metadata to Marquez using the Lineage POST endpoint, and when I try to confirm that everything was created using the Marquez REST API, everything is there ... but I don't see these new objects in the Marquez UI... what is the best way to investigate where the issue is?
*Thread Reply:* Welcome, @Jakub Moravec (IBM/Manta) 👋. Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner of the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.
Hi friends, we're exploring OpenLineage and while building out integration for existing systems we realized there is no obvious way for an input to specify what "version" of that dataset is being consumed. For example, we have a job that rolls up a variable number of what OpenLineage calls dataset versions. By specifying only that dataset, we can't represent the specific instances of it that are actually rolled up. We think that would be a very important part of the lineage graph.
Are there any thoughts on how to address specific dataset versions? Is this where custom input facets would come to play?
Furthermore, based on the spec, it appears that events can provide dataset facets for both inputs and outputs and this seems to open the door to race conditions in which two runs concurrently create dataset versions of a dataset. Is this where the eventTime field is supposed to be used?
*Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.
*Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.
*Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model
Hi everyone! I'm exploring what existing, open-source integrations are available, specifically for Spark, Airflow, and Trino (PrestoSQL). My team is looking both to use and contribute to these integrations. I'm aware of the integrations in the Marquez repo:
• Spark: https://github.com/MarquezProject/marquez/tree/main/integrations/spark
• Airflow: https://github.com/MarquezProject/marquez/tree/main/integrations/airflow
Are there other efforts I should be aware of, whether for these two or for Trino? Thanks for any information!
*Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?
*Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage
*Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.
Hey all! Right now I am working on getting OpenLineage integrated with some microservices here at Northwestern Mutual and was looking for some advice. The current service I am trying to integrate it with moves files from one AWS S3 bucket to another, so I was hoping to track that movement with OpenLineage. However, by my understanding, the inputs that would be passed along in a runEvent are meant to be datasets that have a schema and other properties, but I wanted to have that input represent the file being moved. Is this a proper usage of OpenLineage? Or is this a use case that is still being developed? Any and all help is appreciated!
*Thread Reply:* This is a proper usage. The schema is optional if it's not available.
*Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket
*Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)
*Thread Reply:* for reference: getting the URLs for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java
*Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java
*Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87
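Putting those pieces together, a rough sketch of how the folder-to-folder move could be expressed in a runEvent (bucket and folder names are hypothetical):
```
# Illustrative fragment: the job reads a folder in one bucket and writes a folder in another.
file_move_event_fragment = {
    "job": {"namespace": "my-services", "name": "s3_file_mover"},
    "inputs": [{"namespace": "s3://source-bucket", "name": "incoming/2021-06-01"}],
    "outputs": [{"namespace": "s3://target-bucket", "name": "processed/2021-06-01"}],
}
```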
*Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!
*Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.
Hello all. I was reading through the OpenLineage documentation on GitHub and noticed a very minor typo (an instance where "and" should have been "an"). I was just about to create a PR for it but wanted to check with someone to see if that would be something that the team is interested in.
Thanks for the tool, I'm looking forward to learning more about it.
*Thread Reply:* Thank you! Please do fix typos, I'll approve your PR.
*Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47
*Thread Reply:* Once I fixed the ones I saw, I figured "Why not just run it through a spell checker just in case..." and found a few additional ones.
For your enjoyment, @Julien Le Dem was on the Data Engineering Podcast talking about OpenLineage!
https://www.dataengineeringpodcast.com/openlineage-data-lineage-specification-episode-187/
Also happened yesterday: OpenLineage being accepted by the LFAI&Data.
I have created a channel to discuss <#C022MMLU31B|user-generated-metadata> since this came up in a few discussions.
hey guys, does anyone have any sample OpenLineage schemas for S3 please? Potentially including facets for attributes in a parquet file? That would help heaps, thanks. I am trying to slowly bring in a common metadata interface and this will help shape some of the conversations 🙂 with a move to marquez/datahub et al over time
*Thread Reply:* We currently don't have S3 (or distributed filesystem specific) facets at the moment, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂
*Thread Reply:* Also, happy to answer any Marquez-specific questions, @Jonathon Mitchal, when you're thinking of making the move. Marquez supports OpenLineage out of the box 🙂
*Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3
*Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes
*Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the json would look like
*Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
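As a starting point for such a proposal, a hypothetical Parquet metadata facet could carry something like the sketch below (field names are made up, loosely mirroring what parquet.thrift exposes):
```
# Hypothetical facet -- none of these names are in the spec; shown only to seed discussion.
parquet_metadata_facet = {
    "parquetMetadata": {
        "numRows": 123456,
        "numRowGroups": 4,
        "createdBy": "parquet-mr version 1.12.0",
        "columns": [
            {"name": "trip_id", "type": "INT64", "compression": "SNAPPY"},
            {"name": "driver", "type": "BYTE_ARRAY", "compression": "SNAPPY"},
        ],
    }
}
```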
*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on
hi all! just wanted to introduce myself, I'm the Head of Data at Hightouch.io, we build reverse etl pipelines from the warehouse into various destinations. I've been following OpenLineage for a while now and thought it would be nice to build and expose our runs via the standard and potentially save that back to the warehouse for analysis/alerting. Really interesting concept, looking forward to playing around with it
*Thread Reply:* Welcome! Let us know if you have any questions
Hi all! I have a noob question. As I understand it, one of the main purposes of OpenLineage is to avoid runaway proliferation of bespoke connectors for each data lineage/cataloging/provenance tool to each data source/job scheduler/query engine etc. as illustrated in the problem diagram from the main repo below.
My understanding is that instead, things push to OpenLineage which provides pollable endpoints for metadata tools.
Iâm looking at Amundsen, and it seems to have bespoke connectors, but these are pull-based - I donât need to instrument my data resources to push to Amundsen, I just need to configure Amundsen to poll my data resources (e.g. the Postgres metadata extractor here).
Can OpenLineage do something similar where I can just point it at something to extract metadata from it, rather than instrumenting that thing to push metadata to OpenLineage? If not, Iâm wondering why?
Is it the case that Open Lineage defines the general framework but doesnât actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push?
*Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn't actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push? Yes, at its core OpenLineage just enforces the format of the event. We also aim to provide clients - REST, later Kafka, etc. - and some reference implementations, which are now in the Marquez repo. https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/doc/Scope.png
There are several differences between push and poll models. The most important one is that with a push model, the latency between your job and emitting OpenLineage events is very low. With some systems, with an internal, push-based model you have more runtime metadata available than when looking from the outside. Another one is that a naive poll implementation would need to "rebuild the world" on each change. There are also disadvantages, such as that it's usually easier to write a plugin that extracts data from outside the system than to hook into its internals.
Integration with Amundsen specifically is planned. Although, right now it seems to me that the way to do it is to bypass the databuilder framework and push directly to the underlying database, such as Neo4j, or make Marquez the backend for the Metadata Service: https://raw.githubusercontent.com/amundsen-io/amundsen/master/docs/img/Amundsen_Architecture.png
*Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!
*Thread Reply:* Similar to what you say about push vs pull, I found DataHub's comment to be interesting yesterday: > Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.
*Thread Reply:* yes. You can also "pull-to-push" for things that don't push.
*Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and go directly to neo4j? By design databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4 and neo4j is just one of them.
*Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.
So, how would you see using it? Just to proxy the events to a concrete search and metadata backend?
I'm definitely not an Amundsen expert, so feel free to correct me if I'm getting it wrong.
*Thread Reply:* @Mariusz GĂłrski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86
*Thread Reply:* thanks Julien! will take a look
@here Hello, my name is Kedar Rajwade. I happened to come across the OpenLineage project and it looks quite interesting. Is there some kind of getting started guide that I can follow? Also, are there any weekly/bi-weekly calls that I can attend to know the current/future plans?
*Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md
*Thread Reply:* We're starting a monthly call, I will publish more details here
*Thread Reply:* Do you have a specific use case in mind?
The first instance of the OpenLineage Monthly meeting is tomorrow June 9 at 9am PT: https://calendar.google.com/event?action=TEMPLATE&tmeid=MDRubzk0cXAwZzA4bXRmY24yZjBkdTZzbDNfMjAyMTA2MDlUMTYwMDAwWiBqdWxpZW5AZGF0YWtpbi5jb20&tmsrc=julien%40datakin.com&scp=ALL|https://calendar.google.com/event?action=TEMPLATE&tmeid=MDRubzk0cXAwZzA4bXRmY24yZjBkdT[…]qdWxpZW5AZGF0YWtpbi5jb20&tmsrc=julien%40datakin.com&scp=ALL
*Thread Reply:* Hey @Julien Le Dem, I can't add a link to my calendar… Can you send an invite?
*Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite
*Thread Reply:* You can find the invitation on the tsc mailing list: https://lists.lfaidata.foundation/g/openlineage-tsc/topic/invitation_openlineage/83423919?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,83423919
*Thread Reply:* @Julien Le Dem Can't access the calendar.
*Thread Reply:* Can you please share the meeting details
*Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?
*Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive
Hi! Are there some 'close-to-real' sample events available to build off and compare to? I'd like to make sure what I'm outputting makes sense but it's hard when only comparing to very synthetic data.
*Thread Reply:* We've recently worked on a getting started guide for OpenLineage that we'd like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events
*Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.
{
  eventType: 'START',
  eventTime: '2021-06-09T08:45:00.395+00:00',
  run: { runId: '2821819' },
  job: {
    namespace: 'hightouch://my-workspace',
    name: 'hightouch://my-workspace/sync/123'
  },
  inputs: [
    {
      namespace: 'snowflake://abc1234',
      name: 'snowflake://abc1234/my_source_table'
    }
  ],
  outputs: [
    {
      namespace: 'salesforce://mysf_instance.salesforce.com',
      name: 'accounts'
    }
  ],
  producer: 'hightouch-event-producer-v.0.0.1'
}
{
  eventType: 'COMPLETE',
  eventTime: '2021-06-09T08:45:30.519+00:00',
  run: { runId: '2821819' },
  job: {
    namespace: 'hightouch://my-workspace',
    name: 'hightouch://my-workspace/sync/123'
  },
  inputs: [
    {
      namespace: 'snowflake://abc1234',
      name: 'snowflake://abc1234/my_source_table'
    }
  ],
  outputs: [
    {
      namespace: 'salesforce://mysf_instance.salesforce.com',
      name: 'accounts'
    }
  ],
  producer: 'hightouch-event-producer-v.0.0.1'
}
*Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.
For example, if the data goes from S3 -> Snowflake via Airflow and then from Snowflake -> Salesforce via Hightouch, this would mean both Airflow and Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage?
*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid, and I only have a few suggestions - for example, you don't need to repeat the input datasets on the COMPLETE event, as the input datasets have already been associated with the run ID.
*Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂, see http://openlineage.io/getting-started
*Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? Yes, the dataset and the namespace that it was registered under would have to be the same to properly build the lineage graph. We're working on defining unique dataset names and have made some good progress in this area. I'd suggest reviewing the OL naming conventions if you haven't already: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂
Hey everyone! I've been running into a minor OpenLineage issue and I was curious if anyone had any advice. According to the OpenLineage spec, it's suggested that for a dataset coming from S3, its namespace be in the form s3://<bucket>. We have implemented our code to do so, and RunEvents are published without issue, but when trying to retrieve the information of this RunEvent (like the job), I am unable to retrieve it based on namespace from both /api/v1/namespaces/s3%3A%2F%2F<bucket name> (encoding since : and / are special characters in a URL) and the beta endpoint /api/v1-beta/lineage?nodeId=<dataset>:<namespace>:<name>, and instead get a 400 error with an "Ambiguous Segment in URI" message.
Any and all advice would be super helpful! Thank you so much!
*Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues
*Thread Reply:* Thank you! Will do.
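For reference, the percent-encoding itself looks like this in Python (a minimal sketch; since the encoded form still triggers the 400, the Marquez issue is the right place to follow up):
```
from urllib.parse import quote

namespace = "s3://my-bucket"
encoded = quote(namespace, safe="")   # 's3%3A%2F%2Fmy-bucket'
url = f"http://localhost:5000/api/v1/namespaces/{encoded}"
print(url)
```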
I have opened a proposal for versioning and publishing the spec: https://github.com/OpenLineage/OpenLineage/issues/63
We have a nice OpenLineage website now. https://openlineage.io/ Thank you to contributors: @Ross Turk @Willy Lulciuc @Michael Collado!
Hi everyone! I'm trying to run a Spark job with OpenLineage and Marquez... but I'm getting some errors
*Thread Reply:* Here is the error...
21/06/20 11:02:56 WARN ArgumentParser: missing jobs in [, api, v1, namespaces, spark_integration] at 5
21/06/20 11:02:56 WARN ArgumentParser: missing runs in [, api, v1, namespaces, spark_integration] at 7
21/06/20 11:03:01 ERROR AsyncEventQueue: Listener SparkListener threw an exception
java.lang.NullPointerException
at marquez.spark.agent.SparkListener.onJobEnd(SparkListener.java:165)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
*Thread Reply:* Here is my code ...
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .master('local[1]') \
    .config('spark.jars.packages', 'io.github.marquezproject:marquez_spark:0.15.2') \
    .config('spark.extraListeners', 'marquez.spark.agent.SparkListener') \
    .config('openlineage.url', 'http://localhost:5000/api/v1/namespaces/spark_integration/') \
    .config('openlineage.namespace', 'spark_integration') \
    .getOrCreate()

spark.sparkContext._jsc.hadoopConfiguration().set('mapreduce.fileoutputcommitter.marksuccessfuljobs', 'false')
spark.sparkContext._jsc.hadoopConfiguration().set('parquet.summary.metadata.level', 'NONE')

df_source_trip = spark.read \
    .option('inferSchema', True) \
    .option('header', True) \
    .option('delimiter', '|') \
    .csv('/Users/bcanal/Workspace/poc-marquez/pocspark/resources/data/source/trip.csv') \
    .createOrReplaceTempView('source_trip')

df_drivers = spark.table('source_trip') \
    .select('driver') \
    .distinct() \
    .withColumn('driver_name', lit('Bruno')) \
    .withColumnRenamed('driver', 'driver_id') \
    .createOrReplaceTempView('source_driver')

df = spark.sql(
    """
    SELECT d.*, t.*
    FROM source_trip t, source_driver d
    WHERE t.driver = d.driver_id
    """
)

df.coalesce(1) \
    .drop('driver_id') \
    .write.mode('overwrite') \
    .option('path', '/Users/bcanal/Workspace/poc-marquez/pocspark/resources/data/target') \
    .saveAsTable('trip')
```
*Thread Reply:* After this execution, I can see just the source from the first dataframe, df_source_trip...
*Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job
*Thread Reply:* I'm running Spark locally on my laptop and I followed the Marquez getting started guide to bring it up
*Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add spark.sparkContext.setLogLevel('info') to the setup code, everything works reliably. It also works if you remove the master('local[1]') - at least when running in a notebook
I need to implement export functionality for my data lineage project.
As part of this I need to convert the information fetched from the graph db (neo4j) to CSV format and send it in the response.
Can someone please direct me to the CSV format of OpenLineage data
*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the JSON Schema format, and it's mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that's determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that's mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them
*Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. Depending on the exported CSV, I would translate the CSV to an OL event, see https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
*Thread Reply:* When you say "send in response", who would be the consumer of the lineage metadata exported for the graph db?
*Thread Reply:* so far what I understood about my requirement is that: 1. my service will receive OL events
*Thread Reply:* 2. store them in a graph db (neo4j)
*Thread Reply:* 3. this lineage information will be displayed in a UI, based on the request.
So my question here is that I have never seen what that CSV report looks like and how do I achieve that? When I asked my team how the CSV should look, they directed me to your website.
*Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there's also avro, parquet, etc). The JSON Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I'm not sure how that would be structured. Or at least, it's something we haven't previously discussed
I am very new to this... sorry for any silly questions
*Thread Reply:* There are no silly questions! 🙂
Hello, I have read every topic and listened to 4 talks and the podcast episode about OpenLineage and Marquez. Given my basic understanding of the data engineering field, I have a couple of questions which I did not understand: 1- What are events and facets and what is their purpose? 2- Can I implement the OpenLineage API in any software, or does the software need to be integrated with the OpenLineage API? 3- Can I say that OpenLineage is about observability and Marquez is about collecting and storing the metadata? Thank you all for being cooperative.
*Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:
> What are events and facets and what is their purpose? An OpenLineage event is used to capture the lineage metadata at a point in time for a given run in execution. That is, the run's state transition, the inputs and outputs consumed/produced, and the job associated with the run are part of the event. The metadata defined in the event can then be consumed by an HTTP backend (as well as other transport layers). Marquez is an HTTP backend implementation that consumes OL events via a REST API call. The OL core model only defines the metadata that should be captured in the context of a run, while the processing of the event is up to the backend implementation consuming the event (think consumer / producer model here). For Marquez, the end-to-end lineage metadata is stored for pipelines (composed of multiple jobs) with built-in metadata versioning support. Now, for the second part of your question: the OL core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. I'd recommend checking out the getting started guide for OL 🙂
> Can I implement the OpenLineage API to any software? or does the software needs to be integrated with the OpenLineage API? Do you mean HTTP vs other protocols? Currently, OL defines an API spec for HTTP backends, that Marquez has adopted to ingest OL events. But there are also plans to support Kafka and many others.
> Can I say that OpenLineage is about observability and Marquez is about collecting and storing the metadata? > Thank you all for being cooperative. Yep! OL defines the metadata to collect for running jobs / pipelines that can later be used for root cause analysis / troubleshooting failing jobs, while Marquez is a metadata service that implements the OL standard to both consume and store lineage metadata while also exposing a REST API to query dataset, job and run metadata.
Hi OpenLineage team! Has anyone got this working on databricks yet? I've been working on this for a few days and can't get it to register lineage. I've attached my notebook in this thread.
silly question - does the jar file need be on the cluster? Which versions of spark does OpenLineage support?
*Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800
*Thread Reply:* In your first cell, you have
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark.sparkContext.setLogLevel('info')
```
Unfortunately, the reference to sparkContext in the third line forces the initialization of the SparkContext, so that in the next cell your new configuration is ignored. In pyspark, you must initialize your SparkSession before any references to the SparkContext. It works if you remove the setLogLevel call from the first cell and make your 2nd cell
```python
spark = SparkSession.builder \
    .config('spark.jars.packages', 'io.github.marquezproject:marquez_spark:0.15.2') \
    .config('spark.extraListeners', 'marquez.spark.agent.SparkListener') \
    .config('openlineage.url', 'https://domain.com') \
    .config('openlineage.namespace', 'my-namespace') \
    .getOrCreate()
spark.sparkContext.setLogLevel('info')
```
How would one capture lineage for job that's processing streaming data? Is that in scope for OpenLineage?
*Thread Reply:* It's absolutely in scope! We've primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you'll find Flink and Beam on our list of future integrations.
*Thread Reply:* Is there a streaming framework you'd like to see added to our roadmap?
*Thread Reply:* Welcome, @mohamed chorfa! Let us know if you have any questions!
*Thread Reply:* Really looking forward to following the evolution of the specification from RawData to the ML-Model
Hello OpenLineage community, We have been working on fleshing out the OpenLineage roadmap. See on github on the currently prioritized effort: https://github.com/OpenLineage/OpenLineage/projects Please add your feedback to the roadmap by either commenting on the github issues or opening new issues.
In particular, I have opened an issue to finalize our mission statement: https://github.com/OpenLineage/OpenLineage/issues/84
*Thread Reply:* Based on community feedback, the new proposed mission statement: "to enable the industry at-large to collect real-time lineage metadata consistently across complex ecosystems, creating a deeper understanding of how data is produced and used"
I have updated the proposal for the spec versioning: https://github.com/OpenLineage/OpenLineage/issues/63
Hi all. I'm trying to get my bearings on openlineage. Love the concept. In our data transformation pipelines, output datasets are explicitly versioned (we have an incrementing snapshot id). Our storage layer (deltalake) allows us to also ingest 'older' versions of the same dataset, etc. If I understand it correctly I would have to add some inputFacets and outputFacets to run to store the actual version being referenced. Is that something that is currently available, or on the roadmap, or is it something I could extend myself?
*Thread Reply:* It is on the roadmap and there's a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation
*Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35
TL;DR: our database supports time-travel, and runs can be set up to use a specific point-in-time of an input. How do we make sure to keep that information within openlineage
Hi, on a subject of spark integrations - I know that there is spark-marquez but was curious did you also consider https://github.com/AbsaOSS/spline-spark-agent ? It seems like this and spark-marquez are doing similar thing and maybe it would make sense to add openlineage support to spline spark agent?
*Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski
*Thread Reply:* @Michael Collado
The OpenLineage Technical Steering Committee meetings are monthly on the second Wednesday, 9:00am to 10:00am US Pacific, and the link to join the meeting is https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 The next meeting is this Wednesday. All are welcome.
• Agenda:
  ◦ Finalize the OpenLineage Mission Statement
  ◦ Review OpenLineage 0.1 scope
  ◦ Roadmap
  ◦ Open discussion
  ◦ Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
Notes are posted here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite
Hello, is it possible to track lineage on column level? For example for SQL like this:
CREATE TABLE T2 AS SELECT c1,c2 FROM T1;
I would like to record this lineage:
T1.C1 -- job1 --> T2.C1
T1.C2 -- job1 --> T2.C2
Would that be possible to record in OL format?
(the important thing for me is to be able to tell that T1.C1 has no effect on T2.C2)
I have updated the notes and added the link to the recording of the meeting this morning: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63
*Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84
*Thread Reply:* for this one, please give explicit approval in the ticket
*Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^
*Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit
Hi Everyone. I am a PMC member and committer of Apache Airflow. I watched the talk at the summit https://airflowsummit.org/sessions/2021/data-lineage-with-apache-airflow-using-openlineage/ and thought I might help (after the Summit is gone) with making OpenLineage/Marquez more seamlessly integrated in Airflow
*Thread Reply:* The demo in this does not really use the openlineage spec does it?
Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec?
*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, such that any logic in the appropriate language can be described? Am I misinterpreting the intention of SQLJobFacet - is it to capture the logic that runs for a job?
*Thread Reply:* > The demo in this does not really use the openlineage spec does it?
@Samia Rahman In our Airflow talk, the demo used the marquez-airflow lib that sends OpenLineage events to Marquez
*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec?
Yes, Marquez ingests OpenLineage events that conform to the spec.
Hi all, does OpenLineage intend on creating lineage off of query logs?
From what I have read, there are a number of supported integrations but none that cater to regular SQL based ETL. Is this on the OpenLineage roadmap?
*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, confluent has some great blog posts on Change Data Capture:
• https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
• https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
*Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:
*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.
*Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/
*Thread Reply:* @Kenton (swiple.io) I haven't heard of sqlflow, but it does look promising. It's not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It'd be great to start the discussion if you'd like to drive this feature and help prioritize it
The OpenLineage implementations for the Airflow and Spark integrations currently live in the Marquez repo. My understanding from the OpenLineage scope is that integration implementations are in the scope of OpenLineage - are the Spark integrations going to be moved to OpenLineage?
@Samia Rahman Yes, that is the plan. For details you can see https://github.com/OpenLineage/OpenLineage/issues/73
I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, such that any logic in the appropriate language can be described - whether it is Scala or Python code that runs in the job and processes streaming or batch data? Am I misinterpreting the intention of SQLJobFacet - is it to capture the logic that runs for a job?
*Thread Reply:* Hey, @Samia Rahman. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it's not intended to capture the code being executed, but rather just the SQL if it's present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.
*Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control
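For illustration only (not from the thread), here is a rough sketch of how those two facets could sit side by side on a job in an event; the values are made up and the required _producer/_schemaURL fields are omitted for brevity:
```python
# Hypothetical job payload: "sql" carries the query text for display,
# "sourceCodeLocation" links back to the code in version control.
job = {
    "namespace": "my-scheduler",
    "name": "daily_orders_summary",
    "facets": {
        "sql": {"query": "SELECT order_id, amount FROM orders"},
        "sourceCodeLocation": {
            "type": "git",
            "url": "https://github.com/my-org/pipelines/blob/main/jobs/orders.sql",
        },
    },
}
```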
The process started a few months back when the LF AI & Data voted to accept OpenLineage as part of the foundation. It is now official, OpenLineage joined the LF AI & Data Foundation. https://lfaidata.foundation/blog/2021/07/22/openlineage-joins-lf-ai-data-as-new-sandbox-project/
Hi, I am trying to create lineage between two datasets. Following the Spec, I can see the syntax for declaring the input and output datasets, and for all creating the associated Job (which I take to be the process in the middle joining the two datasets together). What I can't see is where in the specification to relate the job to the inputs and outputs. Do you have an example of this?
*Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143
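To make that concrete, here is a hedged sketch (names, namespaces and the Marquez endpoint path are assumptions, not taken from the thread) of a single START event in which the job and its input/output datasets travel together, so the backend can store the relationship:
```python
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "join_customers_orders"},
    "inputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "public.customers"},
        {"namespace": "postgres://db.example.com:5432", "name": "public.orders"},
    ],
    "outputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "public.customer_orders"},
    ],
    "producer": "https://github.com/my-org/pipelines",
}

# Marquez exposes an OpenLineage ingestion endpoint; the path below is an assumption.
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```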
the OutputStatistics facet PR is updated based on your comments @Michael Collado https://github.com/OpenLineage/OpenLineage/pull/114
*Thread Reply:* /|~~~
///|
/////|
///////|
/////////|
\==========|===/
~~~~~~~~~~~~~~~~~~~~~
I have updated the DataQuality metrics proposal and the corresponding PR: https://github.com/OpenLineage/OpenLineage/issues/101 https://github.com/OpenLineage/OpenLineage/pull/115
Guys, I've merged circleCI publish snapshot PR
Snapshots can be found below: https://datakin.jfrog.io/artifactory/maven-public-libs-snapshot-local/io/openlineage/openlineage-java/0.0.1-SNAPSHOT/ openlineage-java-0.0.1-20210804.142910-6.jar https://datakin.jfrog.io/artifactory/maven-public-libs-snapshot-local/io/openlineage/openlineage-spark/0.1.0-SNAPSHOT/ openlineage-spark-0.1.0-20210804.143452-5.jar
Build on main passed (edited)
I added a mechanism to enforce spec versioning per: https://github.com/OpenLineage/OpenLineage/issues/63 https://github.com/OpenLineage/OpenLineage/pull/140
Hi all, at Booking.com we're using Spline to extract granular lineage information from spark jobs, to be able to trace lineage on column level and the operations in between. We wrote a custom python parser to create a graph-like structure that is sent into arangodb. But tbh, the process is far from stable and is not able to quickly answer questions like "which root input columns are used to construct column x".
My impression with openlineage thus far is that it's focusing on less granular, table input-output information. Is anyone here trying to accomplish something similar on a column level?
*Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.
*Thread Reply:* It would be great to have the option to produce the Spline lineage info as OpenLineage. To capture column-level lineage, you would want to add a ColumnLineage facet to the output dataset facets, which is something that is needed in the spec. Here is a proposal, please chime in: https://github.com/OpenLineage/OpenLineage/issues/148 Is this something you would be interested in doing?
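Purely as a sketch of what that proposal (issue #148) is asking for - such a facet did not exist in the spec at this point, so every name below is hypothetical - a column-level facet on the output dataset might look something like:
```python
# Hypothetical shape only: each output column lists the input columns it was derived from,
# which is enough to answer questions like "T1.C1 has no effect on T2.C2".
output_dataset = {
    "namespace": "warehouse://prod",
    "name": "db.schema.T2",
    "facets": {
        "columnLineage": {
            "fields": {
                "C1": {"inputFields": [{"namespace": "warehouse://prod", "name": "db.schema.T1", "field": "C1"}]},
                "C2": {"inputFields": [{"namespace": "warehouse://prod", "name": "db.schema.T1", "field": "C2"}]},
            }
        }
    },
}
```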
*Thread Reply:* Regarding the difference in implementation: the OpenLineage Spark integration focuses on extracting metadata and exposing it as a standard representation (the OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We'd be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.
*Thread Reply:* Does anyone know if the Spline developers are in this slack group?
*Thread Reply:* @Luke Smith how have things progressed on your side the past year?
I have opened an issue to track the facet versioning discussion: https://github.com/OpenLineage/OpenLineage/issues/153
I have updated the agenda for the OpenLineage monthly TSC meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting (meeting information below for reference; you can also DM me your email to get added to a google calendar invite)
The OpenLineage Technical Steering Committee meetings are Monthly on the Second Wednesday 9:00am to 10:00am US Pacific and the link to join the meeting is https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome.
Aug 11th 2021
• Agenda:
  ◦ Coming in OpenLineage 0.1
    ▪ OpenLineage spec versioning
    ▪ Clients
  ◦ Marquez integrations imported in OpenLineage
    ▪ Apache Airflow:
      • BigQuery
      • Postgres
      • Snowflake
      • Redshift
      • Great Expectations
    ▪ Apache Spark
    ▪ dbt
  ◦ OpenLineage 0.2 scope discussion
    ▪ Facet versioning mechanism
    ▪ OpenLineage Proxy Backend
*Thread Reply:* Just a reminder that this is in 2 hours
*Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* The recording of the meeting is linked there: https://us02web.zoom.us/rec/share/2k4O-Rjmmd5TYXzT-pEQsbYXt6o4V6SnS6Vi7a27BPve9aoMmjm-bP8UzBBzsFzg.uY1je-PyT4qTgYLZ?startTime=1628697944000 • Passcode: =RBUj01C
Hi guys, great discussion today. Something we are particularly interested on is the integration with Airflow 2. I've been searching into Marquez and Openlineage repos and I couldn't find a clear answer on the status of that. I did some work locally to update the marquez-airflow package but I would like to know if someone else is working on this and maybe we could give it some help too.
*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
*Thread Reply:* Thank you Maciej. I'll take a look
I have migrated the Marquez issues related to OpenLineage integrations to the OpenLineage repo
And OpenLineage 0.1.0 is out ! https://github.com/OpenLineage/OpenLineage/releases/tag/0.1.0
PR ready for review
Anyone have experience parsing spark's logical plan to generate column-level lineage and DAGs with more human readable operations? I assume I could recreate a graph like the one below using the spark.logicalPlan facet. The analysts writing the SQL / spark queries aren't familiar with ShuffledRowRDD, MapPartitionsRDD, etc... It'd be better if I could convert this plan into spark SQL (or capture spark SQL as a facet at runtime).
*Thread Reply:* The logicalPlan facet currently returns the logical plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL - it's worth looking into the API and trying to find a way to generate that
Nice to e-meet you. I want to use the OpenLineage integration for Spark in my Azure Databricks clusters, but I am having problems with the configuration of the listener in the cluster. I was wondering if you could help me - if you know of any tutorial for the integration of Spark with Azure Databricks, or a more specific guide for this scenario, I would really appreciate it.
*Thread Reply:* Hey, @Erick Navarro! Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)
*Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we:
1. use a cluster init script (ours lives under /dbfs/databricks/init/lineage) that puts the OpenLineage jar in /mnt/driver-daemon/jars
2. add a .conf file in /databricks/driver/conf (we use open-lineage.conf)
The .conf file will be read by the driver on initialization. It should follow this format (lineage_host_url should point to your API):
[driver] {
"spark.jars" = "/mnt/driver-daemon/jars/openlineage-spark-0.1-SNAPSHOT.jar"
"spark.extraListeners" = "com.databricks.backend.daemon.driver.DBCEventLoggingListener,openlineage.spark.agent.OpenLineageSparkListener"
"spark.openlineage.url" = "$lineage_host_url"
}
Your cluster must be configured to call the init script (enabling lineage for the entire cluster). OL is not friendly to notebook-level init as far as we can tell. @Willy Lulciuc -- I have some utils and init script templates that simplify this process. May be worth adding them to the OL repo along with a readme.
*Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process, and I think that'd be great to document. @Michael Collado what are your thoughts?
*Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener
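As a sketch of that pattern with openlineage-spark (the listener class and config keys follow the .conf shown earlier in this channel; the package version, URL and namespace are placeholders):
```python
from pyspark.sql import SparkSession

# Configure the OpenLineage listener before the session is created, as described above.
spark = (
    SparkSession.builder
    .appName("ol_example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.url", "http://localhost:5000/api/v1/namespaces/spark_integration/")
    .getOrCreate()
)
```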
*Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured - configuring lineage within the SparkSession configuration doesn't seem possible - https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster
*Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster-level config (e.g. spark.spline.lineageDispatcher.http.producer.url) and install the library on the cluster, but then enable tracking at a notebook level with:
```scala
%scala
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()
```
In OL, it would be nice to install and configure OL at a cluster level, but to enable it at a notebook level. This way, users could control whether all notebooks running on a cluster emit lineage or just those with lineage explicitly enabled.
*Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level
*Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.
*Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have
*Thread Reply:* Is this error thrown during init or job execution?
*Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar
*Thread Reply:* During the execution of the job @Luke Smith, thank you @Michael Collado, that was exactly the scenario, the job that I executed was empty, now the cluster is running ok, I don't have errors, I have run some jobs successfully, but I don't see any information in my datakin explorer
*Thread Reply:* Awesome! Great to hear youâre up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?
*Thread Reply:* I found the solution here: https://docs.microsoft.com/en-us/answers/questions/170730/handshake-fails-trying-to-connect-from-azure-datab.html
*Thread Reply:* It works now! đ
*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that's something you'd like to contribute!
*Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you.
Hello everyone! I am currently evaluating OpenLineage and am finding it very interesting as Prefect is in the list of integrations. However, I am not seeing any documentation or code for this. How far are you from supporting Prefect?
*Thread Reply:* Hey! If you mean this picture, it provides a concept of how OpenLineage works, not the current state of the integration. We don't have Prefect support yet; however, it's on our roadmap.
*Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.
What's the timeline to support spark 3.0 within OL? One breaking change we've found is within DatasetSourceVisitor.java -- the DataSourceV2 is deprecated in spark 3.0. There may be other issues we haven't found yet. Is there a good feel for the scope of work required to make OL spark 3.0 compatible?
*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run tests for both spark 2.4 and spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we'd want to add test cases for.
*Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.
*Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.
*Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?
*Thread Reply:* Turns out this has more to do with a how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.
*Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here. When we run
insert into . . . values . . .
we get an error related to DataSourceV2:
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.source()Lorg/apache/spark/sql/sources/v2/DataSourceV2;
So there are two stacked issues related to spark 3 on Databricks with delta IO, not just one. Hope this clears things up.
*Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?
*Thread Reply:* Yes, that's my read of what I'm getting from others on the team.
*Thread Reply:* For the first issue - can you give some details about the target of the INSERT INTO...? Is it a data source defined in Databricks? A Hive table? A view on GCS?
*Thread Reply:* oh, it's a Delta table?
*Thread Reply:* Yes, it's created via
CREATE TABLE . . . using DELTA location "/dbfs/mnt/ . . . "
I have opened a PR to fix some outdated language in the spec: https://github.com/OpenLineage/OpenLineage/pull/241 Thank you @Mandy Chessell for the feedback
The next OpenLineage monthly meeting is next week. Please chime in on this thread if you'd like something added to the agenda
*Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it's better to integrate on the orchestration level (airflow, luigi). Thoughts?
*Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.
*Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0
I have prepared slides for the OpenLineage meeting tomorrow morning: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0
*Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)
*Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* Good meeting today. @Julien Le Dem. Thanks
Hello, I was looking to get some lineage out for BQ in my Airflow DAGs and saw that the BQ extractor here - https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/bigquery_extractor.py#L47 - is using an operator that has been deprecated by Airflow - https://github.com/apache/airflow/blob/main/airflow/contrib/operators/bigquery_operator.py#L44 - and most of my DAGs are using the BigQueryExecuteQueryOperator mentioned there. I presume lineage extraction wouldn't work with this, and some work is needed to support both these operators with the same (or different) extractor. Is that correct or am I missing something?
*Thread Reply:* We're working on updating our integration to airflow 2. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
*Thread Reply:* Thanks @Maciej Obuchowski When is this expected to land in a release ?
*Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator
The PR to separate facets in their own file (and allowing versioning them independently) is now available: https://github.com/OpenLineage/OpenLineage/pull/118
Hi, new to the channel but I think OL is a great initiative. Currently we are focused on beam/spark/delta but are moving to beam/flink/iceberg, and I'm happy to help where I can.
Per the discussion last week, Ryan updated the metadata that would be available in Iceberg: https://github.com/OpenLineage/OpenLineage/issues/167#issuecomment-917237320
I have also created tickets for follow up discussions: (#269 and #270): https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
Hello. I find OpenLineage an interesting tool however can someone help me with integration?
I am trying to capture lineage from spark 3.1.1 but when executing i constantly get: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2.writer()Lorg/apache/spark/sql/sources/v2/writer/DataSourceWriter;
at openlineage.spark.agent.lifecycle.plan.DatasetSourceVisitor.findDatasetSource(DatasetSourceVisitor.java:57)
as if i would be using openlineage on wrong spark version (2.4) I have tried also spark jar from branch feature/itspark3. Is there any branch or release that works or can be tried with spark 3+?
*Thread Reply:* Hello Tomas. We are currently working on support for spark v3. Can you please raise an issue with stack trace, that would help us to track and solve it. We are currently adding integration tests. Next step would be fix changes in method signatures for v3 (that's what you have)
*Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272
I also tried to downgrade spark to 2.4.0 and retry with 0.2.2 but i also faced issue.. so my preferred way would be to push for spark 3.1.1 but depends a bit on when you plan to release version supporting it. As backup plan i would try spark 2.4.0 but this is blocking me also: https://github.com/OpenLineage/OpenLineage/issues/274
*Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847
*Thread Reply:* Can you try a newer version in the 2.4.x line, like 2.4.7?
*Thread Reply:* This might also be a spark 2.4 with scala 2.12 issue - I'd recommend 2.11 versions.
*Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:
*Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD java.lang.NoSuchFieldException: config$1 at java.base/java.lang.Class.getDeclaredField(Class.java:2411)
*Thread Reply:* i can also try to switch to 2.11 scala
*Thread Reply:* or do you have some recommended setup that works for sure?
*Thread Reply:* One more check - you're using Java 8 with this, right?
*Thread Reply:* This is what works for me:
-> % cat tools/spark-2.4/RELEASE
Spark 2.4.8 (git revision 4be4064) built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036
*Thread Reply:* spark-shell:
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
*Thread Reply:* data has been sent to marquez. coolio. however i noticed nullpointer being thrown: 21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:164)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
does openlineage capture streaming in spark? as this example is not showing me anything unless i replace readStream() with batch read() and writeStream() with write()
```java
SparkSession.Builder builder = SparkSession.builder();
SparkSession session = builder
    .appName("quantweave")
    .master("local[*]")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.url", "http://localhost:5000/api/v1/namespaces/spark_integration/")
    .getOrCreate();
Dataset<Row> df = session
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic1")
.option("startingOffsets", "earliest")
.load();
Dataset<Row> dff = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as("data");
dff
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "topic2")
.option("checkpointLocation", "/tmp/checkpoint")
    .start();
```
*Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.
*Thread Reply:* @Tomas Satka it would be great, if you can add an containerized integration test for kafka with your test case. You can take this as an example here
*Thread Reply:* Hi @Oleksandr Dvornik, I wrote a test for a simple read/write from a kafka topic using the kafka testcontainer. However I discovered a bug: when writing to a kafka topic I get java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.
• How would you like me to add the test? Fork openlineage and create a PR?
*Thread Reply:* • Shall I raise a bug for writing to kafka - that it should have only "topic" instead of "subscribe"?
*Thread Reply:* • Since I don't know the expected payload for the openlineage mock server, can somebody help me create it?
*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about kafka, cause we don't have that integration yet. About expected payload, as a first step, I would suggest to leave that test without assertion for now. Second step would be investigation (what we can get from that plan node). Third step - implementation and asserting a payload. Basically we parse spark optimized plan, and get as much information as we can for specific implementation. You can take a look at recent PR for HIVE. We visit root node and leaves to get output datasets and input datasets accordingly.
*Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279
There may not be an answer to these questions yet, but I'm curious about the plan for Tableau lineage.
• How will this integration be packaged and attached to Tableau instances?
  ◦ via Extensions API, REST API?
• What is the architecture?
https://github.com/OpenLineage/OpenLineage/issues/78
Hi everyone - Following up on my previous post on prefect. The technical integration does not seem very difficult, but I am wondering about how to structure the lineage logic. Is it the case that each prefect task should be mapped to a lineage job? If so, how do we connect the jobs together? Does there have to be a dataset between each job? I am using OpenLineage with Marquez, by the way.
*Thread Reply:* Hey Thomas!
Following what we do with Airflow, yes, I think that each task should be mapped to a job.
You don't need datasets between each task. They're necessary only where you consume and produce datasets - and it does not matter where in your job graph you've produced them.
To map tasks together, in Airflow we use the ParentRunFacet, and the same approach could be used here. In Prefect, I think using flow_run_id would work.
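A rough sketch of what that could look like for a Prefect task (dictionary form of the event; the facet shape follows the ParentRunFacet in the spec, and all identifiers below are placeholders):
```python
# Task-level START event whose run carries a "parent" facet pointing at the flow run,
# so all task runs belonging to one flow run can be tied together.
task_event = {
    "eventType": "START",
    "eventTime": "2021-09-20T10:00:00Z",
    "run": {
        "runId": "11111111-1111-1111-1111-111111111111",  # this task's run id
        "facets": {
            "parent": {
                "run": {"runId": "22222222-2222-2222-2222-222222222222"},  # Prefect flow_run_id
                "job": {"namespace": "my-prefect-instance", "name": "my_flow"},
            }
        },
    },
    "job": {"namespace": "my-prefect-instance", "name": "my_flow.load_orders"},
    "inputs": [],
    "outputs": [],
    "producer": "https://github.com/OpenLineage/OpenLineage",
}
```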
*Thread Reply:* this is very helpful, thank you
*Thread Reply:* what would be the namespace used in the Job definition of each task?
*Thread Reply:* In contrast to dataset namespaces - which we try to standardize, job namespaces should be provided by user, or operator of particular scheduler.
For example, it would be good if it helped you identify Prefect instance where the job was run.
*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor or via the OPENLINEAGE_NAMESPACE env variable.
*Thread Reply:* awesome, thank you!
*Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I'm also keen for a prefect integration. Let me know if I can help out at all
*Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81
*Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage
*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.
How are you dealing with Prefect-library tasks such as the included BigQuery-tasks and such? Is it necessary to create a DatasetTask for them to show up in the lineage graph?
*Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per-task basis
*Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach
*Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)
*Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?
*Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets
*Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task
*Thread Reply:* I see. Is this supported by the airflow-integration?
*Thread Reply:* The airflow code is here https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/bigquery_extractor.py
*Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)
*Thread Reply:* Interesting, I like how dynamic this is
Hi all, I have a clarification question about dataset namespaces. What's the difference between a dataset namespace (in the input/output) and a dataSource name (in the dataSource facet)?
The dbt integration appears to set those to the same value (e.g. snowflake://myprofile), however it seems that Marquez assumes the dataset namespace to be a more generic concept (similar to a nice user-provided name like the job namespace).
*Thread Reply:* Hey.
Generally, the dataSource name should be the namespace of a particular dataset.
In some cases, like Postgres, the dataSource facet is used to additionally provide connection strings, with info like the particular host and port that we're connected to.
In the case of Snowflake - or BigQuery, or S3, or other systems where we have only a "global" instance - the dataSource facet does not carry any other additional information.
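For illustration (values are placeholders, not from the thread), the contrast looks roughly like this:
```python
# Postgres: the dataSource facet can carry the concrete connection info for the instance.
postgres_input = {
    "namespace": "postgres://db.example.com:5432",
    "name": "public.orders",
    "facets": {
        "dataSource": {
            "name": "postgres://db.example.com:5432",
            "uri": "postgres://db.example.com:5432/food_delivery",
        }
    },
}

# Snowflake: a "global" service, so the facet adds little beyond the dataset namespace itself.
snowflake_input = {
    "namespace": "snowflake://my-account",
    "name": "ANALYTICS.PUBLIC.ORDERS",
    "facets": {
        "dataSource": {"name": "snowflake://my-account", "uri": "snowflake://my-account"}
    },
}
```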
*Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).
*Thread Reply:* @Willy Lulciuc what do you think?
*Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, snowflake://my-account1, snowflake://my-account2.
I think the new marquez UI with the nice namespace dropdown and job/dataset search is awesome, and I'd expect to be able to filter by job namespace everywhere, but how about being able to filter datasets by source (which would be populated by the OL dataset namespace) and not persist dataset namespaces in the global namespace table?
The dbt integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt) is pretty awesome but there are still a few improvements we could make.
Here are a few thoughts.
• In dbt-ol, if the configuration is wrong or missing we will fail silently. This one seems like a good first thing to fix by logging the error to stdout.
• We need to wait until the end to know if it worked at all. It would be nice if we checked the config at the beginning and displayed an error right away. Possibly by adding a parent job/run with a start event at the beginning and an end event at the end when all is done.
• While we are sending events at the end, the console will hang until it's done. It's not clear that progress is made. We could have a simple progress bar by printing a dot for every event sent. (ex: sending 10 OpenLineage events: .........)
• We could also write at the beginning that the OL events will be sent at the end so that the user knows what to expect.
What do you think? (@Maciej Obuchowski in particular, but anyone using dbt in general)
*Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?
*Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there
*Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone wouldn't need lineage events, they wouldn't use our wrapper.
*Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.
We could batch events if the network roundtrips are responsible for majority of the slowdown. However, we can't assume any particular environment.
Maybe just notifying about the progress is the best thing we can do right now.
*Thread Reply:* About second point, I want to add recognizing if we already have a parent run - for example, if running via airflow. If not, creating run for this purpose is a good idea.
*Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?
*Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container
*Thread Reply:* Makes sense, also, all clients could use that config
*Thread Reply:* if dbt could actually stream the events, that would be great.
*Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after the end of execution.
The split of facets in their own schemas is ready to be merged: https://github.com/OpenLineage/OpenLineage/pull/118
Hey @Julien Le Dem I'm going to start a thread here for any issues I run into trying to build a prefect integration
*Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284
*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response:
'{"code":400,"message":"Unable to process JSON"}'
The JSON I'm pushing:
{
"eventTime": "2021-09-16T04:00:28.343702",
"eventType": "START",
"inputs": {},
"job": {
"facets": {},
"name": "prefect.core.parameter.p",
"namespace": "default"
},
"producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.0.0/integration/prefect>",
"run": {
"facets": {},
"runId": "3bce33cb-9495-4c58-b326-6aac71634ace"
}
}
Does anything look obviously wrong here?
*Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search, essentially. I was running Marquez locally, so I probably could've enabled better logging as well. Aren't inputs and facets arrays?
*Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)
*Thread Reply:* Okay - I've got a simply working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py
*Thread Reply:* I might move this into a proper PR @Julien Le Dem
A question about DatasetType - is there a representation for a file-like type? For files stored in S3/FTP/NFS etc (assuming a fully resolvable url)
*Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.
*Thread Reply:* I've taken a look at your repo. Looks great so far!
One thing I've noticed: I don't think you need to use any stuff from Marquez to emit events. Its lineage ingestion API is deprecated - you can just use the openlineage-python client. If there's something you think is missing from it, feel free to write that here or open an issue.
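A minimal sketch of emitting an event with the openlineage-python client (assuming the client's OpenLineageClient / RunEvent / Run / Job classes; the URL and names below are placeholders):
```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez, or set OPENLINEAGE_URL

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="default", name="prefect.core.parameter.p"),
        producer="https://github.com/OpenLineage/OpenLineage/tree/main/integration/prefect",
        inputs=[],
        outputs=[],
    )
)
```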
*Thread Reply:* And would that be replaced by just some Input/Output notion, @Maciej Obuchowski?
*Thread Reply:* Oh yeah, I got a little confused by the single lineage endpoint - but I've realised how it all works now. I'm still using the marquez backend to view things, but I'll use the openlineage-client to talk to it
When trying to fix failing checks, I see integration-test-integration-airflow fail:
```
#!/bin/bash -eo pipefail
if [[ $GCLOUD_SERVICE_KEY,$GOOGLE_PROJECT_ID == "" ]]; then
  echo "No required environment variables to check; moving on"
else
  IFS="," read -ra PARAMS <<< "GCLOUD_SERVICE_KEY,GOOGLE_PROJECT_ID"

  for i in "${PARAMS[@]}"; do
    if [[ -z "${!i}" ]]; then
      echo "ERROR: Missing environment variable ${i}" >&2
      if [[ -n "" ]]; then
        echo "" >&2
      fi
      exit 1
    else
      echo "Yes, ${i} is defined!"
    fi
  done
fi

ERROR: Missing environment variable ${i}
Exited with code exit status 1
CircleCI received exit code 1
```
However I haven't touched airflow at all... can somebody help please?
*Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - everyone could create a malicious PR and dump the secrets
*Thread Reply:* So, they will fail and there's nothing to do from your side.
*Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs
*Thread Reply:* ah okie. good to know. and in build-integration-spark Could not resolve all artifacts. Is that also known issue? Or something from my side that i could fix?
*Thread Reply:* Looks like a gradle server problem?
> Could not get resource 'https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module'.
> Could not GET 'https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module'. Received status code 500 from server: Internal Server Error
*Thread Reply:* After retry, there's spotless error:
+········.orElse(Collections.emptyList()).stream()
*Thread Reply:* I think this is due to a mismatch between the behavior of spotless in Java 8 and Java 11+ - which you probably used
*Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?
*Thread Reply:* For spotless, you can just fix this one line. Though I don't guarantee that tests that run later will pass, so you might need Java 8 for later testing
*Thread Reply:* will somebody please review my PR? I already had to adjust it due to updates on the same test class
Hey team - I've opened https://github.com/OpenLineage/OpenLineage/pull/293 for a very WIP prefect integration
*Thread Reply:* @Thomas Fredriksen would love any feedback
*Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!
*Thread Reply:* Hey @Brad, it looks great!
I've seen you're using task_qualified_name to name datasets and I don't think it's the right way.
I'd take a look at naming conventions here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
Getting that right is key to making sure that lineage is properly tracked between systems - for example, if you use Prefect to schedule dbt runs or pyspark jobs, the unified naming makes sure that all those integrations properly refer to the same dataset.
*Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations.. I think things like dbt or pyspark are straight forward (we could add special handling for tasks like that) but what about regular transformation type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example
*Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.
Second, in Airflow we have a concept of Extractor where you can write specialized code to expose datasets. For example, for BigQuery we extract datasets from query plan. Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused. Also, this way allows to emit additional facets, some of which are really useful - like query statistics for BigQuery, and data quality tests for dbt.
Third, if we're talking about generalized tasks like FunctionTask or ShellTask, then I think the right way is to expose functionality to user to expose lineage themselves. I'm not sure how exactly that would look in Prefect.
*Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results
object in prefect (The Result object is the way in which data is serialized and persisted).
> Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused This definitely applies to prefect and the similar tasks exist in prefect and we should definitely leverage the common library in this case.
> Third, if we're talking about generalized tasks like FunctionTask or ShellTask, then I think the right way is to expose functionality to user to expose lineage themselves. I'm not sure how exactly that would look in Prefect. Yeah I agree with this. I'd like to make it as easy a possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.
> First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets. My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.
*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy a possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more. First version of an integration doesn't have to be perfect. in particular, not handling this use case would be okay, since it does not lock us into some particular way of doing it later.
> My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.
I'd think of two options first, before modeling it as a dataset: Won't the existence of an event be enough? After all, we'll still have it despite it not having any input and output datasets. If not, then wouldn't a custom run or job facet be a better fit?
*Thread Reply:* > Won't the existence of an event be enough? After all, we'll still have it despite it not having any input and output datasets.
Duh, yep you're right @Maciej Obuchowski, I'm overthinking this. I'm going to clean this up based on your comments
*Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?
*Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running
*Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week
*Thread Reply:* but I do plan to pick this up next week
*Thread Reply:* (the additional changes I mention above)
*Thread Reply:* looking forward to it đ let me know if you need any help!
*Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out
Is there a preferred academic citation for OpenLineage? I'm writing a paper on the provenance system in our machine learning library, and I'd like to cite OpenLineage as an example of future work on data lineage to integrate with.
*Thread Reply:* I think you can refer to https://openlineage.io/
We're starting to see the beginning of larger contributions (Spark streaming, prefect, ...) and I think we need to define a way to accept those contributions incrementally. If we take the example of Streaming (Spark streaming, Flink or Beam) support (but really this applies in general, sorry to pick on you Tomas, this is great!): The first Spark streaming PR ( https://github.com/OpenLineage/OpenLineage/pull/279 ) lays the groundwork for testing spark streaming but there's more work to have a full feature. I'm in favor of merging Spark streaming support into main once it's working end to end (possibly with partial input/output coverage). So I see 2 options:
Thank you @Ross Turk for this really useful article: https://openlineage.io/blog/dbt-with-marquez/?s=03 Is anyone aware of additional environments being supported by the dbt<->OpenLineage<->Marquez integration? I think only Snowflake and BigQuery are supported now. I am really interested in SQLServer or even Dremio (which could be great because it is capable of reading from multiple DBs).
Thank you
*Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467
The first step should be to add SQLServer or Dremio to the dataset naming schema here https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Thank you @Maciej Obuchowski, I tried to give it a try but without success yet. Not sure where I am supposed to add the sqlserver naming schema... If you have any documentation that I could read I would be glad =) Many thanks
*Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake
*Thread Reply:* Snowflake
See: Object Identifiers - Snowflake Documentation
Datasource hierarchy:
• account name
Naming hierarchy:
• Database: {database name} => unique across the account
• Schema: {schema name} => unique within the database
• Table: {table name} => unique within the schema
Identifier:
• Namespace: snowflake://{account name}
  ◦ Scheme = snowflake
  ◦ Authority = {account name}
• Name: {database}.{schema}.{table}
  ◦ URI = snowflake://{account name}/{database}.{schema}.{table}
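As a purely hypothetical sketch of what a SQL Server entry could look like if it followed the same pattern (the mssql scheme and hierarchy below are assumptions, not yet part of the spec):
```python
# Hypothetical naming rule for SQL Server, mirroring the Postgres/Snowflake conventions above.
def sqlserver_dataset_identifier(host: str, port: int, database: str, schema: str, table: str):
    """Return (namespace, name) for a SQL Server table under an assumed mssql:// convention."""
    namespace = f"mssql://{host}:{port}"   # scheme + authority (assumption)
    name = f"{database}.{schema}.{table}"  # unique within the server
    return namespace, name

# Example: ("mssql://db.example.com:1433", "sales.dbo.orders")
print(sqlserver_dataset_identifier("db.example.com", 1433, "sales", "dbo", "orders"))
```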
Hi all. I'm the Founder / CTO of a data discovery & transformation platform that captures very rich lineage information. We're interested in exposing / making our lineage data consumable via open standards, which is what led me to this project. A couple of questions:
A) Am I right in considering that's the goal of this project? B) Are you also considering provenance as well as lineage? C) What's a good starting point to understand the models we should be exposing our data in, to make it consumable?
*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)
Our platform performs automated enrichment and processing of data. In doing so, we often make calls to functions or out to other data services (such as APIs, or SELECTs against databases). We capture the inputs that pass to these, along with the outputs. (And, if the input is derived from other outputs, we capture the full chain, right back to the root).
That's the kind of stuff our customers are really interested in, and we feel like there's value in making it consumable.
*Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?
*Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.
We could of course model this situation, but that would capture for example schema of those parameters. Not their values.
*Thread Reply:* I think this might be better suited for https://opentelemetry.io/
*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.
Our customers use Vyne to fetch data from lots of different sources. Eg:
> "Whenever a trade is booked, calculate it's compliance against these regulations, to report to the regulators". or
> "Whenever a customer buys a $thing, capture the transaction data, client data, and account data, and store it in this table." Providing answers to those questions involves fetching and transforming data, before storing it, or outputting it. We capture all that data, on a per-attribute basis, so we can answer the question "how did we get this value?" That's the lineage information we want to publish.
*Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well
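To make that concrete, here is a minimal sketch of emitting such an event with the openlineage-python client. Namespace, job, and dataset names are placeholders, and the client API may differ slightly between versions:
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

client = OpenLineageClient(url="http://localhost:5000")
run_id = str(uuid4())

# START event for one run of a job reading one dataset and writing another;
# a matching COMPLETE event with the same runId would follow when it finishes.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="my-namespace", name="enrich_trades"),
    producer="https://example.com/my-pipeline",
    inputs=[Dataset(namespace="my-namespace", name="raw.trades")],
    outputs=[Dataset(namespace="my-namespace", name="curated.trades_enriched")],
))
```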
Hello. I am trying the openlineage-airflow integration with Marquez as the backend and have 3 questions.
*Thread Reply:* Hello @Drew Bittenbender!
For your first two questions:
• Yes, right now only the PostgresOperator is integrated. I learnt it the hard way ^_^. Spent hours trying with MySQL. There were attempts to integrate with MySQL actually. If engineers do not integrate it I will allocate myself some time to try to implement other airflow db operators.
• Use the openlineage one. It is the recommended approach now.
*Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot
*Thread Reply:* To the best of my knowledge there is no documentation to guide through the design of your own extractor. So yes we need to read the code. Here a link where you can see how they did for postgre extractor and others. https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
*Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id
macro and use openlineage-python
library inside, instead of implementing extractor.
*Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly
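Roughly, the inside of such a script could look like this (sketch only - the exact lineage_run_id macro signature and facet helpers should be checked against the integration README; all names below are placeholders):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

def report_lineage(parent_run_id: str):
    # parent_run_id would be templated into the task via the lineage_run_id macro
    client = OpenLineageClient(url="http://localhost:5000")

    # the parent facet is what lets Marquez link this job to the Airflow task
    parent = ParentRunFacet.create(
        runId=parent_run_id,
        namespace="my-namespace",
        name="my_dag.my_task",
    )

    client.emit(RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4()), facets={"parent": parent}),
        job=Job(namespace="my-namespace", name="my_dag.my_task.script"),
        producer="https://example.com/my-docker-script",
        inputs=[Dataset(namespace="my-namespace", name="staging.input_table")],
        outputs=[Dataset(namespace="my-namespace", name="warehouse.output_table")],
    ))
```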
*Thread Reply:* To clarify on what airflow operators are supported out of the box:
• postgres
• bigquery
• snowflake
• Great Expectations (with extra config)
See: https://github.com/OpenLineage/OpenLineage/blob/3a1ccbd854bbf202bbe6437bf81786cb01[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py
MySQL is not supported at the moment. We should track it as an issue.
Hi there! I'm trying to enhance the lineage functionality of a data infrastructure I'm working on. All of the tools I found only visualize the relationships between tables before and after the transformation, but the DataHub RFC discusses Field Level Lineage, which I thought was close to the functionality I was looking for. Does OpenLineage support the same functionality? https://datahubproject.io/docs/rfc/active/1841-lineage/field_level_lineage/
*Thread Reply:* OpenLineage doesn't have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148
*Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future
Hello, everyone. I'm trying to work with OL and Airflow 2.1.4 and it doesn't work. I found that OL is supported for Airflow 1.10.12++. Does it support Airflow 2.X.Y?
*Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here: https://github.com/OpenLineage/OpenLineage/issues/205
*Thread Reply:* Thank you for your reply!
*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305 We're labelling it experimental because the config step might change as discussions in the Airflow GitHub evolve. In its current state it will only track successful jobs.
Hi All, I'm working on the openlineage-dbt integration with Marquez as the backend. I want to integrate OL with dbt Cloud; would you please help provide the steps that I need to follow?
*Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview
*Thread Reply:* @SAM Let us know of your progress.
Hey folks!
I'm trying to run dbt-ol
with Redshift target, but I get the following error message
Traceback (most recent call last):
File "/usr/local/bin/dbt-ol", line 61, in <module>
main()
File "/usr/local/bin/dbt-ol", line 54, in main
events = processor.parse().events()
File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 97, in parse
self.extract_dataset_namespace(profile)
File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 368, in extract_dataset_namespace
self.dataset_namespace = self.extract_namespace(profile)
File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 382, in extract_namespace
raise NotImplementedError(
NotImplementedError: Only 'snowflake' and 'bigquery' adapters are supported right now. Passed redshift
I know that Redshift is not the best cloud DWH we can use…
But, still… do you have any plan to support it?
Thanks!
*Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.
*Thread Reply:* Hey @Maciej Obuchowski, yep, will do now! Thanks!
*Thread Reply:* Well... will do tomorrow morning
*Thread Reply:* Here's the issue: https://github.com/OpenLineage/OpenLineage/issues/318
*Thread Reply:* Thanks a lot. I pulled it in the current project.
*Thread Reply:* @Julien Le Dem @Maciej Obuchowski I'm not familiar with the dbt-ol
codebase, but I'm willing to help on this if you guys can give me a bit of guidance
*Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* will work on this today and I'll try to submit a PR by EOD
*Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324
*Thread Reply:* Host would be something like
examplecluster.<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
right?
*Thread Reply:* If you want to look at dbt integration itself, there are two things:
We need to determine how Redshift adapter reports metrics https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L412
And how we can create namespace and job name based on the job naming schema that you created: https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L512
One way to get this info is to run dbt yourself and look at the resulting metadata files in the target dir of the dbt directory
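For example, something as small as this can show what the Redshift adapter actually puts in adapter_response (a sketch against the v2/v3 run_results schema as I understand it; paths and key names may differ):
```
import json

# Inspect dbt's run_results.json to see which stats the adapter reports per node.
with open("target/run_results.json") as f:
    run_results = json.load(f)

for result in run_results.get("results", []):
    print(result.get("unique_id"), result.get("adapter_response", {}))
```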
*Thread Reply:* I figured out how to generate the namespace.
But I can't understand which of the JSON files is inspected for metrics. Is it run_results.json
?
*Thread Reply:* yes, run_results.json
- it's different in bigquery and snowflake, so I presume it's different in redshift too
*Thread Reply:* Regarding namespace: if env_var
is used in profiles.yml
, how is this handled now?
*Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?
*Thread Reply:* If you think it make sense, I can submit a PR to handle dbt profile with env_var
*Thread Reply:* Do you want to run jinja on the dbt profile?
*Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml
, but we only take target path and profile name from it.
*Thread Reply:* The env_var
syntax in the profile is quite simple, I was thinking of extracting the env var name using re
and then retrieving the value from os
*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included. The method is pretty simple:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        undefined_error(msg)
```
*Thread Reply:* Oh cool! I will definitely use this one!
*Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free
*Thread Reply:* So this env_var
method is defined in dbt and not in OpenLineage codebase, right?
*Thread Reply:* dbt is on Apache license
*Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?
*Thread Reply:* I'm asking for guidance here
*Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples
just with the env_var method passed to the render method
*Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L176
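Something along these lines, I think (sketch only; env_var here is a standalone copy of dbt's helper, and error handling is simplified):
```
import os
from typing import Optional

import yaml
from jinja2 import Template


def env_var(var: str, default: Optional[str] = None) -> str:
    # standalone version of dbt's env_var(), without dbt's undefined_error
    if var in os.environ:
        return os.environ[var]
    if default is not None:
        return default
    raise Exception(f"Env var required but not provided: '{var}'")


def load_yaml_with_jinja(path: str) -> dict:
    with open(path) as f:
        raw = f.read()
    rendered = Template(raw).render(env_var=env_var)  # render jinja first
    return yaml.safe_load(rendered)                   # then load yaml from the string
```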
*Thread Reply:* ok, got it. Will try to implement following your suggestions. Thanks @Maciej Obuchowski
*Thread Reply:* We need to:
• read profile.yml
• render it with jinja2.Template
However, to replace the env vars we find, we have to actually search for those env vars…
*Thread Reply:* The dbt method implements that:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        undefined_error(msg)
```
*Thread Reply:* Ok, but I need to pass var to the env_var method. And to pass the var value, I need to look into the loaded Template and search for env var names…
*Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself
*Thread Reply:* you can try the quick example from here, but just pass the env_var
method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples
*Thread Reply:* I'm trying to run
pip install -e ".[dev]"
so that I can test my changes, but I get
ERROR: Could not find a version that satisfies the requirement openlineage-integration-common[dbt]==0.2.3 (from openlineage-dbt[dev]) (from versions: 0.0.1rc7, 0.0.1rc8, 0.0.1, 0.1.0rc5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)
ERROR: No matching distribution found for openlineage-integration-common[dbt]==0.2.3
I don't understand what I'm doing wrong…
*Thread Reply:* can you try installing it manually?
pip install openlineage-integration-common[dbt]==0.2.3
*Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files
*Thread Reply:* Yep, maybe it's our internal PyPI repo which is not synced. Installing from the public PyPI resolved the issue
*Thread Reply:* Can't seem to make env_var work as the render method of a Template
*Thread Reply:*
```
import os
from typing import Optional
from jinja2 import Template


def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        raise Exception(msg)


if __name__ == '__main__':
    t = Template("Hello {{ env_var('ENV_VAR') }}!")
    print(t.render(env_var=env_var))
```
*Thread Reply:* works for me:
mobuchowski@thinkpad [18:57:14] [~]
-> % ENV_VAR=world python jinja_example.py
Hello world!
*Thread Reply:* Finally! https://github.com/OpenLineage/OpenLineage/pull/328
There are minimal tests for Redshift and env vars. Feedback and suggestions are welcome!
*Thread Reply:* Hi @Maciej Obuchowski, regarding this comment https://github.com/OpenLineage/OpenLineage/pull/328#discussion_r726586564
How can we distinguish between snowflake, bigquery and redshift in this method?
A simple, but not very clean solution, would be to split this
```
bytes = get_from_multiple_chains(
    node.catalog_node,
    [
        ['stats', 'num_bytes', 'value'],  # bigquery
        ['stats', 'bytes', 'value'],      # snowflake
        ['stats', 'size', 'value']        # redshift (Note: size = count of 1MB blocks)
    ]
)
```
into two pieces, one checking for snowflake and bigquery and the other checking for redshift.
A better solution would be to have the profile type inside the method node_to_output_dataset, but I'm struggling to understand how to do that
*Thread Reply:* Well, why not do something like
```
bytes = get_from_multiple_chains(... rest of stuff)

if adapter == 'redshift':
    bytes *= 1024 * 1024
```
*Thread Reply:* we can store adapter type in the class
*Thread Reply:* well, I've looked at the last commit and that's exactly what you did
*Thread Reply:* Now, have you tested your branch on real redshift cluster? I don't think we 100% need automated tests for that now, but would be nice to have confirmation that it works.
*Thread Reply:* Not yet, but I'll try to do that this afternoon. Need to figure out how to build the lib locally, then I can use it to test with Redshift
*Thread Reply:* I think pip install -e .[dbt]
in common directory should be enough
*Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?
*Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet
*Thread Reply:* Regarding the namespace, I will check it and figure it out. Regarding the error: in the context of this PR, is it something I should worry about or not?
*Thread Reply:* I think not in the context of the PR. It certainly deserves separate issue in Marquez repository.
*Thread Reply:* Is there anything else I can do to improve the PR?
*Thread Reply:* did you figure out the namespace stuff? I think it's ready to be merged outside of that
*Thread Reply:* Ok i figured it out.
When running dbt locally, we connect to Redshift using an SSH tunnel.
dbt runs on Docker, hence it can access the tunnel using host.docker.internal
*Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.
*Thread Reply:* merged your PR
*Thread Reply:* I think I'm going to change it up a bit. The problem is that we can try to render jinja everywhere, including comments. I tried to make it skip unknown methods and values here, but I think the right solution is to load the yaml, and then try to render jinja for values.
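Sketch of that load-first approach: parse the yaml, then render jinja only on scalar string values. Comments are dropped by the yaml parser, so they never get rendered (env_var is the same standalone helper as in the earlier sketch):
```
import os
from typing import Any, Optional

import yaml
from jinja2 import Template


def env_var(var: str, default: Optional[str] = None) -> str:
    if var in os.environ:
        return os.environ[var]
    if default is not None:
        return default
    raise Exception(f"Env var required but not provided: '{var}'")


def render_values(node: Any) -> Any:
    # walk the parsed yaml and render jinja only on scalar strings
    if isinstance(node, dict):
        return {key: render_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [render_values(value) for value in node]
    if isinstance(node, str):
        return Template(node).render(env_var=env_var)
    return node


def load_profile(path: str) -> dict:
    with open(path) as f:
        profile = yaml.safe_load(f)
    return render_values(profile)
```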
*Thread Reply:* Does it work with simply dbt run
?
*Thread Reply:* also, do you have dbt-snowflake
installed?
*Thread Reply:* it works with dbt run
*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run or was it something else?
*Thread Reply:* it feels very weird to me, since the dbt-ol
script just runs dbt run
underneath
*Thread Reply:* this is my profiles.yml file:
```
snowflake:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xxxxxxx

      # User/password auth
      user: xxxxxx
      password: xxxxx

      role: poc_db_temp_fullaccess
      database: POC_DB
      warehouse: poc_wh
      schema: temp
      threads: 2
      client_session_keep_alive: False
      query_tag: dbt_ol
```
*Thread Reply:* Yes, it looks that everything is okay on your side...
*Thread Reply:* maybe I'll restart my machine and try again
*Thread Reply:* can you try
OPENLINEAGE_URL=<http://localhost:5000> dbt-ol debug
*Thread Reply:* Good that you fixed that one. Regarding the last one, I found it independently yesterday and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322
*Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900
Hi, I just started playing around with Marquez. When submitting some lineage data, after some experimenting, the visualisation becomes a bit cluttered with all the naive attempts of building a meaningful graph. Can I clear this up somehow? Or is there a tip, how to hide certain information?
*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything.
./docker/up.sh
*Thread Reply:* I guess that it's the easiest way now. There should be API for that.
*Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754
*Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.
*Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0
release
*Thread Reply:* Thanks Willy.
*Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323
The next OpenLineage monthly meeting is on the 13th. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting please chime in here if you'd like a topic to be added to the agenda
*Thread Reply:* Reminder that the meeting is today. See you soon
*Thread Reply:* The recording and notes of the meeting are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Oct13th2021
@channel: We've recently become aware that our integration with dbt no longer works with the latest dbt manifest version (v3), see original discussion. The manifest version change was introduced in dbt 0.21, see diff. That said, we do have a fix: PR #322 contributed by @Maciej Obuchowski! Here's our plan to roll out the openlineage-dbt hotfix for those using the latest version of dbt (NOTE: for those using an older dbt version, you will NOT be affected by this bug):
Releasing OpenLineage 0.2.3 with dbt v3 manifest support:
• branch off the 0.2.2 tagged commit, and create an openlineage-0.2.x branch
• cherry-pick the v3 fix
• cut a 0.2.3 patch release
We will be releasing 0.2.3 today. Please reach out to us with any questions!
*Thread Reply:* For people following along, dbt changed the schema of its metadata which broke the openlineage integration. However, we were a bit too stringent on validating the schema version (they increment it every time even if it's backwards compatible, which it is in this case). We will fix that so that future compatible changes don't prevent the OL integration from working.
*Thread Reply:* As one of the main integrations, it would be good to connect more with the dbt community for the next releases, by testing the release candidates
Thanks for the PR
*Thread Reply:* Yeah, I totally agree with you. We should also be more proactive and more aware of what's coming in future dbt releases. Sorry if you were affected by this bug :ladybug:
*Thread Reply:* We've released OpenLineage 0.2.3 with the hotfix adding dbt v3 manifest support, see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3
You can download and install openlineage-dbt
0.2.3 with the fix using:
$ pip3 install openlineage-dbt==0.2.3
Hello. I have a question about dbt-ol. I run dbt in a docker container and alias the dbt command to execute in that docker container. dbt-ol doesn't seem to use that alias. Do you know of a way to force it to use the alias?...or is there an alternative to getting the linage into Marquez?
*Thread Reply:* @Drew Bittenbender dbt-ol
always calls dbt
command now, without spawning shell - so it does not have access to bash aliases.
Can you elaborate about your use case? Do you mean that dbt
in your path does docker run
or something like this? It still might be a problem if we won't have access to artifacts generated by dbt
in target directory.
*Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.
It seems that by installing openlineage-dbt in a virtual environment, it pulls down its own version of dbt which it calls inline, rather than shelling out and executing the dbt setup resident in the host system. I understand that opening a shell is a security risk so that is understandable.
*Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.
For now I think you could build your own image based on official one, and install openlineage-dbt inside, something like:
FROM fishtownanalytics/dbt:0.21.0
RUN pip install openlineage-dbt
ENTRYPOINT ["dbt-ol"]
*Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run
*Thread Reply:* Also, to make sure that using shell would help in your case: do you bind mount your dbt directory to home? dbt-ol
can't run without access to dbt's target
directory, so if it's not visible in host, the only option is to have dbt-ol
in container.
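Putting that together, running it could look roughly like this (image tag, mount paths, and the Marquez URL are just examples):
```
docker build -t dbt-ol-image .
docker run --rm \
  -e OPENLINEAGE_URL=http://marquez:5000 \
  -e OPENLINEAGE_NAMESPACE=my_namespace \
  -v $(pwd):/usr/app \
  -v $(pwd)/profiles.yml:/root/.dbt/profiles.yml \
  dbt-ol-image run
```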
*Thread Reply:* Regarding 2), the data is only visible after next dbt-ol run
- dbt docs generate
does not emit events itself, but generates data that the next run takes into account.
*Thread Reply:* Do they have it in dbt docs?
Hey folks, DCO checks on this PR https://github.com/OpenLineage/OpenLineage/pull/328 seem to be stuck. Any suggestions on how to unblock it?
Thanks!
*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on my one. Maybe it's globally stuck?
We are working on the hackathon and have a couple of questions about generating lineage information. @Willy Lulciuc would you have time to help answer a couple of questions?
• Is there a way to generate OpenLineage output that contains a mapping between input and output fields?
• In Azure Databricks, sources often map to ADB mount points. We are looking for a way to translate this into source metadata in the OL output. Is there some configuration that would make this possible, or any other suggestions?
*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields? OpenLineage defines discrete classes for both OpenLineage.InputDataset and OpenLineage.OutputDataset datasets. But, for clarification, are you asking:
*Thread Reply:* > In Azure Databricks sources often map to ADB mount points. We are looking for a way to translate this into source metadata in the OL output. Is there some configuration that would make this possible, or any other suggestions?
I would look into our OutputDatasetVisitors class (as a starting point) that extracts metadata from the spark logical plan to construct a mapping between a logic plan
to one or more OpenLineage.Dataset
for the spark job. But, I think @Michael Collado will have a more detailed suggestion / approach to what youâre asking
*Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)
*Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.
*Thread Reply:* Do you use the dbfs
scheme to access the files from Spark as in the example on that page?
df = spark.read.text("dbfs:/mymount/my_file.txt")
*Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element, with one set for the output and another for the input. I would need the mapping between input and output columns to generate column-level lineage, so I'm wondering whether it is possible to get that or if I'm just missing it somewhere? Thanks for your help!
*Thread Reply:* Ahh, well currently, no, but it has been discussed and it is on the OpenLineage roadmap. Here's a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage
facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!
*Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path
The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.
*Thread Reply:* For column level lineage, you can add your own custom facets: Here's an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java Here is the paragraph about this in the spec: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
*Thread Reply:* This example adds facets to the run, but you can also add them to the job
*Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done
*Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want
*Thread Reply:* Thank you guys!!
Question on the Spark Integration and its SPARK_CONF_URL_KEY configuration variable.
It looks like I can pass in any url but I'm not sure if I can pass in query parameters along with that URL. For example, if I had https://localhost/myendpoint?secret_code=123 I THINK that is used for the endpoint and it does not append /lineage to the end of the url. Is that a fair assessment of what happens when the url is provided?
Thank you for any guidance!
*Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
*Thread Reply:* SparkSession.builder()
.config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.+")
.config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
.config("spark.openlineage.host", "<https://localhost>")
.config("spark.openlineage.apiKey", "your api key")
.config("spark.openlineage.namespace", "<NAMESPACE_NAME>") // Replace with the name of your Spark cluster.
.getOrCreate()
*Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java
*Thread Reply:* the apiKey setting is sent in an âAuthorizationâ header
*Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.
I think we might need to add in a url parameter configuration for our hackathon. We're using a bit of serverless code to shuttle open lineage events to a queue so that another job and/or serverless application can read that queue at its leisure.
Using the apiKey that feeds into the Authorization header as a Bearer token is great and would suffice, but for our services we use OAuth tokens that would expire after two hours AND most of our customers wouldn't want to generate an access token themselves and feed it to Spark.
Would you guys entertain a proposal to support a spark.openlineage.urlParams configuration variable that lets you add url parameters to the derived lineage url?
Thank you for the detailed replies and deep links!
*Thread Reply:* Yes, please open an issue detailing the use case.
Quick question: is it expected, when using Spark SQL and the Spark integration for Spark 3, that we receive an INPUT but no OUTPUTS when doing a CREATE TABLE ... AS SELECT ...
.
I'm reading from a Spark SQL table (underlying CSV) and then writing it to a DELTA lake table.
I get a COMPLETE event type with an INPUT but no OUTPUT, and then I get an exception from the AsyncEvent queue, but I'm guessing it's unrelated
21/10/13 15:38:15 INFO OpenLineageContext: Lineage completed successfully: ResponseMessage(responseCode=200, body=null, error=null) {"eventType":"COMPLETE","eventTime":"2021-10-13T15:38:15.878Z","run":{"runId":"2cfe52b3-e08f-4888-8813-ffcdd2b27c89","facets":{"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":{"@class":"org.apache.spark.sql.catalyst.plans.logical.Project","traceEnabled":false,"streaming":false,"cacheId":null,"canonicalizedPlan":false},"inputAttributes":[{"name":"id","type":"long","metadata":{}}],"outputAttributes":[{"name":"id","type":"long","metadata":{}},{"name":"action_date","type":"date","metadata":{}}]},"inputs":[{"description":{"@class":"org.apache.spark.sql.catalyst.plans.logical.Range","streaming":false,"traceEnabled":false,"cacheId":null,"canonicalizedPlan":false},"inputAttributes":[],"outputAttributes":[{"name":"id","type":"long","metadata":{}}]}]},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"id","dataType":"long","nullable":false,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":111,"jvmId":"4bdfd808-97d5-455f-ad6a-a3b29855e85b"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.Alias","num-children":1,"child":0,"name":"action_date","exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":113,"jvmId":"4bdfd808_97d5_455f_ad6a_a3b29855e85b"},"qualifier":[],"explicitMetadata":{},"nonInheritableMetadataKeys":"[__dataset_id, __col_position]"},{"class":"org.apache.spark.sql.catalyst.expressions.CurrentDate","num_children":0,"timeZoneId":"Etc/UTC"}]],"child":0},{"class":"org.apache.spark.sql.catalyst.plans.logical.Range","num-children":0,"start":0,"end":5,"step":1,"numSlices":8,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"id","dataType":"long","nullable":false,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":111,"jvmId":"4bdfd808-97d5-455f-ad6a-a3b29855e85b"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"sparknamespace","name":"databricks_shell.project"},"inputs":[],"outputs":[],"producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent>"}
21/10/13 15:38:16 INFO FileSizeAutoTuner: File size tuning result: {"tuningType":"autoTuned","tunedConfs":{"spark.databricks.delta.optimize.minFileSize":"268435456","spark.databricks.delta.optimize.maxFileSize":"268435456"}}
21/10/13 15:38:16 INFO FileFormatWriter: Write Job e062f36c-8b9d-4252-8db9-73b58bd67b15 committed.
21/10/13 15:38:16 INFO FileFormatWriter: Finished processing stats for write job e062f36c-8b9d-4252-8db9-73b58bd67b15.
21/10/13 15:38:18 INFO CodeGenerator: Code generated in 253.294028 ms
21/10/13 15:38:18 INFO SparkContext: Starting job: collect at DataSkippingReader.scala:430
21/10/13 15:38:18 INFO DAGScheduler: Job 1 finished: collect at DataSkippingReader.scala:430, took 0.000333 s
21/10/13 15:38:18 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:167)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:39)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at <a href="http://org.apache.spark.scheduler.AsyncEventQueue.org">org.apache.spark.scheduler.AsyncEventQueue.org</a>$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1547)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
*Thread Reply:* This is because this specific action is not covered yet. You can see the "spark_unknown" facet is describing things that are not understood yet
run": {
...
"facets": {
"spark_unknown": {
...
"output": {
"description": {
"@class": "org.apache.spark.sql.catalyst.plans.logical.Project",
"traceEnabled": false,
"streaming": false,
"cacheId": null,
"canonicalizedPlan": false
},
*Thread Reply:* I think this is part of the Spark 3 gap
*Thread Reply:* an unknown output will cause missing output lineage
*Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java
*Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!
Following up on the meeting this morning, I have created an issue to formalize a design doc review process: https://github.com/OpenLineage/OpenLineage/issues/336 If that sounds good I'll create the first doc to describe this as a PR. (how meta!)
*Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I'd rather avoid those)
We're discussing creating Transport
abstraction for OpenLineage clients, that would allow us creating better experience for people that expect to be able to emit their events using something else than http
interface. Please tell us what you think of the proposed mechanism - encouraging emojis are helpful too
https://github.com/OpenLineage/OpenLineage/pull/344
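To give a feel for the kind of flexibility being discussed, here is one possible shape of such an abstraction. This is not the design in the PR, just an illustrative sketch:
```
import json
from abc import ABC, abstractmethod

import requests


class Transport(ABC):
    """Pluggable emitter: the client hands a serialized event to a transport."""
    @abstractmethod
    def emit(self, event: dict) -> None:
        ...


class HttpTransport(Transport):
    def __init__(self, url: str):
        self.url = url

    def emit(self, event: dict) -> None:
        requests.post(f"{self.url}/api/v1/lineage", json=event, timeout=5)


class ConsoleTransport(Transport):
    def emit(self, event: dict) -> None:
        print(json.dumps(event))


class KafkaTransport(Transport):
    """Async-style backend: publish the event to a topic instead of POSTing it."""
    def __init__(self, producer, topic: str):
        self.producer = producer  # e.g. a kafka-python KafkaProducer
        self.topic = topic

    def emit(self, event: dict) -> None:
        self.producer.send(self.topic, json.dumps(event).encode("utf-8"))
```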
OpenLineage release 0.3 is coming. Please chime in if there's anything blocking that should go in the release: https://github.com/OpenLineage/OpenLineage/projects/4
Hi everyone!
openlineage with DBT and Trino, is there any forecast?
*Thread Reply:* Maybe you want to contribute it? It's not that hard, mostly testing, and figuring out what would be the naming of openlineage namespace for Trino, and how some additional statistics work.
For example, recently we had added support for Redshift by community member @ale
https://github.com/OpenLineage/OpenLineage/pull/328
Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Traceback (most recent call last):
  File "/home/labuser/.local/bin/dbt-ol", line 61, in <module>
    main()
  File "/home/labuser/.local/bin/dbt-ol", line 54, in main
    events = processor.parse().events()
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 98, in parse
    self.extract_dataset_namespace(profile)
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 377, in extract_dataset_namespace
    self.dataset_namespace = self.extract_namespace(profile)
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 391, in extract_namespace
    raise NotImplementedError(
NotImplementedError: Only 'snowflake' and 'bigquery' adapters are supported right now. Passed trino
Hey folks, we've released OpenLineage 0.3.1. There are quite a few changes, including doc improvements, Redshift support in dbt, bugfixes, and a new server-side client code base, but the real highlight is the LineageBackend implementation that allows us to begin tracking lineage for successful Airflow 2 DAGs. We're working to support failure notifications so we can also trace failed jobs. The LineageBackend can also be enabled in Airflow 1.10.X to improve the reporting of task completion times.
Check the READMEs for more details and to get started with the new features. Thanks to @Maciej Obuchowski, @Oleksandr Dvornik, @ale, and @Willy Lulciuc for their contributions. See the full changelog.
Hello community. I am starting to use Marquez. I tried to connect dbt with Marquez, but the spark adapter is not yet available.
Are you planning to implement this spark dbt adapter in upcoming OpenLineage versions?
NotImplementedError: Only 'snowflake', 'bigquery', and 'redshift' adapters are supported right now. Passed spark
In my company we are also starting to use the Athena dbt adapter. Are you planning to implement this integration? Thanks a lot, community
*Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?
*Thread Reply:* I would like to Julien, but not sure how can I do it. Could you guide me how can i start? or show me other integration.
*Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328
*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt Spark integration
Hey folks, quick question, are we able to run dbt-ol
without providing OPENLINEAGE_URL
? I find it quite limiting that I need to have a service set up in order to emit/generate OL events/messages. Is there a way to just output them to the console?
*Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286
Did you think about this?
*Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG
*Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344
*Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.
*Thread Reply:* Also, dbt build
is not working which is kind of the biggest feature of the version 0.21.0, I will try testing the code with modifications to the https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141
I guess there's a reason for it that I didn't see since you support v3 of the manifest.
*Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate
has been run before dbt-ol run
?
*Thread Reply:* Tried with dbt
versions 0.20.2 and 0.21.0
, openlineage-dbt==0.3.1
*Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build
might be a little larger task.
*Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376
*Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383
I'm looking for feedback on my proposal to improve the proposal process! https://github.com/OpenLineage/OpenLineage/issues/336
Hey guys - just an update on my prefect PR (https://github.com/OpenLineage/OpenLineage/pull/293) - there's a little spiel on the ticket, but I've closed that PR in favour of opening a new one. Prefect have just released a 2.0a technical preview, which they would like to make stable near the start of next year. I think it makes sense to target this release, and I've had one of the Prefect team reach out who is keen to get some sort of lineage implemented in Prefect.
*Thread Reply:* If anyone has any questions or comments - happy to discuss here
*Thread Reply:* Thanks for updating the community, Brad!
*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2
Hello, joining here from Prefect. Because of community requests from users like Brad above, we are looking to implement lineage for Prefect this quarter. Good to meet you all!
*Thread Reply:* Welcome, @Kevin Kho. Really excited to see this integration kick off!
Hello,
I am integrating OpenLineage with Airflow 2.2.0.
Do you plan to take Airflow's manually configured inlets and outlets into account in the future?
Seeing the documentation I can see that is not possible.
OpenLineageBackend does not take into account manually configured inlets and outlets.
Thanks
*Thread Reply:* While it's not something we're supporting at the moment, it's definitely something that we're considering!
If you can give me a little more detail on what your system infrastructure is like, it'll help us set priority and design
*Thread Reply:* So, a basic architecture of a datalake. We are using airflow to trigger jobs. Every job is a pipeline that runs a spark job (in our case it spins up an EMR). So the idea of lineage would be defining inlets and outlets in the DAGs, based on the airflow lineage:
https://airflow.apache.org/docs/apache-airflow/stable/lineage.html
I think you need to be able to include these inlets and outlets in the picture of openlineage
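For reference, the manual annotations being referred to look roughly like this in Airflow 2 (a sketch using airflow.lineage.entities; DAG, task, and table names are placeholders, and as discussed the OpenLineage backend does not consume these yet):
```
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import Table
from airflow.operators.bash import BashOperator

# Placeholder datasets described with Airflow's lineage entities
raw_trades = Table(database="raw", cluster="emr", name="trades")
curated_trades = Table(database="curated", cluster="emr", name="trades_enriched")

with DAG(
    dag_id="datalake_pipeline",
    start_date=datetime(2021, 11, 1),
    schedule_interval=None,
) as dag:
    submit_spark_job = BashOperator(
        task_id="submit_spark_job",
        bash_command="spark-submit my_job.py",
        inlets=[raw_trades],       # datasets the task reads
        outlets=[curated_trades],  # datasets the task writes
    )
```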
*Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
*Thread Reply:* because there are some other jobs that are not spark: some jobs run in dbt, other jobs run in redshift @Maciej Obuchowski
*Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor
from airflow integration should cover Redshift if you're using it from PostgresOperator
It's definitely an interesting use case - you'd be using most of the existing integrations we have.
*Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?
*Thread Reply:* I am using Redshift with PostgresOperator and it is returning…
[2021-11-06 03:43:06,541] {{__init__.py:92}} ERROR - Failed to extract metadata 'NoneType' object has no attribute 'host' task_type=PostgresOperator airflow_dag_id=counter task_id=inc airflow_run_id=scheduled__2021-11-06T03:42:00+00:00
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/lineage_backend/__init__.py", line 83, in _extract_metadata
task_metadata = self._extract(extractor, task_instance)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/lineage_backend/__init__.py", line 104, in _extract
task_metadata = extractor.extract_on_complete(task_instance)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/base.py", line 61, in extract_on_complete
return self.extract()
File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/postgres_extractor.py", line 65, in extract
authority=self._get_authority(),
File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/postgres_extractor.py", line 120, in _get_authority
if self.conn.host and self.conn.port:
AttributeError: 'NoneType' object has no attribute 'host'
I can't see this raised as an issue.
Hello, I am trying to integrate Airflow with openlineage.
It is not working for me.
What I tried:
• added openlineage-airflow to requirements.txt
• set the lineage backend to openlineage.airflow.backend.OpenLineageBackend (per the docs I found)
The error:
ModuleNotFoundError: No module named 'openlineage.airflow.backend'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 47, in command
    func = import_string(import_path)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/module_loading.py", line 32, in import_string
    module = import_module(module_path)
  File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/db_command.py", line 24, in <module>
    from airflow.utils import cli as cli_utils, db
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 26, in <module>
    from airflow.jobs.base_job import BaseJob  # noqa: F401
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/__init__.py", line 19, in <module>
    import airflow.jobs.backfill_job
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 29, in <module>
    from airflow import models
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/__init__.py", line 20, in <module>
    from airflow.models.baseoperator import BaseOperator, BaseOperatorLink
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 196, in <module>
    class BaseOperator(Operator, LoggingMixin, TaskMixin, metaclass=BaseOperatorMeta):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 941, in BaseOperator
    def post_execute(self, context: Any, result: Any = None):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/lineage/__init__.py", line 103, in apply_lineage
    _backend = get_backend()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/lineage/__init__.py", line 52, in get_backend
    clazz = conf.getimport("lineage", "backend", fallback=None)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 469, in getimport
    raise AirflowConfigException(
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "backend" key in "lineage" section. Current value: "openlineage.airflow.backend.OpenLineageBackend".
*Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND
2. Please tell us where you found openlineage.airflow.backend.OpenLineageBackend, so we can fix the documentation
*Thread Reply:* https://pypi.org/project/openlineage-airflow/
*Thread Reply:* (I googled it and found that page that seems to have an outdated doc)
*Thread Reply:* @Maciej Obuchowski @Julien Le Dem that's the page I followed. Please revise the documentation, as it is very important
*Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README
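For anyone hitting the same thing, the working combination boils down to something like this (URL and namespace are placeholders):
```
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
OPENLINEAGE_URL=http://marquez:5000
OPENLINEAGE_NAMESPACE=my_namespace
```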
Related to the Airflow integration: is it required to install openlineage-airflow and set up the environment variables in both scheduler and webserver, or just in the scheduler?
*Thread Reply:* I set it up in the scheduler and it starts to log data to Marquez. But it fails with this error:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/openlineage/client/client.py", line 49, in __init__
raise ValueError(f"Need valid url for OpenLineageClient, passed {url}")
ValueError: Need valid url for OpenLineageClient, passed "<http://marquez-internal-eks.eu-west-1.dev.hbi.systems>"
*Thread Reply:* why is it not a valid URL?
*Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine
*Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error
Hello, I am receiving this error today when I deployed openlineage in development environment (not using docker-compose locally).
I am running with KubernetesExecutor
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "backend" key in "lineage" section. Current value: "openlineage.lineage_backend.OpenLineageBackend".
*Thread Reply:* Are you sure that openlineage-airflow
is present in the container?
So in this case in my template I am adding:
```env:
  ADDITIONAL_PYTHON_DEPS: "openpyxl==3.0.3 smart_open==2.0.0 apache-airflow-providers-http apache-airflow-providers-cncf-kubernetes apache-airflow-providers-amazon openlineage-airflow"
  OPENLINEAGE_URL: https://marquez-internal-eks.eu-west-1.dev.hbi.systems
  OPENLINEAGE_NAMESPACE: dns_airflow
  AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_URL: https://marquez-internal-eks.eu-west-1.dev.hbi.systems
  AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_NAMESPACE: dns_airflow

configmap:
  mountPath: /var/airflow/config # mount path of the configmap
  data:
    airflow.cfg: |
      [lineage]
      backend = openlineage.lineage_backend.OpenLineageBackend
pod_template_file.yaml: |
containers:
- args: []
command: []
env:
- name: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_URL
value: <https://marquez-internal-eks.eu-west-1.dev.hbi.systems>
- name: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_NAMESPACE
value: dns_airflow
- name: AIRFLOW__LINEAGE__BACKEND
value: openlineage.lineage_backend.OpenLineageBackend```
I am installing openlineage in the ADDITIONAL_PYTHON_DEPS
*Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS
are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend
?
*Thread Reply:* I am checking this accessing the Kubernetes pod
I see that every task is displayed as a different job. I was expecting to see one job per dag.
Is this the expected behaviour??
*Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737
*Thread Reply:* I do not see any benefit of just having some airflow task metadata. I do not see the relationships between tasks. Every task is a job. When I started working on my company's integration with openlineage, I thought that openlineage would give me relationships between tasks or datasets, and the only thing I see is some metadata about the history of airflow runs that is already provided by airflow
*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features
*Thread Reply:* at this early stage
*Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
*Thread Reply:* We are not using any of those operators: bigquery, postsgress or snowflake.
And what is it doing GreatExpectactions extractor?
It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task, and that that can be the general way to make relationships between datasets
*Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.
*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task I think this is good idea. Overall, OpenLineage strongly focuses on automatic metadata collection. However, using them would be a nice fallback for not-covered-yet cases.
> And that the same dag graph can be seen in marquez, and not one job per task. This currently depends on dataset hierarchy. If you're not using any of the covered extractors, then Marquez can't build dataset graph like in the demo: https://raw.githubusercontent.com/MarquezProject/marquez/main/web/docs/demo.gif
With the job hierarchy ticket, probably some graph could be generated using just the job data though.
*Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384
*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use airflow and spark, but I see that it does not yet have the features we would like.
At the moment, the same info we have in Marquez related to the tasks is available in the Airflow UI or via the Airflow API.
The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow. That's why if the airflow inlets/outlets could be used, then it really would make much more sense for us to adopt it.
*Thread Reply:* > how many people are working full time on this library? On the Airflow integration or on OpenLineage overall?
> The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow. I think there are three options there:
*Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case
@Maciej Obuchowski even I can contribute or help with my ideas (from my view of what lineage should look like from a client's side)
@Maciej Obuchowski I was able to get Airflow working in Kubernetes, pointing to Marquez using the openlineage library. I found a few problems that would be good to comment on.
I see a warning
[2021-11-03 11:47:04,309] {great_expectations_extractor.py:27} WARNING - Did not find great_expectations_provider library or failed to import it
I couldn't find any information about the GreatExpectationsExtractor. Could you tell me what this extractor is about?
*Thread Reply:* It should only affect you if you're using https://greatexpectations.io/
*Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:
WARNING:/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/great_expectations_extractor.py:Did not find great_expectations_provider library or failed to import it
I am not using great expectations in the DAG.
I see a few priorities for Airflow integration:
*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one DAG, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737 which will affect the visual layer.
Regarding 2), I think we need to agree with the Airflow maintainers on the long-term mechanism on which OL will work: https://github.com/apache/airflow/issues/17984
I think using inlets/outlets as a fallback mechanism when we're not doing automatic metadata extraction is a good idea, but we don't know if the hypothetical future mechanism will have access to them. It's hard to commit to a mechanism which might disappear soon.
Another option is that I build my own extractor. Do you have any example of how to create a custom extractor? How can I apply that custom extractor to specific operators? Is there a way to link an extractor with an operator, so that at runtime Airflow knows which extractor to run?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#custom-extractors
I think you can base your code on any existing extractor, like PostgresExtractor: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/postgres_extractor.py#L53
Custom extractors work just like built-in ones; you just need to add a bit of mapping between operator and extractor, like OPENLINEAGE_EXTRACTOR_PostgresOperator=openlineage.airflow.extractors.postgres_extractor.PostgresExtractor
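To make that concrete, here is a rough sketch of a custom extractor, modeled on the PostgresExtractor linked above; the module path, operator name, and the exact BaseExtractor/TaskMetadata shapes are assumptions to verify against the openlineage-airflow version you run:
```
# my_company/extractors.py -- hypothetical module; check BaseExtractor and
# TaskMetadata against the openlineage-airflow release you are using.
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class MyOperatorExtractor(BaseExtractor):
    def extract(self) -> TaskMetadata:
        # self.operator is the Airflow operator instance being executed;
        # read its connection / table attributes and build datasets from them.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # input datasets derived from the operator
            outputs=[],  # output datasets derived from the operator
        )
```
It would then be wired up with the same environment-variable pattern, e.g. OPENLINEAGE_EXTRACTOR_MyOperator=my_company.extractors.MyOperatorExtractor.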
*Thread Reply:* Thank you very much @Maciej Obuchowski
Last question of the morning: running a task that failed, I could see that no information appeared in Marquez. Is this expected to happen? I would like to see in Marquez the whole history of runs, successful and unsuccessful.
*Thread Reply:* It worked like that in Airflow 1.10.
This is an unfortunate limitation of the LineageBackend API that we're using for Airflow 2. We're trying to work out a solution for this with the Airflow maintainers: https://github.com/apache/airflow/issues/17984
Hello openlineage community.
Yesterday I tried the integration with spark.
The result was not satisfactory. This is what I did:
.config("spark.jars.packages", "io.openlineage:openlineage_spark:0.3.1")
.config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
.config("spark.openlineage.url", "<https://marquez-internal-eks.eu-west-1.dev.hbi.systems/api/v1/namespaces/spark_integration/>"
This job was doing a spark.read from 2 different JSON locations.
It does a Spark write to 5 different parquet locations in S3.
The job finished successfully, and the result in Marquez is: if I enter the bucket namespaces I see nothing inside.
*Thread Reply:* This job with no output is a symptom of the output not being understood. You should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.
*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect.
The page froze and no link from the menu works. Apart from that I see that there are no messages in the logs
*Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)
Hi #general, I'm a data engineer for a UK-based insuretech (part of one of the biggest UK retail insurers). We run a series of tech meetups and we'd love to have someone from the OpenLineage project give us a demo of the tool. Would anyone be interested? (DM me if so!)
Hi! Is there an example of tracking lineage when using Pandas to read/write and transform data?
*Thread Reply:* Hi Taleb - I don't know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code: https://openlineage.io/docs/openapi/
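As an illustration of that manual approach, a sketch of wrapping a Pandas step with runEvents using the openlineage-python client (the namespace, job, and dataset names are invented, and constructor signatures should be checked against your client version):
```
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez or any OpenLineage endpoint
run = Run(runId=str(uuid.uuid4()))
job = Job(namespace="pandas_jobs", name="clean_orders")
producer = "https://example.com/my-pandas-pipeline"  # identifies your own code

client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
))

# ... pandas read / transform / write happens here ...

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="s3://my-bucket", name="raw/orders.csv")],
    outputs=[Dataset(namespace="s3://my-bucket", name="clean/orders.parquet")],
))
```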
*Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png
*Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it's not something that work has started on yet.
*Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I'll get you more info about potential future plans and the considerations around them when I know more.
*Thread Reply:* So, Pandas is tricky because, unlike Airflow, dbt, or Spark, Pandas doesn't own the whole flow, and you might dip in and out of it to use other Python packages (at least I did when I was doing more data science).
We have this issue open in OpenLineage that you should go +1 to help with our planning.
*Thread Reply:* interesting... what if it were instead on all the read_* and to_* functions?
Hi! I am working alongside David on integrating OpenLineage into our data pipelines. I have a question about Marquez's and OpenLineage's divergent APIs:
That is to say, these 2 APIs differ:
https://openlineage.io/docs/openapi/
https://marquezproject.github.io/marquez/openapi.html
This makes sense since they are at different layers of abstraction, but Marquez requires a few things that are absent from OpenLineage's API, for example the type in a data source, and the distinction between physicalName and sourceName in Datasets. Is that intentional? And can these be set using the OpenLineage API as some additional facets or keys? I noticed that the DatasourceDatasetFacet has a map of additionalProperties.
*Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they're already slated for deprecation soon.
If you POST an OpenLineage runEvent to the /lineage endpoint in Marquez, it'll create any missing jobs or datasets that are relevant.
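For example, posting a minimal runEvent straight to Marquez could look like the sketch below (the /api/v1/lineage path matches the Marquez API; the namespace, job, and dataset names are made up):
```
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my_namespace", "name": "my_job"},
    "inputs": [{"namespace": "my_source", "name": "public.input_table"}],
    "outputs": [{"namespace": "my_source", "name": "public.output_table"}],
    "producer": "https://example.com/my-producer",
}

# Marquez ingests OpenLineage events on /api/v1/lineage and answers 201 Created.
resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()
```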
*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface, e.g.
http://localhost:5000/api/v1/namespaces/testing_java/datasets/incremental_data
as that currently returns the Marquez version of a dataset, including default-set fields for type and the above-mentioned properties?
*Thread Reply:* I believe the intention for type is to support a new facet. TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark - maybe it's time to address that deficiency.
I don't actually know what happened to the datasource type field- maybe @Julien Le Dem can comment on whether that field was dropped intentionally or whether it was an oversight.
*Thread Reply:* It looks like an oversight; currently Marquez hard-codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438
*Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12
The next OpenLineage monthly meeting is this coming Wednesday at 9am PT. The tentative agenda is:
• OL Client use cases for Apache Iceberg [Ryan]
• OpenLineage and Azure Purview [Shrikanth]
• Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
• OpenLineage last release overview (0.3.1)
  ◦ Facet versioning
  ◦ Airflow 2 / Spark 3 support, dbt improvements
• OpenLineage 0.4 scope review
  ◦ Proxy Backend (Issue #152)
  ◦ Spark, Airflow, dbt improvements (documentation, coverage, ...)
  ◦ improvements to the OpenLineage model
• Open discussion
*Thread Reply:* If you want to add something please chime in this thread
*Thread Reply:* The monthly meeting is happening tomorrow. The Purview team will present at the December meeting instead. See the full agenda here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting You are welcome to contribute.
*Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0
*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) I have a few TODOs to follow up on tickets
The next release of OpenLineage is being scoped: https://github.com/OpenLineage/OpenLineage/projects/6 Please chime in if you want to raise the priority of something or are planning to contribute
Hi, I have been looking at OpenLineage for some time, and I really like it. It is a very simple specification that covers a lot of use cases. You can create any provider or consumer in a very simple way, so that's pretty powerful. I have some questions about things that are not clear to me. I am not sure if this is the best place to ask; please refer me to another place if this is not appropriate.
*Thread Reply:* How do you model continuous processes (not batch processes)? For example, a Flume or Spark job that does some real-time processing on data.
Maybe it's simply a "Job", but then what is a run?
*Thread Reply:* How do you model consumers at the end - they can be reports, data applications, ML model deployments, APIs, GUIs consumed by end users?
Have you considered having some examples of different use cases like those?
*Thread Reply:* By definition, a Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? An important use case for lineage is troubleshooting or error notifications (e.g. mark a report or job as temporarily in a bad state if an upstream data integration is broken). To be able to do that you need to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g. insert into output_1 select * from x,y group by a,b). But what are the cases where you'd want to see multiple outputs? You can have a single process produce multiple tables (as in the above example) but they'd always be separate queries; the actual inputs for each output would be different.
But having multiple outputs creates ambiguity: if x or y is broken but there are multiple outputs, I do not know which one is really impacted.
*Thread Reply:* > How do you model continuous processes (not batch processes)? For example, a Flume or Spark job that does some real-time processing on data.
> Maybe it's simply a "Job", but then what is a run?
Every continuous process eventually has an end - for example, you can deploy a new version of your Flink pipeline. The new version would be the next Run for the same Job.
Moreover, the OTHER event type is useful to update metadata like the amount of processed records. In this Flink example, it could be emitted per checkpoint.
I think more attention will be given to streaming use cases soon.
*Thread Reply:* > How do you model consumers at the end - they can be reports, data applications, ML model deployments, APIs, GUIs consumed by end users?
Our reference implementation is a web application: https://marquezproject.github.io/marquez/
We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.
*Thread Reply:* > By definition, a Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive?
I think this is too SQL-centric a view.
Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.
If your application does not have multiple outputs, then I don't see how a specification allowing them would impact you.
*Thread Reply:* > We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.
Yes, I think it would be great if we expanded on potential usages - if the OpenLineage documentation (perhaps) had all kinds of examples for different use cases or case studies. A financial or healthcare industry case study, and how someone would do an integration with OpenLineage, would make it easier to understand the concepts and make sure things are modeled consistently.
*Thread Reply:* > I think this is too SQL-centric a view
> Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.
Thanks for answering @Maciej Obuchowski.
Even in SQL you can have multiple outputs if you look at things at the transaction level. I was simply using it as an example.
Maybe it would be clearer what I mean with another example. Let's say we have these phases:
Let's look at these two cases:
In 1. open lineage run event could look like {inputs: [ui, inventory], outputs: [s3, db] }
In 2. the user can either do the same as 1. (because the data changed or due to copy-paste), which would be an error since both inputs do not go to both outputs.
Likely accurate one would be
{inputs: [ui], outputs: [s3] }
{inputs: [ui], outputs: [db] }
If the specification required a single output, then:
{inputs: [ui, inventory], outputs: [s3] } ; {inputs: [ui, inventory], outputs: [db] }
which is still correct, if more verbose.
{inputs: [ui], outputs: [s3] } ; {inputs: [ui], outputs: [db] }
The more restrictive specification seems to lower the chance for an error, doesn't it?
Also, if tools knew the spec guarantees a single output, they'd be able to write tracing capabilities that are more precise, because the structure would allow for less ambiguity. Storage backends that implement the spec could perhaps also be written in more optimal ways. I have not looked into the accuracy of those hypotheses, though.
Those were the thoughts behind my question. I'd be curious if there's a document on the research of pros/cons and alternatives for the design of the current specification.
*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column-level lineage though - when we have it. An OL consumer could look at particular columns and derive which table contained the particular error.
> 2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB
Does that actually happen? I understand this in the case of a job, but having a single operator write to two different systems seems like bad design. Wouldn't that leave the possibility of breaking exactly-once, unless you go fully into two-phase commit?
*Thread Reply:* > Does that actually happen? I understand this in the case of a job, but having a single operator write to two different systems seems like bad design
In a Spark or Flink job it is less likely, now that you mention it. But in a batch job (an Airflow Python or Kubernetes operator, for example) users could do anything, and then they'd need lineage to figure out what is wrong, even if what they did is suboptimal.
> I see what you're trying to model.
I am not trying to model something specific. I am trying to understand how OpenLineage would be used in different organisations/companies and use cases.
> I think this could be solved by column level lineage though
Is there something specific planned? I could not find a ticket on GitHub. I thought you could use Dataset facets - Schema, for example, could be a subset of columns for a table …
*Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148
How do you delete jobs/runs from Marquez/OpenLineage?
*Thread Reply:* We're adding APIs to delete metadata in Marquez 0.20.0. Here's the related issue: https://github.com/MarquezProject/marquez/issues/1736
*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal).
*Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?
*Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events
Am I understanding namespaces correctly? A job namespace is different from a Dataset namespace: job namespaces define a job environment, like Airflow, Spark, or some other system that executes jobs, while Dataset namespaces define data locations, like an S3 bucket, a local file system, or a schema in a database?
*Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. As an additional question: When viewing a Dataset in Marquez will it cross the job namespace bounds? As in, will I see jobs from different job namespaces?
*Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database
*Thread Reply:* Wait, it does work? Marquez was being temperamental
*Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset
*Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744
*Thread Reply:* or, maybe not? It was released already.
Can you create an issue on GitHub with those helpful gifs? @Lyndon Armitage
*Thread Reply:* I think you are right Maciej
*Thread Reply:* Was that patched in 0.19.1?
*Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1
Haven't tested this myself unfortunately.
*Thread Reply:* Perhaps not. It is urlencoding them:
<http://localhost:3000/lineage/dataset/jdbc%3Ah2%3Amem%3Asql_tests_like/HBMOFA.ORDDETP>
But the error seems to be in marquez getting them.
*Thread Reply:* This is an example Lineage event JSON I am sending.
*Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).
*Thread Reply:* A 404 is returned for: http://localhost:3000/api/v1/lineage/?nodeId=dataset:jdbc%3Ah2%3Amem%3Asql_tests_like:HBMOFA.ORDDETP
*Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues
*Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?
*Thread Reply:* Yup, thanks!
I am looking at an AWS Glue Crawler lineage event. The glue crawler creates or updates a table schema, and I have a few questions on aligning to best practice.
*Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?
*Thread Reply:* The table does not exist in the Glue catalog until …
A Glue crawler connects to one or more data stores (in this case S3), determines the data structures, and writes tables into the Data Catalog.
The data/objects are in S3; the Glue catalog is a metadata representation (Hive) of them as a table.
*Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow, not of the data itself?
In that case I'd say that the Glue crawler is a job that outputs a dataset.
*Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset
*Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the Hive/Glue table.
If the Glue table isn't a distinct dataset from the S3 data, how does this compare to a view in a database on top of a table? Are they 2 datasets or just one?
Glue can discover data in remote databases too; in those cases does it make sense to have only the source dataset?
*Thread Reply:* @John Thomas yes, it's the metadata flow.
*Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier
*Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you're looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?
*Thread Reply:* I am working through AWS native services, seeing how they could, can, or do best integrate with OpenLineage (I'm an AWS SA). Hence the questions on best practice.
Aligning with the Spark integration sounds like it might make sense then. Is there an example I could build from?
*Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/
*Thread Reply:* Ahh, in that case I would have to agree with Michael's approach to things!
*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and, while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a requests.get from the PySpark script can reach the endpoint). Is there a way to enable a debug log so I can identify where the issue is? Is there a specific place to look in the regular logs?
*Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
Woo hoo! Initial Spark <-> Kafka support has been merged: https://github.com/OpenLineage/OpenLineage/pull/387
I am "successfully" exporting lineage to OpenLineage from AWS Glue using the listener. Only the source load is showing, not the transforms or the sink.
*Thread Reply:* Output event:
2021-11-22 08:12:15,513 INFO [spark-listener-group-shared] agent.OpenLineageContext (OpenLineageContext.java:emit(50)): Lineage completed successfully: ResponseMessage(responseCode=201, body=, error=null) {
  "eventType": "COMPLETE",
  "eventTime": "2021-11-22T08:12:15.478Z",
  "run": {
    "runId": "03bfc770-2151-499e-9265-8457a38ceec3",
    "facets": {
      "spark_version": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet",
        "spark-version": "3.1.1-amzn-0",
        "openlineage-spark-version": "0.3.1"
      }
    }
  },
  "job": {
    "namespace": "spark_integration",
    "name": "nyctaxirawstage.mappartitionsunionmappartitionsnew_hadoop"
  },
  "inputs": [
    {
      "namespace": "s3.cdkdl-dev-foundationstoragef3787fa8-raw1d6fb60a-171gwxf2sixt9",
      "name": "
*Thread Reply:* This sink record is missing details …
2021-11-22 08:12:15,481 INFO [Thread-7] sinks.HadoopDataSink (HadoopDataSink.scala:$anonfun$writeDynamicFrame$1(275)): nameSpace: , table:
*Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.
*Thread Reply:* Are you using the existing spark integration for the spark lineage?
*Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
In the Glue context I was not clear on the correct settings for "spark.openlineage.parentJobName" and "spark.openlineage.parentRunId"; I put in static values (which may be incorrect).
I injected these via: "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",
*Thread Reply:* Happy to share what is working when I am done; I can't seem to find an AWS Glue specific example to walk me through.
*Thread Reply:* yeah, we haven't spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you're working a little bit more
*Thread Reply:* From what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks), all of what is being done there I am doing in Glue (upload the jar, embed the settings into the Glue Spark job). It is emitting the above for each transform in the Glue job, but does not seem to capture the output …
*Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?
*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README.
Mine from AWS Glue…
21/11/22 18:48:48 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
21/11/22 18:48:49 INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=http://ec2-….compute-1.amazonaws.com:5000, version=v1, namespace=spark_integration, jobName=default, parentRunId=null, apiKey=Optional.empty) URI: http://ec2-….compute-1.amazonaws.com:5000/api/v1/lineage
21/11/22 18:48:49 INFO AsyncEventQueue: Process of event SparkListenerApplicationStart(nyc-taxi-raw-stage,Some(spark-application-1637606927106),1637606926281,spark,None,None,None) by listener OpenLineageSparkListener took 1.092252643s.
*Thread Reply:* We don't have a test run, unfortunately, but you could follow this blog post's process in each and see what the differences are: https://openlineage.io/blog/openlineage-spark/
*Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?
*Thread Reply:* yeah, this thread will work great đ
*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?
*Thread Reply:* Just DM'd you the code I used a while back (app.py + CDK code). I haven't used it in a while, and there is some duplication in it. I had OpenLineage enabled, but dynamic frames were not yet working with lineage. Let me know how you go. I haven't had the space to look at it in a while, but happy to support if you are looking at it.
How do you use OpenLineage with Amundsen?
*Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444
*Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk
*Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas
*Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized
*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In that case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?
*Thread Reply:* Maybe; that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so I haven't had a chance to look deeply into the implementation. But, briefly looking over the config for Openlineagetablelineageextractor, you can only send metadata to Atlas.
*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, Elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen.
*Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?
Hi all - I just watched the presentation on this and Marquez from the Airflow '21 summit. I was pretty impressed with this. My question is: what other open source players are in this space, or are people pretty much consolidating around this (which would be great)? I was looking at the available datasource extractors for the Airflow side and would hope to see more there; looking at the code, it doesn't seem like too huge of a deal. Is there a roadmap available?
*Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects
Hi all, I was wondering what is the status of native support of openlineage for DataHub or Amundzen. re https://openlineage.slack.com/archives/C01CK9T7HKR/p1633633476151000?thread_ts=1633008095.115900&cid=C01CK9T7HKR Many thanks!
Our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, Elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen?
Hello, OpenLineage folks - I'm curious if anyone here has run into an issue like the one we're hitting as we look to extend OpenLineage's Spark integration into Databricks.
Has anyone run into an issue where a Scala class should exist (based on a decompiled jar, I see that it's a public class) but you keep getting an error like object SqlDWRelation in package sqldw cannot be accessed in package com.databricks.spark.sqldw?
Databricks has a Synapse SQL DW connector: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
I want to extract the database URL, table, and schema from the logical plan.
I execute something like the command below, which runs a SELECT * on the given tableName ("borrower" in this case) in the Azure Synapse database.
val df = spark.read.format("com.databricks.spark.sqldw")
.option("url", sqlDwUrl)
.option("tempDir", tempDir)
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", tableName)
.load()
val logicalPlan = df.queryExecution.logical
val logicalRelation = logicalPlan.asInstanceOf[LogicalRelation]
val sqlBaseRelation = logicalRelation.relation
I end up with something like this, all good so far:
```logicalPlan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Relation[memberId#97,residentialState#98,yearsEmployment#99,homeOwnership#100,annualIncome#101,incomeVerified#102,dtiRatio#103,lengthCreditHistory#104,numTotalCreditLines#105,numOpenCreditLines#106,numOpenCreditLines1Year#107,revolvingBalance#108,revolvingUtilizationRate#109,numDerogatoryRec#110,numDelinquency2Years#111,numChargeoff1year#112,numInquiries6Mon#113] SqlDWRelation("borrower")
logicalRelation: org.apache.spark.sql.execution.datasources.LogicalRelation = Relation[memberId#97,residentialState#98,yearsEmployment#99,homeOwnership#100,annualIncome#101,incomeVerified#102,dtiRatio#103,lengthCreditHistory#104,numTotalCreditLines#105,numOpenCreditLines#106,numOpenCreditLines1Year#107,revolvingBalance#108,revolvingUtilizationRate#109,numDerogatoryRec#110,numDelinquency2Years#111,numChargeoff1year#112,numInquiries6Mon#113] SqlDWRelation("borrower")
sqlBaseRelation: org.apache.spark.sql.sources.BaseRelation = SqlDWRelation("borrower")```
Schema I can easily get with `sqlBaseRelation.schema`, but I cannot figure out:
`import com.databricks.spark.sqldw.SqlDWRelation` is the relation, and it appears to have a few accessors that would help me answer some of these questions: `params` and `JDBCWrapper`.
Of course this is undocumented on the Databricks side.
If I could cast the BaseRelation into this SqlDWRelation, I'd be able to get this info. However, whenever I attempt to use the imported SqlDWRelation, I get an error object SqlDWRelation in package sqldw cannot be accessed in package com.databricks.spark.sqldw
I'm hoping someone has run into something similar in the past on the Spark / Databricks / Scala side and might share some advice. Thank you for any guidance!
*Thread Reply:* Have you tried reflection? https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/reflect/FieldUtils.html#getDeclaredField-jav[…].lang.String-boolean-
*Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!
*Thread Reply:* @Maciej Obuchowski we're not worthy! That was the magic we needed. It seems like a hack since we're snooping in on private classes, but if it works...
Thank you so much for pointing to those utilities!
*Thread Reply:* Glad I could help!
A colleague pointed me at https://open-metadata.org/. Is there a view or comparison anywhere of this and OpenLineage?
*Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of running jobs: it keeps track of all the metadata (schema, ...) of inputs and outputs at the time a transformation occurs, plus transformation metadata (code version, cost, etc.).
On OM I am not an expert, but it's a metadata model with clients and an API around it.
Hey! OpenLineage is a beautiful initiative, to be honest! We are also trying to adopt it. One question (maybe it's already described somewhere, in which case many apologies :)): if we need to propagate the run id from Airflow to a child task (an AWS Batch job, for instance), what is the best way to do it in the current implementation, given that we get the run id only at the post-execute phase? We use the Airflow 2+ integration.
*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:
*Thread Reply:* ```def lineage_parent_id(run_id, task):
    """
    Macro function which returns the generated job and run id for a given task. This
    can be used to forward the ids from a task to a child run so the job
    hierarchy is preserved. Child run can create ParentRunFacet from those ids.
    Invoke as a jinja template, e.g.

    PythonOperator(
        task_id='render_template',
        python_callable=my_task_function,
        op_args=['{{ lineage_parent_id(run_id, task) }}'],  # lineage_run_id macro invoked
        provide_context=False,
        dag=dag
    )

    :param run_id:
    :param task:
    :return:
    """
    with create_session() as session:
        job_name = openlineage_job_name(task.dag_id, task.task_id)
        ids = JobIdMapping.get(job_name, run_id, session)
        if ids is None:
            return ""
        elif isinstance(ids, list):
            run_id = "" if len(ids) == 0 else ids[0]
        else:
            run_id = str(ids)
        return f"{_DAG_NAMESPACE}/{job_name}/{run_id}"


def openlineage_job_name(dag_id: str, task_id: str) -> str:
    return f'{dag_id}.{task_id}'```
*Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
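For reference, a sketch of registering that macro yourself and forwarding the value to a child job (the module path, DAG, and task names are invented; lineage_parent_id is assumed to be your own copy of the function above):
```
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# your copy of lineage_parent_id, taken from openlineage/airflow/dag.py as shown above
from my_company.lineage_macros import lineage_parent_id


def start_batch_job(parent_run_id: str):
    # Hand parent_run_id to the child process (e.g. as an AWS Batch job parameter)
    # so it can attach a ParentRunFacet to its own OpenLineage events.
    ...


with DAG(
    dag_id="submit_batch_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    user_defined_macros={"lineage_parent_id": lineage_parent_id},
) as dag:
    submit = PythonOperator(
        task_id="submit",
        python_callable=start_batch_job,
        op_args=["{{ lineage_parent_id(run_id, task) }}"],
    )
```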
*Thread Reply:* the quickest response ever! And that works like a charm
*Thread Reply:* Glad I could help!
@Maciej Obuchowski and @Michael Collado, given your work on the Spark integration, what's the right way to explore the logical plans of write operations? When doing a read, it's easy: in Scala, df.queryExecution.logical gives you exactly what you need. But how do you interactively explore what sort of commands are being used during a write? We are exploring some of the DataSourceV2 data sources and are hoping to learn from you a bit more, please.
*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in the scala-shell are helpful:
spark.sql("EXPLAIN EXTENDED CREATE TABLE tbl USING delta LOCATION '/tmp/delta' AS SELECT * FROM tmp").show(false)
```|== Parsed Logical Plan ==
'CreateTableAsSelectStatement [tbl], delta, /tmp/delta, false, false
+- 'Project [*]
   +- 'UnresolvedRelation [tmp], [], false

== Analyzed Logical Plan ==
CreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, [provider=delta, location=/tmp/delta], false
+- Project [x#12, y#13]
   +- SubqueryAlias tmp
      +- LocalRelation [x#12, y#13]

== Optimized Logical Plan ==
CreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, [provider=delta, location=/tmp/delta], false
+- LocalRelation [x#12, y#13]

== Physical Plan ==
AtomicCreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, LocalRelation [x#12, y#13], [provider=delta, location=/tmp/delta, owner=mobuchowski], [], false
+- LocalTableScan [x#12, y#13]|```
*Thread Reply:* For the DataFrame API, I'm usually either logging the plan to the console from the OpenLineage listener, or looking at the spark_logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.
*Thread Reply:* For example, for the query I've send at comment above, the spark_logicalPlan facet looks like this:
"spark.logicalPlan": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.4.0-SNAPSHOT/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>",
"plan": [
{
"allowExisting": false,
"child": [
{
"class": "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
"data": null,
"isStreaming": false,
"num-children": 0,
"output": [
[
{
"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"dataType": "integer",
"exprId": {
"id": 2,
"jvmId": "e03e2860-a24b-41f5-addb-c35226173f7c",
"product-class": "org.apache.spark.sql.catalyst.expressions.ExprId"
},
"metadata": {},
"name": "x",
"nullable": false,
"num-children": 0,
"qualifier": []
}
],
[
{
"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"dataType": "integer",
"exprId": {
"id": 3,
"jvmId": "e03e2860-a24b-41f5-addb-c35226173f7c",
"product-class": "org.apache.spark.sql.catalyst.expressions.ExprId"
},
"metadata": {},
"name": "y",
"nullable": false,
"num-children": 0,
"qualifier": []
}
]
]
}
],
"class": "org.apache.spark.sql.execution.command.CreateViewCommand",
"name": {
"product-class": "org.apache.spark.sql.catalyst.TableIdentifier",
"table": "tmp"
},
"num-children": 0,
"properties": null,
"replace": true,
"userSpecifiedColumns": [],
"viewType": {
"object": "org.apache.spark.sql.catalyst.analysis.LocalTempView$"
}
}
]
},
*Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick, but I can definitely do logging. Our challenge was that our proprietary packages were resulting in NullPointerExceptions when the listener tried to push to OpenLineage.
*Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones
*Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!
hi everyone! Very noob question here: I've been wanting to play with Marquez and OpenLineage for my company's projects. I use mostly Scala & Spark, but also Airflow. I've been reading and watching talks about OpenLineage and Marquez. So far I didn't quite discover whether Marquez or OpenLineage does field-level lineage (with Spark), like Spline tries to.
Any idea?
Other sources about this topic:
• https://medium.com/cdapio/data-integration-with-field-level-lineage-5d9986524316
• https://medium.com/cdapio/field-level-lineage-part-1-3cc5c9e1d8c6
• https://medium.com/cdapio/designing-field-level-lineage-part-2-b6c7e6af5bf4
• https://www.youtube.com/playlist?list=PL897MHVe_nHeEQC8UnCfXecmZdF0vka_T
• https://www.youtube.com/watch?v=gKYGKXIBcZ0
• https://www.youtube.com/watch?v=eBep6rRh7ic
*Thread Reply:* Hi Ricardo - OpenLineage doesn't currently have support for field-level lineage, but it's definitely something we've been looking into. This is a great collection of resources!
To date we've been working on our integrations library, making it as easy to set up as possible.
*Thread Reply:* Thanks John! I was checking the issues on GitHub and other posts here; I just wanted to clarify that. I'll keep an eye on it.
The next OpenLineage monthly meeting is this Wednesday at 9am PT (everybody is welcome to join). The slides are here: https://docs.google.com/presentation/d/1q2Be7WTKlIhjLPgvH-eXAnf5p4w7To9v/edit#slide=id.ge4b57c6942_0_75
Tentative agenda:
• SPDX headers [Mandy Chessel]
• Azure Purview + OpenLineage [Will Johnson, Mark Taylor]
• Logging backend (OpenTelemetry, ...) [Julien Le Dem]
• Open discussion
Please chime in in this thread if you'd like to add something.
*Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
*Thread Reply:* Please reach out to me if you'd like to be added to a gcal invite
@John Thomas we at Condenast are currently exploring the features of OpenLineage to integrate with Databricks (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks), but the Spark configuration is not working.
*Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?
*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the Spark extra listener and placed the jars as well. When I ran the Spark job, lineage did not get tracked into Marquez.
*Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
OpenLineageHttpException(code=0, message=java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError
(although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}')
at [Source: UNKNOWN; line: -1, column: -1], details=java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError
(although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}')
at [Source: UNKNOWN; line: -1, column: -1])
at io.openlineage.spark.agent.OpenLineageContext.emit(OpenLineageContext.java:48)
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:122)
at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$3(OpenLineageSparkListener.java:159)
at java.util.Optional.ifPresent(Optional.java:159)
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:148)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
*Thread Reply:* Issue solved - I had specified the version wrongly as 1 instead of v1.
Hi everyone!
Hello everyone. We are exploring OpenLineage for capturing Spark lineage, but from the GitHub (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark) I see that the output is sent to an API (Marquez). How can I send it to a Kafka topic instead? Can somebody please guide me on this?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files
There's an ongoing PR for a proxy backend, which exposes an HTTP API and redirects events to Kafka.
*Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.
For now, you'll need to make something to capture the HTTP API events and send them to the Kafka topic. Changing the spark.openlineage.url parameter will send the runEvents wherever you like, but obviously you can't directly produce HTTP events to a topic.
*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.
*Thread Reply:* Not sure about the release plan, but the http endpoint is just regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.
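Until the proxy backend lands, one way to do what's described above is a small relay that accepts the OpenLineage HTTP calls and forwards the payload to Kafka. A rough sketch, assuming Flask and kafka-python, with made-up host and topic names:
```
from flask import Flask, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")


@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    # Forward the raw OpenLineage runEvent body to a Kafka topic unchanged.
    producer.send("openlineage-events", request.get_data())
    return "", 201


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```
Pointing spark.openlineage.url at this service then gets the Spark runEvents onto the topic.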
Hi, Open Lineage team - For the Spark Integration, I'm looking to extract information from a DataSourceV2 data source.
I'm working on the WRITE side of the data source and right now I'm touching the AppendData logical plan (I can't find the Java Doc): https://github.com/rdblue/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L446
I was able to extract out the table name (from the named relation) but I'm struggling getting out the schema next.
I noticed that AppendData offers inputSet, schema, and outputSet.
• inputSet gives me an AttributeSet which does contain the names of my columns (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala#L69)
• schema returns an empty StructType
• outputSet is an empty AttributeSet
I thought I read in the Spark Internals book that outputSet would only be populated if there was some sort of change to the DataFrame columns, but I cannot find that page, and searching for spark outputSet turns up few relevant results.
Has anyone else worked with the AppendData plan and gotten the schema out of it? Am I going down the wrong path with this snippet of code below? Thank you for any guidance!
if (logical instanceof AppendData) {
    AppendData appendOp = (AppendData) logical;
    NamedRelation namedRel = appendOp.table();
    log.info(namedRel.name()); // Works great!
    log.info(appendOp.inputSet().toString()); // This will get you a rough schema
    StructType schema = appendOp.schema(); // This is an empty StructType
    log.info(schema.json()); // Nothing useful here
}
*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head.
This one should be good: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L72
About schema: looking at AppendData's query, schema should work if there's no change to columns, because to pass analysis, the data being inserted has to match the table's schema. I would test that though.
On the other hand, the current AppendDataVisitor just looks at AppendData's table and tries to extract a dataset from it using a list of common output visitors:
In this case, the DataSourceV2RelationVisitor would look at it, provided we're using Spark 3:
*Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6
*Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos DB) doesn't provide that Provider property.
Is there any other method for determining the type of a DataSourceV2Relation?
*Thread Reply:* And, to close out on my original question, it was as simple as the code that Maciej was using:
I merely needed to use DataSourceV2Relation rather than NamedRelation!
DataSourceV2Relation relation = (DataSourceV2Relation) appendOp.table();
log.info(relation.schema().toString());
log.info(relation.name());
*Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[âŠ]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala
*Thread Reply:* I guess you can use object.getClass.getCanonicalName() to find out if the passed class matches the one that the Cosmos provider uses.
*Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try, but I'll also make a PR into that repo to get the provider property set up correctly.
*Thread Reply:* Glad to help đ
*Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?
*Thread Reply:* If any, of course đ
*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.
*Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450
*Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.
Finally got this doc posted - https://github.com/OpenLineage/OpenLineage/pull/437 (see the readable version here ) Looking for feedback, @Willy Lulciuc @Maciej Obuchowski @Will Johnson
*Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor.
Right now, OpenLineage checks based on the provider property if it's an Iceberg or Delta provider.
Ideally, we'd be able to extend the list of providers or have a custom "CosmosDbDataSourceV2Visitor" that knew how to work with a custom DataSourceV2.
Would that cause any conflicts if the base class is already accounted for in OpenLineage?
*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we're just returning the result of the first function for which isDefinedAt is satisfied.
This means that we can depend on the order of the visitors...
*Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.
Oh, and I forgot to post yesterday: OpenLineage 0.4.0 was released!
This was a big one.
• Split tests for Spark 2 and Spark 3
• Spark output metrics
• Databricks support with init scripts
• Initial Iceberg support for Spark
• Initial Kafka support for Spark
• dbt build support
• forward compatibility for dbt versions
• lots of bug fixes
Check the full changelog for details.
Hi @Michael Collado, is there any documentation on using Great Expectations with OpenLineage?
*Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started
*Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo
Hello! I am new to OpenLineage - awesome project!! Does anybody know about an integration with Deequ? Or a way to capture dataset stats with OpenLineage? Thanks! Appreciate the help!
*Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.
Check the great expectations integration to see how those facets are being used
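As a pointer for a Deequ-style integration, the input data quality facet from the spec above has roughly this shape; the values here are illustrative and the full field list is in the JSON schema:
```
# Illustrative only: the facet body you would attach to the relevant input
# dataset of a runEvent, per DataQualityMetricsInputDatasetFacet.json above.
data_quality_metrics = {
    "_producer": "https://example.com/my-deequ-wrapper",
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json",
    "rowCount": 10000,
    "bytes": 5242880,
    "columnMetrics": {
        "order_id": {"nullCount": 0, "distinctCount": 10000},
        "amount": {"min": 0.0, "max": 12000.5, "nullCount": 3},
    },
}
```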
*Thread Reply:* This is great. Thanks @Michael Collado!
Hi,
I am testing OpenLineage/Marquez 0.4.0 with dbt 1.0.0 using dbt-ol build.
It seems 12 events were generated, but the UI shows only the history of runs, with "Nothing to show here" in the detail section about dataset/test failures in the dbt namespace.
The warehouse namespace shows lineage but no details about dataset/test failures.
Please advise.
02:57:54 Done. PASS=4 WARN=0 ERROR=3 SKIP=2 TOTAL=9 02:57:54 Error sending message, disabling tracking Emitting OpenLineage events: 100%|██████████| 12/12 [00:00<00:00, 12.50it/s]
*Thread Reply:* The "nothing to show here" is when you click on the test node, right? What about the run node?
*Thread Reply:* There are no details about the failure.
```
dbt-ol build -t DEV --profile cdp --profiles-dir /c/Work/dbt/cdp100/profiles --project-dir /c/Work/dbt/cdp100 --select +riskrawmastersharedshareclass
Running OpenLineage dbt wrapper version 0.4.0
This wrapper will send OpenLineage events at the end of dbt execution.
02:57:21 Running with dbt=1.0.0
02:57:23 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources. There are 1 unused configuration paths:
02:57:23 Found 158 models, 181 tests, 0 snapshots, 0 analyses, 574 macros, 0 operations, 2 seed files, 56 sources, 1 exposure, 0 metrics
02:57:23
02:57:35 Concurrency: 10 threads (target='DEV')
02:57:35
02:57:35 1 of 9 START test dbtexpectationssourceexpectcompoundcolumnstobeuniquebsesharedpbshareclassEDMPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [RUN]
02:57:37 1 of 9 PASS dbtexpectationssourceexpectcompoundcolumnstobeuniquebsesharedpbshareclassEDMPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [PASS in 2.67s]
02:57:37 2 of 9 START view model REPL.SHARECLASSDIM.................................... [RUN]
02:57:39 2 of 9 OK created view model REPL.SHARECLASSDIM............................... [SUCCESS 1 in 2.12s]
02:57:39 3 of 9 START test dbtexpectationsexpectcompoundcolumnstobeuniquerawreplpbsharedshareclassRISKPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [RUN]
02:57:43 3 of 9 PASS dbtexpectationsexpectcompoundcolumnstobeuniquerawreplpbsharedshareclassRISKPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [PASS in 3.42s]
02:57:43 4 of 9 START view model RAWRISKDEV.STG.SHARECLASSDIM........................ [RUN]
02:57:46 4 of 9 OK created view model RAWRISKDEV.STG.SHARECLASSDIM................... [SUCCESS 1 in 3.44s]
02:57:46 5 of 9 START view model RAWRISKDEV.MASTER.SHARECLASSDIM..................... [RUN]
02:57:46 6 of 9 START test relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawstgsharedsecurity_ [RUN]
02:57:46 7 of 9 START test relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio_ [RUN]
02:57:51 5 of 9 ERROR creating view model RAWRISKDEV.MASTER.SHARECLASSDIM............ [ERROR in 4.31s]
02:57:51 8 of 9 SKIP test relationshipsriskrawmastersharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawmastersharedsecurity_ [SKIP]
02:57:51 9 of 9 SKIP test relationshipsriskrawmastersharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawmastersharedportfolio_ [SKIP]
02:57:52 7 of 9 FAIL 7282 relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio_ [FAIL 7282 in 5.41s]
02:57:54 6 of 9 FAIL 6520 relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawstgsharedsecurity_ [FAIL 6520 in 7.23s]
02:57:54
02:57:54 Finished running 6 tests, 3 view models in 30.71s.
02:57:54
02:57:54 Completed with 3 errors and 0 warnings:
02:57:54
02:57:54 Database Error in model riskrawmastersharedshareclass (models/risk/raw/master/shared/riskrawmastersharedshareclass.sql)
02:57:54   002003 (42S02): SQL compilation error:
02:57:54   Object 'RAWRISKDEV.AUDIT.STGSHARECLASSDIMRELATIONSHIPRISKINSTRUMENTID' does not exist or not authorized.
02:57:54   compiled SQL at target/run/cdp/models/risk/raw/master/shared/riskrawmastersharedshareclass.sql
02:57:54
02:57:54 Failure in test relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio (models/risk/raw/stg/shared/riskrawstgsharedschema.yml)
02:57:54   Got 7282 results, configured to fail if != 0
02:57:54
02:57:54   compiled SQL at target/compiled/cdp/models/risk/raw/stg/shared/riskrawstgsharedschema.yml/relationshipsriskrawstgsha19e10fb324f7d0cccf2aab512683f693.sql
02:57:54
02:57:54 Failure in test relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTID_refriskrawstgsharedsecurity_ (models/risk/raw/stg/shared/riskrawstgsharedschema.yml)
02:57:54   Got 6520 results, configured to fail if != 0
02:57:54
02:57:54   compiled SQL at target/compiled/cdp/models/risk/raw/stg/shared/riskrawstgsharedschema.yml/relationshipsriskrawstgsha_e3148a1627817f17f7f5a9eb841ef16f.sql
02:57:54
02:57:54 See test failures:
  select * from RAWRISKDEV.AUDIT.STGSHARECLASSDIMrelationship_RISKINSTRUMENT_ID
02:57:54
02:57:54 Done. PASS=4 WARN=0 ERROR=3 SKIP=2 TOTAL=9
02:57:54 Error sending message, disabling tracking
Emitting OpenLineage events: 100%|████████████████████████████████████████| 12/12 [00:00<00:00, 12.50it/s]
Emitted 14 openlineage events
(dbt) linux@dblnbk152371:/c/Work/dbt/cdp$
```
*Thread Reply:* I'm talking about clicking on a non-test node in the Marquez UI - the screenshots shared show you clicked on the one ending in test
*Thread Reply:* There are two types of failures: tests failed on the stage model (relationships) and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but with no details of the failure. Master model tests were skipped because of the model failure, but the UI reports "Complete".
*Thread Reply:* If I understood correctly, for models you would like OpenLineage to capture the error message, like this one:
22:52:07 Database Error in model customers (models/customers.sql)
22:52:07 Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "PLEASE_REMOVE" at [56:12]
22:52:07 compiled SQL at target/run/jaffle_shop/models/customers.sql
And for dbt test failures, to better visualize that an error is happening, for example like this:
*Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt.
Created issue to add it to spec in a generic way: https://github.com/OpenLineage/OpenLineage/issues/446
*Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.
*Thread Reply:* Hey. If you're using Airflow 2, you should use the LineageBackend method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental
*Thread Reply:* You don't need to do anything with DAG import then.
*Thread Reply:* Thanks!!!!! i'll try
The PR at https://github.com/OpenLineage/OpenLineage/pull/451 should be everything needed to complete the implementation for https://github.com/OpenLineage/OpenLineage/pull/437 . The PR is in draft mode, as I still need ~1 day to update the integration test expectations to match the refactoring (there are some new events, but from my cursory look, the old events still match expected contents). But I think it's in a state that can be reviewed before the tests are updated.
There are two other PRs that this one is based on, broken up for easier reviewing:
• https://github.com/OpenLineage/OpenLineage/pull/447
• https://github.com/OpenLineage/OpenLineage/pull/448
*Thread Reply:* @Will Johnson @Maciej Obuchowski FYI
The next OpenLineage Technical Steering Committee meeting is Wednesday, January 12! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome.
Agenda:
• OpenLineage 0.4 and 0.5 releases
• Egeria version 3.4 support for OpenLineage
• Airflow TaskListener to simplify OpenLineage integration [Maciej]
• Open Discussion
Notes: https://tinyurl.com/openlineagetsc
Hello community,
We are able to post this datasource in marquez. But then the information about the facet with the datasource is not displayed in the UI.
We want to display the S3 location (URI) where this datasource is pointing to.
{
  id: {
    namespace: "s3://hbi-dns-staging",
    name: "PCHG"
  },
  type: "DB_TABLE",
  name: "PCHG",
  physicalName: "PCHG",
  createdAt: "2022-01-11T16:15:54.887Z",
  updatedAt: "2022-01-11T16:56:04.093153Z",
  namespace: "s3://hbi-dns-staging",
  sourceName: "s3://hbi-dns-staging",
  fields: [],
  tags: [],
  lastModifiedAt: null,
  description: null,
  currentVersion: "c565864d-1a66-4cff-a5d9-2e43175cbf88",
  facets: {
    dataSource: {
      uri: "s3://hbi-dns-staging/sql-runner/2022-01-11/PCHG.avro",
      name: "s3://hbi-dns-staging",
      _producer: "ip-172-25-23-163.dir.prod.aws.hollandandbarrett.comeu-west-1.com/172.25.23.163",
      _schemaURL: "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet"
    }
  }
}
As you can see, there is not much info in the OpenLineage UI.
The OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1641587111000700
any idea guys about the previous question?
*Thread Reply:* Just to be clear, were you able to get the datasource information from the API but it's just not showing up in the UI? Or weren't you able to get it from the API either?
Hi everyone!! I am doing a POC of OpenLineage with Airflow version 2.1; before that, I would like to know if this version is supported by OpenLineage.
*Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.
Hello there, I'm using Docker Airflow version 2.1.0. Below were the steps I performed, but I encountered an error, pls help:
In requirements.txt I added openlineage-airflow, then ran pip install -r requirements.txt.
export AIRFLOW__LINEAGE__BACKEND = openlineage.lineage_backend.OpenLineageBackend
export OPENLINEAGE_URL=http://marquez:5000
./docker/up.sh
& opened the web frontend UI and saw the below error msg:
*Thread Reply:* hey, I'm aware of one small bug (which will be fixed in the upcoming OpenLineage 0.5.0) which means you would also have to include google-cloud-bigquery in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438
*Thread Reply:* The other thing I think you should check is, did you def define the AIRFLOW__LINEAGE__BACKEND variable correctly? What you pasted above looks a little odd with the 2 = signs
*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like:
INFO - Constructing openlineage client to send events to
*Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data
Just published OpenLineage 0.5.0. Big items here are:
• dbt-spark support
• New proxy message broker for forwarding OpenLineage messages to Kafka
• New extensibility API for the Spark integration
Accompanying tweet thread on the latter two items here: https://twitter.com/PeladoCollado/status/1483607050953232385
*Thread Reply:* BTW, this was actually the 0.5.1 release. Because, pypi...
*Thread Reply:* nice on the dbt-spark support
HELLO everyone. I've been reading and watching talks about OpenLineage and Marquez. This solution is exactly what we've been looking for to trace lineage in our ETLs. GREAT WORK. Our ETLs are based on Postgres, Redshift, and Airflow. SO
I tried to implement the example, respecting all the steps required. Everything runs successfully (the two DAGs on Airflow) on host http://localhost:3000/ but nothing appeared in the Marquez UI. Am I missing something?
I'm thinking about creating a simple ETL, pandas to pandas with some transformation, to have a POC to show to my team. I REALLY NEED SOME HELP
*Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?
We've just found yesterday that it somehow breaks our example...
*Thread Reply:* yes i just installed docker on mac
*Thread Reply:* and docker compose version 1.29.2
*Thread Reply:* What you can do is to uncheck this, do docker system prune -a and try again.
*Thread Reply:* done but i get this : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
*Thread Reply:* Try to restart docker for mac
*Thread Reply:* yeah done . I will try to implement the example again and see thank you very much
*Thread Reply:* i don't know why i'm getting this when i run $ docker-compose up:
WARNING: The TAG variable is not set. Defaulting to a blank string.
WARNING: The API_PORT variable is not set. Defaulting to a blank string.
WARNING: The API_ADMIN_PORT variable is not set. Defaulting to a blank string.
WARNING: The WEB_PORT variable is not set. Defaulting to a blank string.
ERROR: The Compose file './../docker-compose.yml' is invalid because:
services.api.ports contains an invalid type, it should be a number, or an object
services.api.ports contains an invalid type, it should be a number, or an object
services.web.ports contains an invalid type, it should be a number, or an object
services.api.ports value [':', ':'] has non-unique elements
*Thread Reply:* are you running it exactly like here, with respect to directories, etc?
https://github.com/MarquezProject/marquez/tree/main/examples/airflow
*Thread Reply:* yeah yeah my bad. everything works fine now. I see the graph in the ui
*Thread Reply:* one more question plz. As i said, our ETLs are based on Postgres, Redshift, and Airflow. Any advice you have for us to integrate OL into our pipeline?
I'm upgrading our OL Java client from an older version (0.2.3) and noticed that the ol.newCustomFacetBuilder() method to create custom facets no longer exists. I can see in this code diff that it might be replaced by simply adding to the additional properties of the standard element you are extending.
Can you please let me know if I'm understanding this change correctly? In other words, is the code in the diff functionally equivalent or is there a large change I should be understanding better?
*Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?
*Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I'll refactor the code to the new format and should be good to go.
*Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.
I'm not sure what went wrong. With Airflow docker, version 2.1.0, below were the steps I performed, but the Marquez UI is showing no jobs, pls help:
In requirements.txt I added openlineage-airflow==0.5.1, then ran pip install -r requirements.txt.
export AIRFLOW__LINEAGE__BACKEND = openlineage.lineage_backend.OpenLineageBackend
export OPENLINEAGE_URL=http://localhost:5000
./docker/up.sh (which is in another folder)
The front end UI is not showing any job, it's empty:
*Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I'd expect to see something about not having an extractor for the operator you are using.
*Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND, is it properly set? Possible the env isn't making it all the way there.
Hi All,
I am working on a POC of the OpenLineage-Airflow integration and was attempting to get it configured with Amundsen (also working on a POC). Reading through the tutorial here https://openlineage.io/integration/apache-airflow/, under the Prerequisites section it says:
To use the OpenLineage Airflow integration, you'll need a running Airflow instance. You'll also need an OpenLineage compatible HTTP backend.
The example uses Marquez, but I was trying to figure out how to get it to send metadata to the Amundsen graph db backend. Does the Airflow integration only support configuration with an HTTP compatible backend?
*Thread Reply:* Hi Lena! That's correct, OpenLineage is designed to send events to an HTTP backend. There's a ticket in the future section of the roadmap to support pushing to Amundsen, but it's not yet been worked on (Ref: Roadmap Issue #86)
hi, i am completely new to openlineage and marquez. i have to integrate openlineage into my existing java project but i am completely confused about where to start. i have gone through the documentation and all, but i am not able to understand how to integrate openlineage using the marquez http backend in my existing project. please someone help me. I may sound naive here but i am in dire need of help.
*Thread Reply:* what do you mean by "integrate OpenLineage"?
Can you give a little more information on what you're trying to accomplish and what the existing project is?
*Thread Reply:* I work in a datalake team and we are trying to implement data lineage property in our project using openlineage. our project basically keeps track of datasets coming from different sources(hive, redshift, elasticsearch etc.) and jobs.
*Thread Reply:* Gotcha!
Broadly speaking, all an integration needs to do is to send runEvents to Marquez.
I'd start by understanding the OpenLineage data model, and then looking at your system to identify when / where runEvents should be sent from, and what information needs to be included.
*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.
*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, Spark, etc), but at its core OpenLineage is a standard and protocol
The next OpenLineage Technical Steering Committee meeting is Wednesday, February 9. Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome. Agenda items are always welcome, as well. Reply in thread with yours.
Current agenda:
• OpenLineage 0.5.1 release
• Apache Flink effort
• Dagster integration
• Open Discussion
Notes: https://tinyurl.com/openlineagetsc
Hi everybody! Very cool initiative, thank you! Is there any traction on Apache Atlas integration? Is there some way to help you there?
*Thread Reply:* Hey Albert! There aren't yet any issues or proposals around Apache Atlas, but that's definitely something you can help with!
I'm not super familiar with Atlas - were you thinking in terms of enabling Atlas to receive runEvents from OpenLineage connectors?
*Thread Reply:* Hi John! Yes, exactly, it'd be nice to see Atlas as a receiver side of the OpenLineage events. Is there some guideline on how to implement it? I guess we need an OpenLineage-compatible server implementation so we could receive events and send them to Atlas, right?
*Thread Reply:* exactly - This would be a change on the Atlas side. I'd start by opening an issue in the Atlas repo about making an API endpoint that can receive OpenLineage events. Marquez is our reference implementation of OpenLineage, so I'd look around in that repo to see how it's been implemented :)
*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550 If it'd not get any traction, we at New Work might contribute as well
*Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end
*Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097
*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send json to a kafka topic and let Atlas read that json without any modification? A new model for spark would need to be created, other than https://github.com/apache/atlas/blob/release-2.1.0-rc3/addons/models/1000-Hadoop/1100-spark_model.json, and uploaded to Atlas (which could be done with a call to the Atlas API). Does it make sense?
*Thread Reply:* @Juan Carlos FernĂĄndez RodrĂguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!
*Thread Reply:* sorry for the ignorance, but what is the purpose of the bridge? The communication with Atlas should be done through Kafka, and those messages can be sent by the proxy. What am I missing?
*Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that
*Thread Reply:* If OpenLineage sends an event to Kafka, I think we can use Kafka Streams or Kafka Connect to transform the message into an Atlas event.
*Thread Reply:* @John Thomas Our company used to use Atlas as a metadata service. I just came to know about this project. Now that I have learned how OpenLineage works, I think I can create an issue to describe my design first.
*Thread Reply:* @Juan Carlos FernĂĄndez RodrĂguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail ?
*Thread Reply:* Hi @xiang chen, we are discussing internally in my company whether to write to Atlas or another alternative. If we do this, we will share and could involve you in some way.
Who here is working with OpenLineage at Dagster or Flink? We would love to hear about your work at the next monthly TSC meeting.
Hi everyone, OpenLineage is wonderful, we really needed something like this! Has anyone else used it with Databricks, Delta tables or Spark? If someone is interested into these technologies we can work together to get a POC and share some thoughts. Thanks and have a nice weekend! :)
*Thread Reply:* Hi Luca, I agree this looks really promising. I'm working on getting it to run on Databricks, but I'm only just starting out
Friendly reminder: this month's OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1643849713216459
Hi people, One question regarding error reporting - what is the mechanism for that? E.g. if I send duplicated job to Openlineage, is there a way to notify me about that?
*Thread Reply:* By duplicated, you mean with the same runId?
*Thread Reply:* It's only one example, could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that
Reducing the Logging of Spark Integration
Hey, OpenLineage community! I'm curious if there are any quick tricks / fixes to reduce the amount of logging happening in the OpenLineage Spark Integration. Each job seems to print out the Logical Plan with INFO level logging. The default behavior of Databricks is to print out INFO level logs and so it gets pretty cluttered and noisy.
I'm hoping there's a feature flag that would help me shut off those kinds of logs in OpenLineage's Spark integration.
*Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[âŠ]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java
*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info to know that the event completed successfully, but that response and event are very verbose.
I was also thinking about here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L337-L340
These spots are where it's printing out the full logical plan for some reason.
Can I just open up a PR and switch these to log.debug instead?
*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.
[SPARK][INTEGRATION] Need Brainstorming Ideas - How to Persist / Access Spark Configs in JobEnd
Hey, OL community! I'm working on PR#490 and I finally have all tests passing, but now my desired behavior - displaying environment properties during COMPLETE / JobEnd events - is not happening.
The previous approach stored the spark properties in the OpenLineageContext with a properties attribute but that was part of all of the test failures I believe.
What are some other ways to store the jobStart's properties and make them accessible to the corresponding jobEnd? Hopefully it's okay to tag @Maciej Obuchowski, @Michael Collado, and @Paweł Leszczyński who have been extremely helpful in the past and brought great ideas to the table.
*Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run
*Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid. If the env facet is present on any event, it will be returned by the API
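To illustrate the additive model described above, here is a minimal sketch using the Python client classes (the job name, runId, and facet payload are made up, and the facet value is shown as a plain dict for brevity): both events share a runId, only the START event carries the run facet, and the backend unions facets per run.
```python
from openlineage.client.run import Job, Run, RunEvent, RunState

job = Job(namespace="spark", name="my_databricks_job")    # hypothetical job
producer = "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark"
run_id = "3f1a6c52-1aa0-4aeb-9c28-5f2806b0b6a4"           # same runId on both events

# START event: attach the facet where the information is known (at job start).
start = RunEvent(
    RunState.START,
    "2022-02-18T12:00:00Z",
    Run(runId=run_id, facets={"environment-properties": {"clusterName": "demo"}}),
    job,
    producer,
)

# COMPLETE event: nothing repeated here; the backend merges facets by runId,
# so the environment properties from the START event are still returned for this run.
complete = RunEvent(
    RunState.COMPLETE,
    "2022-02-18T12:05:00Z",
    Run(runId=run_id),
    job,
    producer,
)
```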
*Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.
So, we do need to maintain some state. That makes total sense, Mike.
How does Marquez handle failed jobs currently? Based on this issue (https://github.com/OpenLineage/OpenLineage/issues/436) I think Marquez would show a START but no COMPLETE event, right?
*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here).
Re: the failure events, I think job failures will currently result in one FAIL event and one COMPLETE event. The SparkListenerJobEnd event will trigger a FAIL event, but the SparkListenerSQLExecutionEnd event will trigger the COMPLETE event.
*Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!
[SPARK] Connecting SparkListenerSQLExecutionStart to the various SparkListenerJobStarts
TL;DR: How can I connect the SparkListenerSQLExecutionStart to the SparkListenerJobStart events coming out of OpenLineage? The events appear to have two separate run ids and no link to indicate that the ExecutionStart event owns the subsequent JobStart events.
More Context:
Recently, I implemented a connector for Azure Synapse (the data warehouse on the Microsoft cloud) for the Spark integration, and now with https://github.com/OpenLineage/OpenLineage/pull/490 I realize that the SparkListenerSQLExecutionStart event carries with it the necessary inputs and outputs to tell the "real" lineage. The way the Synapse connector in Databricks works is:
• SparkListenerSQLExecutionStart fires off an event with the end-to-end input and output (e.g. S3 as input and a SQL table as output)
• SparkListenerJobStart events fire off that move content from one S3 location to a "staging" location controlled by Azure Synapse. OpenLineage records this event with the S3 INPUT and the output is a WASB "tempfolder" (which is a temporary location and not really useful for lineage since it will be destroyed at the end of the job)
• The final operation actually happens ALL in Synapse and OpenLineage does not fire off an event, it seems. The Synapse database has a "COPY" command which moves the data from "tempfolder" into the database.
• Finally a SparkListenerSQLExecutionEnd event happens and the query is complete.
Ideally, I could connect the SQLExecutionStart or SQLExecutionEnd with the SparkListenerJobStart so that I can get the JobStart properties. I see that ExecutionStart has an execution id, and JobStart should have the same execution id, BUT I think by the time I reach the ExecutionEnd, all the JobStart events would have been removed from the HashMap that contains all of the events in OpenLineage.
Any guidance on how to reach a JobStart properties from an ExecutionStart or ExecutionEnd would be greatly appreciated!
*Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?
I think that maybe you can check sparkContext.getLocalProperty("spark.sql.execution.id")
> I see that ExecutionStart has an execution id and JobStart should have the same Execution Id BUT I think by the time I reach the ExecutionEND, all the JobStart events would have been removed from the HashMap that contains all of the events in OpenLineage. But pairwise, those starts and ends should at least have the same runId as they were created with same OpenLineageContext, right?
Anyway, what @Michael Collado wrote on the issue is true: https://github.com/OpenLineage/OpenLineage/pull/490#issuecomment-1042011803 - you should not assume that we hold all the metadata somewhere in memory during whole execution of the run. The backend should be able to take care of it.
*Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not.
But that is the expectation? A SparkSQLExecutionStart and JobStart SHOULD have the same execution ID, right?
I will take a look at sparkContext.getLocalProperty. Thank you so much for the reply Maciej!
*Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta emit multiple, that aren't easily tied to SQL event.
*Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:
Because the Databricks Synapse connector somehow adds these additional JobStarts WITHOUT referencing the original SparkSQLExecutionStart execution ID, we have to rely on heuristics to connect the /tempfolder to the real downstream table that was already provided in the ExecutionStart event.
I've attached the logs and a screenshot of what I'm seeing in the Spark UI. If you had a chance to take a look, it's a bit verbose, but I'd appreciate a second pair of eyes on my analysis. Hopefully I got something wrong.
*Thread Reply:* I think we've encountered the same stuff in Delta before:
https://github.com/OpenLineage/OpenLineage/issues/388#issuecomment-964401860
*Thread Reply:* @Will Johnson, am I reading your report correctly that the SparkListenerJobStart event is reported with a spark.sql.execution.id that differs from the execution id of the SparkSQLExecutionStart?
*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|
*Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks
But you're exactly right, there's a SparkSQLExecutionStart event with execution id 8 and then a set of JobStart events all with execution id 9!
I don't know enough about Spark internals to say how you can just run arbitrary Scala code while making it look like a Spark job, but that's what it looks like. As if the SqlDwWriter somehow submits a new job without an ExecutionStart... maybe it's an RDD operation instead? This has given me another idea: add some more log.info statements to my jar.
One of our own will be talking OpenLineage, Airflow and Spark at the Subsurface Conference this week. Register to attend @Michael Collado's session on March 3rd at 11:45. You can register and learn more here: https://www.dremio.com/subsurface/live/winter2022/
*Thread Reply:* You won't want to miss this talk!
I have a question about DataHub integration through the OpenLineage standard. Is anyone working on it, or was it rather just an icon used in previous materials? We have built an OpenLineage API endpoint in our product and we were hoping OL will gain enough traction that it becomes the native way to connect to a variety of data discovery/observability tools, such as DataHub, Amundsen, etc.
Many thanks!
*Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one and to kick off the discussion around it.
If you mean sending information to DataHub, that should already be possible if users pass a datahub api endpoint to the OPENLINEAGE_ENDPOINT variable
*Thread Reply:* Hi, thanks for a reply! I meant to emit Openlineage JSON structure to datahub.
Could you please be more specific, possibly link an article on how to find the endpoint on the DataHub side? Many thanks!
*Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it
*Thread Reply:* I'd reach out to datahub, potentially?
*Thread Reply:* It has been discussed in the past but I don't think there is something yet. The Kafka transport PR that is in flight should facilitate this
*Thread Reply:* Thanks for the response! Though dragging Kafka in just for the data delivery bit is too much. I think the clearest way would be to push DataHub to make an API endpoint and parser for the OL /lineage data structure.
I see this is more a political thing that would require a joint effort of the DataHub team and OpenLineage with a common goal.
The next OpenLineage Technical Steering Committee meeting is Wednesday, March 9! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome.
Agenda:
• New committers
• Release overview (0.6.0)
• New process for blog posts
• Retrospective: Spark integration
Notes: https://tinyurl.com/openlineagetsc
FYI, there's a talk on OpenLineage at Subsurface live tomorrow - https://www.dremio.com/subsurface/live/winter2022/session/cross-platform-data-lineage-with-openlineage/
@channel The latest release (0.6.0) of OpenLineage is now available, featuring a new Dagster integration, updates to the Airflow and Java integrations, a generic facet for env properties, bug fixes, and more. For more info, visit https://github.com/OpenLineage/OpenLineage/releases/tag/0.6.0
Hello Guys,
Where do I find an example of building a custom extractor? We have several custom airflow operators that I need to integrate
*Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.
all the included extractors are here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
*Thread Reply:* Thanks. I can follow that to build my own. Also, I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems from this example that I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?
OPENLINEAGE_EXTRACTOR_<operator>=full.path.to.ExtractorClass
Also do I need anything else other than Marquez and openlineage_airflow?
*Thread Reply:* Yes, as long as the extractors are in the python path.
*Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.
*Thread Reply:* to make it work, I set this environment variable:
OPENLINEAGE_EXTRACTOR_HttpToBigQueryOperator=http_to_bigquery.HttpToBigQueryExtractor
*Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218
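For anyone else searching for a starting point, a bare-bones extractor along the lines discussed above might look roughly like this (the operator, module, and class names are made up; it reports only the job name and no datasets):
```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operator class names this extractor should handle
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # Return whatever lineage metadata can be derived from self.operator;
        # inputs/outputs would be lists of OpenLineage datasets (left empty here).
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],
            outputs=[],
        )
```
It would then be registered with something like OPENLINEAGE_EXTRACTOR_MyOperator=my_module.MyOperatorExtractor, with the module importable on the PYTHONPATH.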
@channel At the next OpenLineage TSC meeting, we'll be reminiscing about the Spark integration. If you've had a hand in OL support for Spark, please join and share! The meeting will start at 9 am PT on Wednesday this week. @Maciej Obuchowski @Oleksandr Dvornik @Willy Lulciuc @Michael Collado https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
Would Marquez create some lineage for operators that don't have a custom extractor built yet?
*Thread Reply:* You would see that job was run - but we couldn't extract dataset lineage from it.
*Thread Reply:* The good news is that we're working to solve this problem in general.
*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.
*Thread Reply:* That depends how you deploy Airflow. Our tests use environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34
*Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.
This month's OpenLineage TSC community meeting is tomorrow at 9am PT! It's not too late to add an item to the agenda. Reply here or msg me with yours. https://openlineage.slack.com/archives/C01CK9T7HKR/p1646234698326859
I am running the last command to install marquez in AWS
helm upgrade --install marquez .
--set marquez.db.host <AWS-RDS-HOST>
--set marquez.db.user <AWS-RDS-USERNAME>
--set marquez.db.password <AWS-RDS-PASSWORD>
--namespace marquez
--atomic
--wait
And I am receiving this error
Error: query: failed to query with labels: secrets is forbidden: User "xxx@xxx.xx" cannot list resource "secrets" in API group "" in the namespace "default"
*Thread Reply:* Do you need to specify a namespace that is not « default »?
Can anyone let me know what is happening? My DI guy said it is a chart issue
*Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help
*Thread Reply:* Ok so I had to update a chart dependency
*Thread Reply:* Now I installed the service in amazon using this
helm install marquez . --dependency-update --set marquez.db.host=myhost --set marquez.db.user=myuser --set marquez.db.password=mypassword --namespace marquez --atomic --wait
*Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually
*Thread Reply:* @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?
*Thread Reply:* i got this
{"namespaces":[{"name":"default","createdAt":"2022_03_10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}
*Thread Reply:* i have to use the namespace marquez to redirect there
kubectl port-forward svc/marquez 5000:80 -n marquez
*Thread Reply:* is there something i need to change in a config file?
*Thread Reply:* also how would i change the "localhost" address to something that is accessible in amazon without the need to redirect?
*Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself
*Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I'll create an issue for that one so that we can fix it.
As to the EKS installation (or any non-local install), this is where you would need to use what's called an ingress controller to expose the services outside of the Kubernetes cluster. There are different flavors of these (NGINX is popular), and I believe that AWS EKS has some built-in capabilities that might help as well.
https://www.eksworkshop.com/beginner/130_exposing-service/ingress/
*Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It's not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.
However, if you are just seeking to explore Marquez and try things out, then I would highly recommend the "Open in Gitpod" functionality at https://github.com/MarquezProject/marquez#try-it. That will perform a full deployment for you in a temporary environment very quickly.
*Thread Reply:* Is there a better guide on how to install and setup Marquez in AWS? This guide is omitting many steps https://marquezproject.github.io/marquez/running-on-aws.html
We're trying to find the best way to track upstream releases of projects we have integrations for, to support newer versions faster and with fewer bugs. If you have any opinions on this topic, please chime in here
@Kevin Mellott Hello Kevin, I followed the tutorial you sent me and I have exposed my services. However, I am still seeing the same errors (this comes from the api/namespaces call)
{"namespaces":[{"name":"default","createdAt":"2022_03_10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}
Is there something i need to change in the chart? I do not have access to the default namespace in kubernetes, only the marquez namespace
@Marco Diaz that is actually a good response! This is the JSON returned back by the API to show some of the default Marquez data created by the install. Is there another error you are experiencing?
*Thread Reply:* I still see this https://files.slack.com/files-pri/T01CWUYP5AR-F036JKN77EW/image.png
*Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml
*Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.
*Thread Reply:* Or are you working through the local deploy right now?
*Thread Reply:* It shows the same using the exposed service
*Thread Reply:* i just didnt do another screenshot
*Thread Reply:* Could it be communication with the DB?
*Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.
*Thread Reply:* i see this error
Error occured while trying to proxy to: xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces
*Thread Reply:* it seems to be trying to use the same address to access the api endpoint
*Thread Reply:* however the api service is in a different endpoint
*Thread Reply:* The API resides here
Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com
*Thread Reply:* The web service resides here
xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com
*Thread Reply:* do they both need to be under the same LB?
*Thread Reply:* How would i do that if they install as separate services?
*Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.
Here is an example from one of the AWS repos - in the ingress resource you can see the single rule setup to point traffic to a given service.
*Thread Reply:* Thanks for the help. Now I know what the issue is
Hi everyone! Our company is looking to adopt a data lineage tool, so i have a few queries on OpenLineage: 1. Is this completely free?
*Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.
*Thread Reply:* It supports the databases listed here: https://openlineage.io/integration
And when i run ./docker/up.sh --seed I got the result from the java code (sample example). But how do I get the same thing in a python example?
*Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?
*Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/
*Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs
*Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python
*Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.
Is there any work in that direction, is the current client code considered mature and just needs re-packaging, or is it just a thought sketch and some serious work is needed?
I'm trying to avoid re-inventing the wheel, so if there's already something in motion, I'd rather support than start (badly) from scratch?
*Thread Reply:* What do you mean without pip-package?
*Thread Reply:* https://pypi.org/project/openlineage-python/
*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka https://github.com/OpenLineage/OpenLineage/pull/530
*Thread Reply:* My apologies Maciej! In my defense - looking for "open lineage" on pypi doesn't show this in the first 20 results. Still, should have checked setup.py. My bad, and thank you for the pointer!
*Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there
*Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl
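For anyone else experimenting from a notebook, here is a rough sketch of emitting a run with the Python client (assuming an openlineage-python release from around this time and a local Marquez on port 5000; the namespace, job, and dataset names are made up):
```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

producer = "my-experiments"                       # any URI identifying what produced the event
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my-job")

# A START followed by a COMPLETE with inputs/outputs is enough for Marquez
# to draw a job node connected to two dataset nodes.
client.emit(RunEvent(RunState.START, datetime.now(timezone.utc).isoformat(), run, job, producer))
client.emit(RunEvent(
    RunState.COMPLETE,
    datetime.now(timezone.utc).isoformat(),
    run,
    job,
    producer,
    inputs=[Dataset(namespace="postgres://mydb", name="public.source_table")],
    outputs=[Dataset(namespace="postgres://mydb", name="public.target_table")],
))
```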
@Julien Le Dem and @Willy Lulciuc will be at Data Council Austin next week talking OpenLineage and Airflow https://www.datacouncil.ai/talks/data-lineage-with-apache-airflow-using-openlineage?hsLang=en
I couldn't figure out, for the sample lineage flow (etldelivery7_days) that appears after running the seed command, which file it's fetching data from.
*Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java
*Thread Reply:* Got it, but if i change the code in this java file, let's say i add another job here satisfying the syntax, it's not appearing in the lineage flow
I created the database and host manually and passed the parameters using helm --set
Do the database services need to be exposed too through the ALB?
*Thread Reply:* I'm not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.
I know that EKS needs to have connectivity established to the Postgres database, even in the case of RDS, so that could be the culprit.
*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs
[HPM] Proxy created: /api/v1 -> http://localhost:5000/
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to http://localhost:5000/ (ECONNREFUSED) (https://nodejs.org/api/errors.html#errors_common_system_errors)
*Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.
marquez.hostname=marquez-interface-test.di.rbx.com
*Thread Reply:* assuming that is the DNS used by the website
*Thread Reply:* thanks, that did it. I have a question regarding the database
*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?
*Thread Reply:* Also could you put both the API and interface on the same port (3000)
*Thread Reply:* Seems I am still having the forwarding issue
[HPM] Proxy created: /api/v1 -> http://marquez-interface-test.di.rbx.com:5000/
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to http://marquez-interface-test.di.rbx.com:5000/ (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)
Guidance on How / When a Spark SQL Execution event Controls JobStart Events?
@Maciej Obuchowski and @Paweł Leszczyński and @Michael Collado I'd really appreciate your thoughts on how / when JobStart events are triggered for a given execution. I've run into two situations now where a SQLExecutionStart event fires with execution id X and then JobStart events fire with execution id Y.
• Spark 2 Delta SaveIntoDataSourceCommand on Databricks - I see it has a SparkSQLExecutionStart event, but only on Spark 3 does it have JobStart events with the SaveIntoDataSourceCommand and the same execution id.
• Databricks Synapse Connector - A SparkSQLExecutionStart event occurs, but then the JobStarts have different execution ids.
Is there any guidance / books / videos that dive deeper into how these events are triggered?
We need the JobStart event with the same execution id so that we can get some environment properties stored in the job start event.
Thank you so much for any guidance!
*Thread Reply:* It's always Delta, isn't it?
When I originally worked on Delta support I tried to find answer on Delta slack and got an answer:
Hi Maciej, the main reason is that Delta will run queries on metadata to figure out what files should be read for a particular version of a Delta table and that's why you might see multiple jobs. In general Delta treats metadata as data and leverages Spark to handle them to make it scalable.
*Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.
*Thread Reply:* Argh!! It's always Databricks doing something.
Thanks, Maciej!
*Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!
*Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423
*Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?
Might be worth to just post here all the events + logical plans that are generated for particular job, as I've done in that issue
*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 3
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 4
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerJobStart - executionId: 4
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:47 WARN SparkSQLExecutionContext: SparkListenerJobEnd - executionId: 4
21/11/09 19:01:47 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:47 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 4
21/11/09 19:01:47 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:48 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 5
21/11/09 19:01:48 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:48 WARN SparkSQLExecutionContext: SparkListenerJobStart - executionId: 5
21/11/09 19:01:48 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerJobEnd - executionId: 5
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 5
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 3
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
*Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.
As far as we know, the SQLExecutionStart event does not have any way to get these properties :(
As a result, I do have to care about the subsequent JobStart events coming from a given ExecutionStart.
*Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.
https://github.com/OpenLineage/OpenLineage/issues/617
Hey. I'm working on replacing current SQL parser - on which we rely for Postgres, Snowflake, Great Expectations - and I'd appreciate your opinion.
*Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui
*Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.
Would datasets be created when I send data from airflow?
*Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run
*Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929
How is Datakin used in conjunction with Openlineage and Marquez?
*Thread Reply:* Hi Marco,
Datakin is a reporting tool built on the Marquez API, and therefore designed to take in Lineage using the OpenLineage specification.
Did you have a more specific question?
*Thread Reply:* No, that is it. Got it. So, i can install Datakin and still use openlineage and marquez?
*Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez
*Thread Reply:* Will I still be able to use facets for backfills?
*Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API
If I have marquez access via ALB ingress, what would i use, the MARQUEZ_URL variable or OPENLINEAGE_URL?
So, i don't need to modify my dags in Airflow 2 to use the library? Would this just allow me to start collecting data?
openlineage.lineage_backend.OpenLineageBackend
*Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+
*Thread Reply:* Also would a new namespace be created if i add the variable?
Hello! Are there any plans for openlineage to support dbt on trino?
*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the dbt-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file.
These things don't necessarily preclude you from using OpenLineage on trino, so it may work already.
*Thread Reply:* hey @John Thomas yep, tried to use dbt-ol run command but it seems trino is not supported, only bigquery, redshift and few others.
*Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.
We don't currently have plans for this, but a great first step would be opening an issue in the OpenLineage repo.
If you're interested in implementing the support yourself I'm also happy to connect you to people that can help you get started.
*Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas
I can see 2 articles using Spline with BMW and Capital One. Could OpenLineage be doing the same job as Spline here? What would the differences be? Are there any similar references for OpenLineage? I can see Northwestern Mutual but that article does not contain a lot of detail.
Could anyone help me with this custom extractor? I am not sure what I am doing wrong. I added the variable to Airflow 2, but I still see this in the logs:
[2022-03-31, 16:43:39 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=QueryOperator
Here is the code
```python
import logging
from typing import Optional, List

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import SqlJobFacet, ExternalQueryRunFacet
from openlineage.common.sql import SqlMeta, SqlParser


class QueryOperatorExtractor(BaseExtractor):

    def __init__(self, operator):
        super().__init__(operator)

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['QueryOperator']

    def extract(self) -> Optional[TaskMetadata]:
        # (1) Parse sql statement to obtain input / output tables.
        sql_meta: SqlMeta = SqlParser.parse(self.operator.hql)
        # in_tables / out_tables are lists, so convert each table to a dataset
        inputs = sql_meta.in_tables
        outputs = sql_meta.out_tables
        task_name = f"{self.operator.dag_id}.{self.operator.task_id}"
        run_facets = {}
        job_facets = {
            'hql': SqlJobFacet(self.operator.hql)
        }

        return TaskMetadata(
            name=task_name,
            inputs=[table.to_openlineage_dataset() for table in inputs],
            outputs=[table.to_openlineage_dataset() for table in outputs],
            run_facets=run_facets,
            job_facets=job_facets
        )
```
@Ross Turk Could you please take a look if you have a minute? I know you have built one extractor before
*Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?
*Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me
*Thread Reply:* It is in an EKS cluster
I checked and the variable is there
OPENLINEAGE_EXTRACTOR_QUERYOPERATOR=shared.plugins.ol_custom_extractors.QueryOperatorExtractor
*Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well
*Thread Reply:* I don't think it's even executing your extractor code. The error message traces back to here: https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a9b874b/integration/airflow/openlineage/lineage_backend/__init__.py#L77
*Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours
*Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.
*Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22. if that happens, but for some reason it fails to register the extractor, we can at least know that itâs importing
*Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct
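For example, a quick sanity check along those lines (a sketch building on Marco's extractor class above; the log message itself is arbitrary):
```python
import logging

from openlineage.airflow.extractors.base import BaseExtractor

logger = logging.getLogger(__name__)


class QueryOperatorExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)
        # If this line shows up in the scheduler/task logs, the module was imported
        # and the extractor was instantiated for the operator.
        logger.info("QueryOperatorExtractor loaded for task %s", operator.task_id)
```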
*Thread Reply:* @Marco Diaz can you try the env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?
*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercases. Is the name of the variable used to register the extractor?
*Thread Reply:* yes, it's case sensitive...
*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered
*Thread Reply:* Could there be a way not to make this case sensitive?
*Thread Reply:* yes - could you create an issue on the OpenLineage repository?
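*Thread Reply:* (For anyone who hits this later: a minimal sketch of the convention, reusing the module path from this thread as a placeholder. The key point is that the part after OPENLINEAGE_EXTRACTOR_ is matched case-sensitively against the operator class name.)
```
import os

# Sketch only: openlineage-airflow looks up custom extractors from env vars named
# OPENLINEAGE_EXTRACTOR_<OperatorClassName>. The class-name part is case sensitive,
# so it must match get_operator_classnames() exactly: "QueryOperator", not "QUERYOPERATOR".
os.environ["OPENLINEAGE_EXTRACTOR_QueryOperator"] = (
    "shared.plugins.ol_custom_extractors.QueryOperatorExtractor"
)
```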
I have another question. I have this query
```
INSERT OVERWRITE TABLE schema.daily_play_sessions_v2
PARTITION (ds = '2022-03-30')
SELECT
    platform_id,
    universe_id,
    pii_userid,
    NULL as session_id,
    NULL as session_start_ts,
    COUNT(1) AS session_cnt,
    SUM(
        UNIX_TIMESTAMP(stopped) - UNIX_TIMESTAMP(joined)
    ) AS time_spent_sec
FROM schema.fct_play_sessions_merged
WHERE ds = '2022-03-30'
  AND UNIX_TIMESTAMP(stopped) - UNIX_TIMESTAMP(joined) BETWEEN 0 AND 28800
GROUP BY
    platform_id,
    universe_id,
    pii_userid
```
And I am seeing the following inputs
[DbTableName(None,'schema','fct_play_sessions_merged','schema.fct_play_sessions_merged')]
But the outputs are empty
Shouldn't this be an output table
schema.daily_play_sessions_v2
*Thread Reply:* Yes, it should. This line is the likely culprit: https://github.com/OpenLineage/OpenLineage/blob/431251d25f03302991905df2dc24357823d9c9c3/integration/common/openlineage/common/sql/parser.py#L30
*Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
*Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.
*Thread Reply:* we have a better solution
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644
*Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!
*Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs
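*Thread Reply:* (A rough sketch of how the new parser could be exercised from Python once it ships; the openlineage_sql module name and the parse() signature below are assumptions based on the existing SqlParser interface, so treat this as illustrative only.)
```
# Sketch only: assumes the new Rust-based parser ships Python bindings exposing parse(),
# returning in_tables / out_tables like openlineage.common.sql.SqlParser does today.
from openlineage_sql import parse

sql = """
INSERT OVERWRITE TABLE schema.daily_play_sessions_v2
PARTITION (ds = '2022-03-30')
SELECT platform_id, universe_id
FROM schema.fct_play_sessions_merged
WHERE ds = '2022-03-30'
"""

meta = parse([sql])
print(meta.in_tables)   # expected: schema.fct_play_sessions_merged
print(meta.out_tables)  # expected: schema.daily_play_sessions_v2
```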
*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library?
*Thread Reply:* this PR isn't merged yet 🙂 so if you wanted to try this you'd have to build the python client from the sql/rust-parser-impl branch
*Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?
*Thread Reply:* Hard to say, it's currently in review. Let me pull some strings, see if I can get eyes on it.
*Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work
*Thread Reply:* after it's merged, we'll have to do an OpenLineage release as well - perhaps next week?
Hi everyone, I just started using OpenLineage to connect with dbt for my company. I work as a data engineer. After setting up the connection and running a test with dbt-ol run, it gives me this error. I have looked online but couldn't find the answer anywhere. Can somebody please help me with this? The error tells me that the correct version is dbt schema json version 2 instead of 3, and I don't know where to change the schema json version. Thank you everyone @channel
*Thread Reply:* Hm - what version of dbt are you using?
*Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0
*Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.
*Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.
*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.
*Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397
*Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?
*Thread Reply:* I wonder if you have somehow ended up with an older version of the integration
*Thread Reply:* is 0.1.0 the older version of openlineage?
*Thread Reply:* ❯ pip3 list | grep openlineage-dbt
openlineage-dbt 0.6.2
*Thread Reply:* the latest is 0.6.2 - that might be your issue
*Thread Reply:* How are you going about installing it?
*Thread Reply:* @Ross Turk I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"
*Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.
*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0
*Thread Reply:* But thanks for the version info. I reinstalled 0.6.2 by specifying the version explicitly.
@Ross Turk @Maciej Obuchowski FYI the sql parser also seems not to return any inputs or outputs for queries that have subqueries
Example
```
INSERT OVERWRITE TABLE mytable
PARTITION (ds = '2022-03-31')
SELECT
    *
FROM
    (SELECT * FROM table2) a
```
```
INSERT OVERWRITE TABLE mytable
PARTITION (ds = '2022-03-31')
SELECT
    *
FROM
    (SELECT * FROM table2
     UNION
     SELECT * FROM table3
     UNION ALL
     SELECT * FROM table4) a
```
*Thread Reply:* they'll work with new parser - added test for those
*Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!
*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?
*Thread Reply:* it should be week-two, unless anything comes up
*Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.
Also what would happen if someone uses a CTE in the SQL? Does the parser take those cases into consideration?
Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on Wednesday, April 13. Please reply here or ping me with your items!
*Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser
*Thread Reply:* Will the parser be released after the 13th?
*Thread Reply:* @Michael Robinson added an additional item to the agenda - the client transports feature that we'll have in the next release
*Thread Reply:* Thanks, Maciej
Hi Everyone,
I have come across OpenLineage at Data Council Austin, 2022 and am curious to try it out. I have reviewed the Getting Started section (https://openlineage.io/getting-started/) of the OpenLineage docs but couldn't find clear reference documentation for using the API.
• Are there any Swagger API docs or equivalent dedicated to the OpenLineage API? There are some reference docs for the Marquez API: https://marquezproject.github.io/marquez/openapi.html#tag/Lineage
• Secondly, are there any means to use OpenLineage independent of Marquez?
Any pointers would be appreciated.
*Thread Reply:* I had kind of the same question. I found https://marquezproject.github.io/marquez/openapi.html#tag/Lineage With some of the entries marked Deprecated, I am not sure how to proceed.
*Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?
*Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.
The GET methods for jobs/datasets/etc are still functional
*Thread Reply:* Hey John,
Thanks for sharing the OpenAPI docs. I was wondering whether there is any way to set up an OpenLineage API that will receive events without a consumer like Marquez, or is it essential to always pair with a consumer to receive the events?
*Thread Reply:* the OpenLineage integrations don't have any way to receive events, since they're designed to send events to other apps - what were you expecting OpenLineage to do?
Marquez is our reference implementation of an OpenLineage consumer, but Egeria also has a functional endpoint
*Thread Reply:* Hi @John Thomas, Would creation of Sources and Datasets have an equivalent in the OpenLineage specification? So far I only see the Inputs and Outputs in the Run Event spec.
*Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent
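*Thread Reply:* (To expand on the "independent of Marquez" question: the Python client just POSTs run events to whatever URL you point it at, so any service that accepts OpenLineage events can be the consumer. A minimal sketch, with placeholder URL, namespace and job names:)
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Placeholder endpoint: Marquez listens here by default, but any OpenLineage-compatible
# HTTP consumer (e.g. Egeria's endpoint) would work the same way.
client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my-namespace", name="my-job"),
    producer="https://example.com/my-producer",  # placeholder producer URI
)
client.emit(event)
```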
Hey Guys,
The BaseExtractor is working fine with operators that are derived from the Airflow BaseOperator. However, for operators derived from LivyOperator the BaseExtractor does not seem to work. Is there a fix for this? We use LivyOperator to run Spark jobs
*Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?
@Maciej Obuchowski might be more help here
*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc
*Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem
*Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.
*Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument Spark with OpenLineage. It doesn't seem that Airflow will know much about what's happening underneath here. Have you looked into openlineage-spark?
*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several custom Spark operators that use Livy
*Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/
*Thread Reply:* and the doc page here has a good overview: https://openlineage.io/integration/apache-spark/
*Thread Reply:* is this all we need to pass?
```
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage_spark:0.2.+" \
  --conf "spark.openlineage.host=http://<your_ol_endpoint>" \
  --conf "spark.openlineage.namespace=my_job_namespace" \
  --class com.mycompany.MySparkApp my_application.jar
```
*Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.
*Thread Reply:* Do we have to install the library on the spark side or the airflow side?
*Thread Reply:* The --packages argument tells Spark where to get the jar (you'll want to upgrade to 0.6.1)
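*Thread Reply:* (For reference, roughly the same configuration can be set from PySpark when you build the session; a sketch only, with placeholder host/namespace and the artifact pinned to a recent release rather than 0.2.+:)
```
from pyspark.sql import SparkSession

# Sketch: wiring the OpenLineage listener from PySpark instead of spark-submit flags.
spark = (
    SparkSession.builder.appName("openlineage_example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")   # placeholder endpoint
    .config("spark.openlineage.namespace", "my_job_namespace")   # namespace is up to you
    .getOrCreate()
)
```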
Hi, I saw there was some work done for integrating OpenLineage with Azure Purview
*Thread Reply:* @Will Johnson
*Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!
The next OpenLineage Technical Steering Committee meeting is Wednesday, April 13! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://astronomer.zoom.us/j/87156607114?pwd=a3B0K210dnRaQmdkaFdGMytBREZEQT09 All are welcome. Agenda:
• OpenLineage 0.6.2 release overview
• Airflow integration update
• Dagster integration retrospective
• Open discussion
Notes: https://tinyurl.com/openlineagetsc
*Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?
*Thread Reply:* yes Marco
*Thread Reply:* can you open marquez on <http://localhost:3000> and get a response from <http://localhost:5000/api/v1/namespaces>?
*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to Marquez correctly
*Thread Reply:* In theory you should receive events as jobs under the airflow namespace
*Thread Reply:* It looks like you need to add a payment method to your DBT account
Hello. Does Airflow's TaskFlow API work with OpenLineage?
*Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.
*Thread Reply:* Thanks for the quick reply Maciej.
Hi all, I watched a few of your demos with Airflow (Astronomer) recently, really liked them. Thanks for doing those
Questions:
*Thread Reply:* Hi Sandeep,
1&3: We don't currently have Hive or Presto on the roadmap! The best way to start the conversation around them would be to create a proposal in the OpenLineage repo, outlining your thoughts on implementation and benefits.
2: I'm not familiar enough with HiveQL, but you can read about the new SQL parser we're implementing here
you can see the Standard Facets here - Dataset Version is included out of the box, but Run Version would have to be defined.
the best place to start looking into making facets is the Spec doc here. We don't have a dedicated tutorial, but if you have more specific questions please feel free to reach out again on slack
*Thread Reply:* Thank you John. The standard facets link currently points to the GitHub issues
*Thread Reply:* ah, here - https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
Reminder: this monthâs OpenLineage TSC meeting is tomorrow, 4/13, at 9 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1649271939878419
I set up the OpenLineage Spark integration for Spark (Dataproc) tasks from Airflow. It's able to post data to the Marquez endpoint and I see the job information in the Marquez UI.
I don't see any dataset information in it, I just see the jobs. Is there some setup I need to do or something else I need to configure?
*Thread Reply:* is there anything in your marquez-api logs that might indicate issues?
What guide did you follow to setup the spark integration?
*Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach
*Thread Reply:* The logs from the Dataproc side show no errors, let me check from the Marquez API side. To confirm, we should be able to see the datasets in the Marquez UI with the Spark integration, right?
*Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here
*Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets
*Thread Reply:* Are you looking at the same namespace?
*Thread Reply:* Yes, the same one where I can see the job
*Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here
*Thread Reply:* Don't see any failures in the logs, any suggestions on how to debug this?
*Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically
*Thread Reply:* ok, that sounds good, will try that
*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the api https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post We don't seem to add a DataSet in this; does Marquez internally create this "dataset" based on Output and fields?
*Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from
*Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however
*Thread Reply:* By runEvents, do you mean a job Object or lineage Object ? The integration seems to be only POSTing lineage objects
*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:
https://openlineage.io/docs/openapi/
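*Thread Reply:* (To make that concrete, a sketch of a runEvent that would produce datasets in Marquez; everything below - endpoint, namespaces, names - is a placeholder:)
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint

# The inputs/outputs on the run event are what Marquez turns into datasets.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="spark_integration", name="my_spark_app.csv_to_parquet"),
    producer="https://example.com/my-producer",  # placeholder
    inputs=[Dataset(namespace="gs://my-bucket", name="raw/events.csv")],
    outputs=[Dataset(namespace="gs://my-bucket", name="curated/events")],
)
client.emit(event)
```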
*Thread Reply:* > Yes, the same one where I can see the job
I think you should look at the other namespaces, whose names depend on what systems you're actually using
*Thread Reply:* Shouldn't the dataset be created in the same namespace we define in the Spark properties?
*Thread Reply:* I found a few datasets in the table location. I ran it in a similar setup (Hive metastore, GCS, Spark SQL and Scala Spark jobs) to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519
Is this the correct place for this question or should I reach out to the Marquez slack? I followed this post https://openlineage.io/integration/apache-spark/
Before I create an issue around it, maybe I'm just not seeing it in Databricks. In the Spark Integration, does OpenLineage report Hive Metastore tables or it ONLY reports the file path?
For example, if I have a Hive table called default.myTable stored at LOCATION /usr/hive/warehouse/default/mytable.
For a query that reads a CSV file and inserts into default.myTable, would I see an output of default.myTable or /usr/hive/warehouse/default/mytable?
We want to include a link between the physical path and the hive metastore table but it seems that OpenLineage (at least on Databricks) only reports the physical path with the table name showing up in the catalog but not as a facet.
*Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. Looking forward to understanding the expected behavior
*Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435
*Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!
*Thread Reply:* Is there a timeline around when we can expect this fix ?
*Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.
*Thread Reply:* I see, thanks for the update ! We are very much interested in this feature.
@channel A significant number of us have a conflict with the current TSC meeting day/time, so, unfortunately, we need to reschedule the meeting. When you have a moment, please share your availability here: https://doodle.com/meeting/participate/id/ejRnMlPe. Thanks in advance for your input!
*Thread Reply:* You probably need to change dataset from default
*Thread Reply:* I checked everything 🙂 I manually (by connecting to the pod and sending curl to the local Marquez endpoint) created a namespace to check if there was a network issue, and it was ok. I created a namespace called: data-dev. Airflow is deployed on k8s using the Helm chart.
```
config:
  AIRFLOW__WEBSERVER__BASE_URL: "http://airflow.dev.test.io"
  PYTHONPATH: "/opt/airflow/dags/repo/config"
  AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
  AIRFLOW__CORE__PLUGINS_FOLDER: "/opt/airflow/dags/repo/plugins"
  AIRFLOW__LINEAGE__BACKEND: "openlineage.lineage_backend.OpenLineageBackend"
  . . . .
extraEnv:
  - name: OPENLINEAGE_URL
    value: http://marquez-dev.data-dev.svc.cluster.local
  - name: OPENLINEAGE_NAMESPACE
    value: data-dev
```
*Thread Reply:* I think the answer is somewhere in the Airflow logs 🙂 For some reason, OpenLineage events aren't being sent to Marquez.
One really novice question - there doesn't seem to be a way of deleting lineage elements (any of them)? While I can imagine that in production system we want to keep history, it's not practical while testing/developing. I'm using throw-away namespaces to step around the issue. Is there a better way, or alternatively - did I miss an API somewhere?
*Thread Reply:* That's more of a Marquez question đ We have a long-standing issue to add that API https://github.com/MarquezProject/marquez/issues/1736
*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, the API is not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; for now I just need to showcase it and get interest in my org.
Good day all. I'm trying out the openlineage-dagster plugin
• I've got dagit, dagster-daemon and marquez running locally
• The openlineage_sensor is recognized in dagit and the daemon.
But, when I run a job, I see the following message in the daemon's shell:
Sensor openlineage_sensor skipped: Last cursor: {"last_storage_id": 9, "running_pipelines": {"97e2efdf-9499-4ffd-8528-d7fea5b9362c": {"running_steps": {}, "repository_name": "hello_cereal_repository"}}}
I've attached my repos.py and serialjob.py.
Any thoughts?
Hi All, I am walking through the curl examples on this page and have a question on the first curl example: https://openlineage.io/getting-started/ The curl command completes, and I can see the input file and job in the namespace, but the lineage graph does not show the input file connected as an input to the job. This only seems to happen after the job is marked complete.
Is there a way to have a running job show connections to its input files in the lineage? Thanks!
Hi Team, we are using Spark as a service and we are planning to integrate the OpenLineage Spark listener. Looking at the params below that we need to pass, we don't know the name of the Spark cluster - is the spark.openlineage.namespace conf param mandatory?
```
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage_spark:0.2.+" \
  --conf "spark.openlineage.host=http://<your_ol_endpoint>" \
  --conf "spark.openlineage.namespace=my_job_namespace" \
  --class com.mycompany.MySparkApp my_application.jar
```
*Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.
*Thread Reply:* And I definitely recommend using a newer version than 0.2.+
👍
*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the spark cluster
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR
*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" as part of the Spark jar file, i.e. as part of the pom.xml?
*Thread Reply:* I think it needs to run on the driver
Hello,
when looking through the Marquez API it seems that most individual-element creation APIs are marked as deprecated and are going to be removed by 0.25, with the point being to switch to OpenLineage. That makes POST to /api/v1/lineage the only creation point for elements, but the OpenLineage API is very limited in the attributes that can be passed.
Is that intended to stay that way? One practical question/example: how do we create a job of type STREAMING, when the OL API only allows passing name, namespace and facets? Do we now move all properties into facets?
*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed.
Can you specify where you think it's limited? The way to solve those problems would be to evolve OpenLineage.
> One practical question/example: how do we create a job of type STREAMING
So, here I think the question is more how streaming jobs differ from batch jobs. One obvious difference is that the output of the job is continuous (in practice, probably "microbatched" or committed on checkpoint). However, the deprecated Marquez API didn't give us tools to properly indicate that. On the contrary, OpenLineage with different event types allows us to properly do that.
> Do we now move all properties into facets?
Basically, yes. Marquez should handle specific facets. For example, https://github.com/MarquezProject/marquez/pull/1847
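*Thread Reply:* (To illustrate the facet mechanism with the STREAMING example: a sketch only - the facet class and the jobType field below are hypothetical, not part of the spec - of how a producer could carry the old Marquez "type" attribute on the job:)
```
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class JobTypeJobFacet(BaseFacet):
    """Hypothetical custom job facet carrying the old Marquez 'type' field."""
    jobType: str = attr.ib()


# Attached under the job's facets when building the run event, e.g.:
job_facets = {"jobType": JobTypeJobFacet(jobType="STREAMING")}
```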
*Thread Reply:* Hey Maciej, first off - thanks for being active on the channel!
> So, here I think the question is more how streaming jobs differ from batch jobs
Not really. I just gave an example of how you would express a specific job type creation, which can be done with https://marquezproject.github.io/marquez/openapi.html#tag/Jobs/paths/~1namespaces~1{namespace}~1jobs~1{job}/put|/api/v1/namespaces/.../jobs/..., by passing the type field, which is required. In the call to /api/v1/lineage the job field only lets you specify (namespace, name), but no other attributes.
> However, deprecated Marquez API didn't give us tools to properly indicate that. On the contrary, OpenLineage with different event types allows us to properly do that.
I have the feeling I'm still missing some key concepts on how OpenLineage is designed. I think I went over the API and documentation, but trying to use just OpenLineage failed to reproduce mildly complex chain-of-job scenarios, and when I took a look at how the Marquez seed demo is doing it - it was heavily based on the deprecated API. So, I'm kinda lost on how to use OpenLineage.
I'm looking forward to some open-public meeting, as I don't think asking these long questions on chat really works 🙂 Any pointers are welcome!
*Thread Reply:* > I just gave an example of how you would express a specific job type creation
Yes, but you're trying to achieve something by passing this parameter or creating a job in a certain way. We're trying to cover everything in the OpenLineage API. Even if we don't have everything, the spec from the beginning has been focused on allowing custom data to be emitted via the custom facet mechanism.
> I have the feeling I'm still missing some key concepts on how OpenLineage is designed. This talk by @Julien Le Dem is a great place to start: https://www.youtube.com/watch?v=HEJFCQLwdtk
*Thread Reply:* > Any pointers are welcome! BTW: OpenLineage is an open standard. Everyone is welcome to contribute and discuss. Every feedback ultimately helps us build better systems.
*Thread Reply:* I agree, but for now I'm more likely to be in the "I didn't get it" category, and not in the "brilliant new idea" category 🙂
My temporary goal is to go over the documentation and write up the gaps that confused me (and the solutions), and maybe publish that as an article for a wider audience. So far I realized that:
• I don't get the naming convention - it became clearer that it's important with the Naming examples, but more info is needed
• I mis-interpreted the namespaces. I was placing datasources and jobs in the same namespace, which caused a lot of issues until I started using different ones. Not sure why... So now I'm interpreting namespaces=source as suggested by the naming convention
• The JSON schema actually clarified things a lot, but that's not the most reader-friendly of resources, so surely there should be a better one
• I was questioning whether to move away from Marquez completely and go with DataHub, but for my scenario Marquez (with limitations outstanding) is still most suitable
• Marquez for some reason does not tolerate datetimes that are missing the 'T' delimiter in the ISO format, which caused a lot of trial-and-error because the message is just "JSON parsing failed"
• Marquez doesn't give you (at least by default) meaningful OpenLineage parsing errors, so running examples against it is a very slow learning process
Hi everyone,
I'm running the Spark Listener on Databricks. It works fine for the event emit part for a basic Databricks SQL Create Table query. Nevertheless, it throws a NullPointerException exception after sending lineage successfully.
I tried to debug a bit. Looks like it's thrown at the line:
QueryExecution queryExecution = SQLExecution.getQueryExecution(executionId);
So, does this mean that the listener can't get the query exec from Spark SQL execution?
Please see the logs in the thread. Thanks.
*Thread Reply:* Driver logs from Databricks:
```22/04/21 14:05:07 INFO EventEmitter: Lineage completed successfully: ResponseMessage(responseCode=200, body={}, error=null) {"eventType":"COMPLETE",[...], "schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
22/04/21 14:05:07 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception java.lang.NullPointerException at io.openlineage.spark.agent.lifecycle.ContextFactory.createSparkSQLExecutionContext(ContextFactory.java:43) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$getSparkSQLExecutionContext$8(OpenLineageSparkListener.java:221) at java.util.HashMap.computeIfAbsent(HashMap.java:1127) at java.util.Collections$SynchronizedMap.computeIfAbsent(Collections.java:2674) at io.openlineage.spark.agent.OpenLineageSparkListener.getSparkSQLExecutionContext(OpenLineageSparkListener.java:220) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:143) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:135) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1588) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)```
*Thread Reply:* @KaratuÄ Ozan BÄ°RCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609
*Thread Reply:* Spark 3.1.2 with Scala 2.12
*Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.
*Thread Reply:* Has this been resolved? I am facing the same issue with spark 3.2.
Does anyone have thoughts on the difference between the sourceCode and sql job facets - and whether we'd expect to ever see both on a particular job?
*Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written
Just got to know OpenLineage and it's really a great project! One question on the granularity of Spark + OpenLineage - is it possible to track column-level lineage (rather than the table lineage that's currently there)? Thanks!
*Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645
*Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?
*Thread Reply:* No, it should be automatic.
Hey, Team - We are starting to get requests for other, non Microsoft data sources (e.g. Teradata) for the Spark Integration. We (I) don't have a lot of bandwidth to fill every request but I DO want to help these people new to OpenLineage get started.
Has anyone on the team written up a blog post about extending OpenLineage, or is this an area that we could collaborate on for the OpenLineage blog? Alternatively, is it a bad idea to write this down since the internals have changed a few times over the past six months?
*Thread Reply:* Hey Will,
while I would not consider myself in the team, I'm dabbling in OL, hitting walls and learning as I go. If I don't have enough experience to contribute, I'd be happy to at least proof-read and point out things which are not clear from a novice perspective. Let me know!
*Thread Reply:* I'll hold you to that @Mirko Raca 🙂
*Thread Reply:* I will support! I've done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.
*Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.
*Thread Reply:* the most recent one was an Astronomer webinar - happy to share the slides with you if you want 🙂 here's a PDF:
*Thread Reply:* the other ones have not been public, unfortunately 🙁
*Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO
*Thread Reply:* Thank you so much, Ross! This is a great base to work from.
Hi All, I have a simple Spark job converting CSV to Parquet and I am using https://openlineage.io/integration/apache-spark/ to generate lineage events and post them to Marquez, but I see that both events (START & COMPLETE) are identical except for eventType. I thought we should see an outputs array in the COMPLETE event, right?
*Thread Reply:* For a spark job like that, you'd have at least four events:
For example, the JobStart event might give you access to properties that weren't there before. The JobEnd event might give you information about how many rows were written.
Marquez / OpenLineage expects that you collect all of the resulting events and then aggregate the results.
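*Thread Reply:* (Purely illustrative: the consumer's job is to correlate all events that share a runId and merge their inputs/outputs, so a single COMPLETE event is not expected to carry everything on its own. Something along these lines:)
```
from collections import defaultdict

# Toy aggregation keyed by runId: inputs/outputs from START, COMPLETE and any
# intermediate events for the same run are merged into one picture of the job.
runs = defaultdict(lambda: {"inputs": [], "outputs": []})

def collect(event: dict) -> None:
    state = runs[event["run"]["runId"]]
    state["inputs"].extend(event.get("inputs") or [])
    state["outputs"].extend(event.get("outputs") or [])
```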
*Thread Reply:* Hi @Will Johnson good evening. We are seeing an issue while using the Spark integration and found that when we provide the openlineage.host property a value like <http://lineage.com/common/marquez> where my Marquez API is running, the line below modifies the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem:
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/EventEmitter.java#L49
I see that it was added 5 months ago and released as part of 0.4.0. Is there any way we can fix the line to be like below?
```
this.lineageURI =
    new URI(
        hostURI.getScheme(),
        hostURI.getAuthority(),
        hostURI.getPath() + uriPath,
        queryParams,
        null);
```
*Thread Reply:* Can you open up a Github issue for this? I had this same issue and so our implementation always has to feature the /api/v1/lineage. The host config is literally the host. You're specifying a host and path. I'd be happy to see greater flexibility with the api endpoint but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.
Hi all ... does anyone have an example of a custom extractor with a different source and destination? I'm trying to build an extractor for a custom operator like mysql_to_s3
*Thread Reply:* @Michael Collado made one for a recent webinar:
https://gist.github.com/collado-mike/d1854958b7b1672f5a494933f80b8b58
*Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets
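*Thread Reply:* (And for the source-destination shape specifically, a bare-bones sketch; the operator attributes (mysql_table, s3_bucket, s3_key) and the namespace strings are hypothetical and would need to follow the naming conventions for your actual sources:)
```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MySqlToS3Extractor(BaseExtractor):
    """Sketch only: assumes the operator exposes mysql_table, s3_bucket and s3_key."""

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ["MySQLToS3Operator"]

    def extract(self) -> Optional[TaskMetadata]:
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="mysql://host:3306", name=self.operator.mysql_table)],
            outputs=[Dataset(namespace=f"s3://{self.operator.s3_bucket}", name=self.operator.s3_key)],
            run_facets={},
            job_facets={},
        )
```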
A release has been requested by @Howard Yoo and @Ross Turk pending the merging of PR 644. Are there any +1s?
*Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!
Hi All, We are seeing the below exception when we integrate openlineage-spark into our Spark job, can anyone share pointers?
```
Exception uncaught: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.SerializationConfig.hasExplicitTimeZone()Z
    at openlineage.jackson.datatype.jsr310.ser.InstantSerializerBase.formatValue(InstantSerializerBase.java:144)
    at openlineage.jackson.datatype.jsr310.ser.InstantSerializerBase.serialize(InstantSerializerBase.java:103)
    at openlineage.jackson.datatype.jsr310.ser.ZonedDateTimeSerializer.serialize(ZonedDateTimeSerializer.java:79)
    at openlineage.jackson.datatype.jsr310.ser.ZonedDateTimeSerializer.serialize(ZonedDateTimeSerializer.java:13)
    at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:727)
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
    at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
    at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3906)
    at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3220)
    at io.openlineage.spark.agent.client.OpenLineageClient.executeAsync(OpenLineageClient.java:123)
    at io.openlineage.spark.agent.client.OpenLineageClient.executeSync(OpenLineageClient.java:85)
    at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:80)
    at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:75)
    at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:70)
    at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:67)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:69)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:90)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:90)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:81)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
```
*Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle
*Thread Reply:* nothing special in the Spark job, it's just a simple CSV to Parquet conversion
*Thread Reply:* ah yeah that's probably it - when the job is finished before the Openlineage integration can poll it for information this error is thrown. Since the job is very quick it creates a race condition
*Thread Reply:* @John Thomas may I know how to solve this kind of issue?
*Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 - do you mean that if we add a sleep in this method it will solve this?
*Thread Reply:* oh no I meant making sure your jobs don't close too quickly
*Thread Reply:* Hi @John Thomas we figured out the error - it was indeed caused by conflicting versions; with shadowJar and shading we are not seeing it anymore.
@channel The latest release (0.8.1) of OpenLineage is now available, featuring a new TaskInstance listener API for Airflow 2.3+, an HTTP client in the openlineage-java library for emitting run events, support for HiveTableRelation as an input source in the Spark integration, a new SQL parser used by multiple integrations, and bug fixes. For more info, visit https://github.com/OpenLineage/OpenLineage/releases/tag/0.8.1
*Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:
The May meeting of the TSC will be postponed because most of the TSC will be attending the Astronomer Spring Summit the week of May 9th. Details to follow along with a new meeting day/time for the meeting going forward (thanks to all who responded to the poll!).
Are there examples of using openlineage with streaming data pipelines? Thanks
*Thread Reply:* Hi @Hubert Dulay,
while I'm not an expert, I can offer the following:
• Marquez has had the
Hey OL! My company is in the process of migrating off of Palantir and into Databricks/Azure. There are a couple of business units not wanting to budge due to the built-in data lineage and code reference features Palantir has. I am tasked with researching an alternative data lineage solution and I quickly came across OL. I love what I have read and seen demos of so far and want to do a POC for my org of its capabilities. I was able to set up the Marquez server on a VM and get it talking to Databricks. I also have the init script installed on the cluster and I can see from the log4j logs it's communicating fine (I think). However, I am embarrassed to admit I can't figure out how the instrumentation works for the Databricks notebooks. I ran a simple notebook that loads data, runs a simple transform, and saves the output somewhere, but I don't see any entries in the namespace I configured. I am sure I missed something very obvious somewhere, but are there examples of how to get a simple example into Marquez from Databricks? Thanks so much for any guidance you can give!
*Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!
*Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:
22/05/02 18:43:39 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
22/05/02 18:43:40 INFO EventEmitter: Init OpenLineageContext: Args: ArgumentParser(host=<http://135.170.226.91:8400>, version=v1, namespace=gus-namespace, jobName=default, parentRunId=null, apiKey=Optional.empty, urlParams=Optional[{}]) URI: <http://135.170.226.91:8400/api/v1/lineage>?
22/05/02 18:46:21 ERROR EventEmitter: Could not emit lineage [responseCode=0]: {"eventType":"START","eventTime":"2022-05-02T18:44:08.36Z","run":{"runId":"91fd4e13-52ac-4175-8956-c06d7dee97fc","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"eaa0543b_5e04_4f5b_844b_0e4598f019a7"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) ...
OpenLineageHttpException(code=0, message=java.lang.RuntimeException: java.util.concurrent.ExecutionException: openlineage.hc.client5.http.ConnectTimeoutException: Connect to <http://135.170.226.91:8400> [/135.170.226.91] failed: Connection timed out, details=java.util.concurrent.CompletionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: openlineage.hc.client5.http.ConnectTimeoutException: Connect to <http://135.170.226.91:8400> [/135.170.226.91] failed: Connection timed out)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:69)
at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:90)
at java.util.Optional.ifPresent(Optional.java:159)
at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:90)
at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:81)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1612)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
the connection timeout is surprising because I can connect just fine using the example curl code from the same cluster:
%sh
curl -X POST <http://135.170.226.91:8400/api/v1/lineage> \
-H 'Content-Type: application/json' \
-d '{
"eventType": "START",
"eventTime": "2020-12-28T19:52:00.001+10:00",
"run": {
"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
},
"job": {
"namespace": "gus2~-namespace",
"name": "my-job"
},
"inputs": [{
"namespace": "gus2-namespace",
"name": "gus-input"
}],
"producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
}'
Spark config:
spark.openlineage.host <http://135.170.226.91:8400>
spark.openlineage.version v1
spark.openlineage.namespace gus-namespace
Not sure what is going on, the EventEmitter init log looks like it's right but clearly something is off. Thanks so much for the help
*Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis
It might also be a firewall issue, but your cURL should preclude that
*Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain 'ol spark on my laptop and a localhost Marquez...
*Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script
*Thread Reply:* yeah, I was going to try that but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway just so I can see something working 🙂 (morale booster)
*Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂
*Thread Reply:* sounds good, i will give it a go!
*Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.
In addition, is http://135.170.226.91:8400 accessible to Databricks? Could you try doing a %sh command inside of a databricks notebook and see if you can ping that IP address (https://linux.die.net/man/8/ping)?
For your Databricks cluster, did you VNET-inject it into an existing VNET? If it's in an existing VNET, you should confirm that the VM running Marquez can access it. If it's not VNET-injected, you probably need to redeploy to a VNET that has that VM or has connectivity to that VM.
*Thread Reply:* Ya know, I meant to ask about that. Docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed based on this thread: https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.
*Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.
*Thread Reply:* I'm also trying to set up Databricks according to Running Marquez on AWS. Right now I'm stuck on the database part rather than the Marquez part - I can't connect my EKS cluster to the RDS database, which I described in more detail on the Marquez slack.
@Kostikey Mustakas Sorry for the distraction, but I'm curious how you have set up your networking to make the API requests work with Databricks. Good luck with your issue!
*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map HTTP traffic through, and I have a Load Balancer inbound NAT rule I set up that maps one of our whitelisted ports (8400) to 5000.
*Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout
*Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] ....
OpenLineageHttpException(code=null, message={"code":404,"message":"HTTP 404 Not Found"}, details=null)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)
*Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite malicious as it crashes the Spark Context and requires a restart.
I have Marquez running on AWS EKS; I'm using OpenLineage 0.8.2 on Databricks 10.4 (Spark 3.2.1) and my Spark config looks like this:
spark.openlineage.host <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com>
spark.openlineage.namespace default
spark.openlineage.version v1 <- also tried "1"
I can run some simple read and write commands and successfully find the log4j events highlighted in the docs:
INFO SparkContext;
INFO OpenLineageContext;
INFO AsyncEventQueue for each time I run the cell
After doing this a few times I get The spark context has stopped and the driver is restarting. Your notebook will be automatically reattached.
stderr shows a bunch of things. log4j shows the same as for Kostikey: ERROR EventEmitter: [...] Unable to serialize logical plan due to: Infinite recursion (StackOverflowError)
I have one more piece of information which I can't make much sense of, but hopefully someone else can; if I include the port in the host, I can very reliably crash the Spark Context on the first attempt. So:
<https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com> <- crashes after a couple of attempts, sometimes it takes me a while to reproduce it while repeatedly reading/writing the same datasets
<https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com:80> <- crashes on first try
Any insights would be greatly appreciated! 🙏
*Thread Reply:* I tried two more things:
• curl works, ping fails, just like in the previous report
• Databricks allows providing spark configs without quotes, whereas quotes are generally required for Spark. So I added the quotes to the host name, but now I'm getting: ERROR OpenLineageSparkListener: Unable to parse open lineage endpoint. Lineage events will not be collected
*Thread Reply:* @Kostikey Mustakas May I ask what the reason is for migrating from Palantir? Sorry for the off-topic question!
*Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795
*Thread Reply:* Thank you @Maciej Obuchowski. Just to clarify, the Spark Context crashes with and without the port; it's just that adding the port causes it to crash more quickly (on the 1st attempt).
I will run some more experiments when I have time, and add the results to the ticket.
Edit - added to issue:
I ran some more experiments, this time with a fake host and on OpenLineage 0.9.0, and was not able to reproduce the issue with regards to the port; instead, the new experiments show that Spark 3.2 looks to be involved.
On Spark 3.2.1 / Databricks 10.4 LTS: Using (fake) host http://ac7aca38330144df9.amazonaws.com:5000 crashes when the first notebook cell is evaluated with The spark context has stopped and the driver is restarting.
The same occurs when the port is removed.
On Spark 3.1.2 / Databricks 9.1 LTS: Using (fake) host http://ac7aca38330144df9.amazonaws.com:5000 does not impede the cluster but, reasonably, produces for each lineage event ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: java.net.UnknownHostException
The same occurs when the port is removed.
@channel The poll results are in, and the new day/time for the monthly TSC meeting is each second Thursday at 10 am PT. The next meeting will take place on Thursday, 5/19, at 10 am PT, due to a conflict with the Astronomer Spring Summit. Future meetings will take place on the second Thursday of each month. Calendar updates will be forthcoming. Thanks!
*Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?
*Thread Reply:* Yes, and I'll update the msg for others. Thank you
Hi Team, I saw that Marquez builds lineage from Java code via the seed command. What should I do to connect to MySQL (our database) with credentials and build lineage for our data?
@here How do we clear old jobs, datasets and namespaces from Marquez?
*Thread Reply:* It seems we can't for now. This was the same question I had last week:
https://github.com/MarquezProject/marquez/issues/1736
*Thread Reply:* Seems that it's a really popular request 🙂
Hello,
I'm sending lineage events to the astrocloud.datakin DB with the Marquez API. The event is sent, but the metadata for inputs and outputs isn't coming through. Below is an example of the event I'm sending. Not sure if this is the place for this question. Cross-posting to the Marquez Slack.
{
"eventTime": "2022-05-03T17:20:04.151087+00:00",
"run": {
"runId": "2dfc6dcd4011d2a1c3dc1e5861127e5b"
},
"job": {
"namespace": "from-airflow",
"name": "Postgres_1_to_Snowflake_2.extract"
},
"producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>",
"inputs": [
{
"name": "Postgres_1_to_Snowflake_2.extract",
"namespace": "from-airflow"
}
]
}
Thanks.
*Thread Reply:* @Mirko Raca pointed out that I was missing eventType.
Mirko Raca: "From a quick glance - you're missing the "eventType": "START" attribute. It's also worth noting that metadata typically shows up after the second event (type COMPLETE)"
thanks again.
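*Thread Reply:* For anyone else hitting this, a minimal sketch of the START/COMPLETE pair with the openlineage-python client (the URL and names below just reuse the example event above and are placeholders); Marquez generally only fills in the input/output metadata once the COMPLETE event for the run arrives:
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed Marquez endpoint

run = Run(runId=str(uuid4()))
job = Job(namespace="from-airflow", name="Postgres_1_to_Snowflake_2.extract")
inputs = [Dataset(namespace="from-airflow", name="Postgres_1_to_Snowflake_2.extract")]
producer = "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"

# START registers the run; the matching COMPLETE is what makes dataset metadata visible.
client.emit(RunEvent(RunState.START, datetime.now(timezone.utc).isoformat(),
                     run, job, producer, inputs=inputs, outputs=[]))
client.emit(RunEvent(RunState.COMPLETE, datetime.now(timezone.utc).isoformat(),
                     run, job, producer, inputs=inputs, outputs=[]))
```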
Hi Team, could anyone tell me: to view lineage in Marquez, do we have to write metadata as code, or does Marquez have a feature to scan the SQL code and build a lineage automatically? Please clarify my doubt regarding this.
*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if it doesn't exist, you should write your own integration (and collaborate with the project)
*Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.
*Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)
Hi all. We are looking at using OpenLineage for capturing some lineage in our custom processing system. I think we got the lineage events understood, but we often have datasets that get appended to, or get overwritten by an operation. Is there anything in OpenLineage that would facilitate making this distinction? (i.e. if a set gets overwritten we would be interested in the lineage events from the last overwrite; if it gets appended we would like to have all of these in the display)
*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change number of columns, it's the same thing.
The catch-all would be to create a Dataset facet which would record the distinction between append/overwrite per run. But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected).
*Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.
*Thread Reply:* Use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths
*Thread Reply:* We have the LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using the Spark integration
*Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). It displays this information when it exists
*Thread Reply:* Oh that looks perfect! I completely missed that, thanks!
Are there any examples on how to use this facet ColumnLineageDatasetFacet.json?
*Thread Reply:* Work with Spark is not yet fully merged
Hi All, I am trying to see where we can provide owner details when using the openlineage-spark configuration. I see only namespace and other config parameters but not the owner. Can we add an owner configuration as part of openlineage-spark, like spark.openlineage.owner? The owner would be used to filter namespaces when showing the jobs or namespaces in the Marquez UI.
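*Thread Reply:* Not an answer on the owner field, but for reference this is roughly how the listener gets wired up today - a sketch with placeholder values, and as far as I know there is no spark.openlineage.owner key yet:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("xyz")
    # pull the OpenLineage agent onto the classpath and register the listener
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.9.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # where to send events and how to group them in Marquez
    .config("spark.openlineage.host", "http://marquez:5000")
    .config("spark.openlineage.namespace", "my_namespace")
    .getOrCreate()
)
```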
@channel The next OpenLineage Technical Steering Committee meeting is next Thursday, 5/19, at 10 am PT! Going forward, meetings will take place on the second Thursday of each month at 10 am PT. Join us on Zoom: https://astronomer.zoom.us/j/87156607114?pwd=a3B0K210dnRaQmdkaFdGMytBREZEQT09 All are welcome! Agenda:
• releases 0.7.1 & 0.8.1
• column-level lineage
• open lineage
For notes and the agenda visit the wiki: https://tinyurl.com/openlineagetsc
Hi all, we are considering using OL to send lineage events from various jobs and places in our company. Since there will be multiple producers, we would like to use Kafka as our main hub for communication. One of our sources will be Airflow (more particularly MWAA, ie airflow in its 2.2.2 version). Is there a way to configure the Airflow lineage backend to send event to kafka instead of Marquez directly? So far, from what I've seen in the docs and in here, the only way would be to create a simple proxy to stream the http events to Kafka. Is it still the case?
*Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy
or configure OL client to send data to kafka: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that'll be great to know
*Thread Reply:* The second link đ
*Thread Reply:* Sure, we already implemented the python client for jobs outside airflow and it works great đ You are saying that there is a way to use this python client in conjunction with the MWAA lineage backend to relay the job events that come with the airflow integration (without including it in the DAGs)? Our strategy is to use both the airflow backend to collect automatic lineage events without modifying any existing DAGs, and the in-code implementation to allow our data engineers to send their own events if they want to. The second option works perfectly but the first one is where we struggle a bit, especially with MWAA.
*Thread Reply:* If you can mount file to MWAA, then yes - it should work with config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file
*Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!
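*Thread Reply:* For anyone following along, a rough sketch of the programmatic (non config-file) route with the Python client's Kafka transport - the class and field names follow the client README and should be double-checked against the version you run; broker and topic are placeholders:
```
from openlineage.client import OpenLineageClient
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

kafka_config = KafkaConfig(
    config={"bootstrap.servers": "broker1:9092"},  # passed straight to the Kafka producer
    topic="openlineage.events",
    flush=True,
)
client = OpenLineageClient(transport=KafkaTransport(kafka_config))
# client.emit(run_event) now publishes events to Kafka instead of HTTP
```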
A release has been requested. Are there any +1s? Three from committers will authorize. Thanks.
The OpenLineage TSC meeting is tomorrow at 10am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1652483224119229
Hey all, Do custom extractors work with the taskflow api?
*Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.
If the things you're sending/receiving with TaskFlow are accessible in terms of metadata in the environment the DAG is running in, then you should be able to make one that would work!
This Webinar goes over creating custom extractors for reference.
Does that answer your question?
*Thread Reply:* Taskflow internally is just PythonOperator. If you'd write extractor that assumes something more than just it being PythonOperator then you'd probably make it work đ
*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs:
[2022-05-18, 20:52:34 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=Postgres_1_to_Snowflake_1_v3 task_id=Postgres_1 airflow_run_id=scheduled__2022-05-18T20:51:34.334045+00:00
The picture is my custom extractor, it's not doing anything currently as this is just a test.
*Thread Reply:* thanks again for the help yall
*Thread Reply:* did you set the environment variable with the path to your extractor?
*Thread Reply:* I believe that's correct @John Thomas
*Thread Reply:* and the versions I'm using: Astronomer Runtime 5.0.0 based on Airflow 2.3.0+astro.1
*Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?
*Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor
*Thread Reply:* is there anything in logs regarding extractors?
*Thread Reply:* just this:
[2022-05-18, 21:36:59 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=competitive_oss_projects_git_to_snowflake task_id=Transform_git_logs_to_S3 airflow_run_id=scheduled__2022-05-18T21:35:57.694690+00:00
*Thread Reply:* @John Thomas Thanks, I appreciate your help.
*Thread Reply:* No Failed to import messages?
*Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log: ```* Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log. Please provide a bucket_name instead of "s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log" Falling back to local log * Reading local file: /usr/local/airflow/logs/dagid=Postgres1toSnowflake1v3/runid=scheduled2022-05-19T15:23:49.248097+00:00/taskid=Postgres1/attempt=1.log [2022-05-19, 15:24:50 UTC] {taskinstance.py:1158} INFO - Dependencies all met for <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled2022-05-19T15:23:49.248097+00:00 [queued]> [2022-05-19, 15:24:50 UTC] {taskinstance.py:1158} INFO - Dependencies all met for <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled_2022-05-19T15:23:49.248097+00:00 [queued]>
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1356} INFO - Starting attempt 1 of 1
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1376} INFO - Executing <Task(PythonDecoratedOperator): Postgres1> on 2022-05-19 15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:52} INFO - Started process 3957 to run task [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'Postgres1toSnowflake1v3', 'Postgres1', 'scheduled2022-05-19T15:23:49.248097+00:00', '--job-id', '96473', '--raw', '--subdir', 'DAGSFOLDER/pgtosnow.py', '--cfg-path', '/tmp/tmp9n7u3i4t', '--error-file', '/tmp/tmp9a55v9b'] [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:80} INFO - Job 96473: Subtask Postgres1 [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/configuration.py:470 DeprecationWarning: The sqlalchemyconn option in [core] has been moved to the sqlalchemyconn option in [database] - the old setting has been used, but please update your config. [2022-05-19, 15:24:50 UTC] {taskcommand.py:369} INFO - Running <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled2022-05-19T15:23:49.248097+00:00 [running]> on host 056ca0b6c7f5 [2022-05-19, 15:24:50 UTC] {taskinstance.py:1568} INFO - Exporting the following env vars: AIRFLOWCTXDAGOWNER=airflow AIRFLOWCTXDAGID=Postgres1toSnowflake1v3 AIRFLOWCTXTASKID=Postgres1 AIRFLOWCTXEXECUTIONDATE=20220519T15:23:49.248097+00:00 AIRFLOWCTXTRYNUMBER=1 AIRFLOWCTXDAGRUNID=scheduled2022-05-19T15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'executiondate' from the template is deprecated and will be removed in a future version. Please use 'dataintervalstart' or 'logicaldate' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextds' from the template is deprecated and will be removed in a future version. Please use '{{ dataintervalend | ds }}' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextdsnodash' from the template is deprecated and will be removed in a future version. Please use '{{ dataintervalend | dsnodash }}' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextexecutiondate' from the template is deprecated and will be removed in a future version. Please use 'dataintervalend' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevdsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevexecutiondate' from the template is deprecated and will be removed in a future version. 
[2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevexecutiondatesuccess' from the template is deprecated and will be removed in a future version. Please use 'prevdataintervalstartsuccess' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'tomorrowds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'tomorrowdsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'yesterdayds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'yesterdaydsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {python.py:173} INFO - Done. Returned value was: extract [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/models/baseoperator.py:1369 DeprecationWarning: Passing 'executiondate' to 'TaskInstance.xcompush()' is deprecated. [2022-05-19, 15:24:50 UTC] {init.py:97} WARNING - Unable to find an extractor. tasktype=PythonDecoratedOperator airflowdagid=Postgres1toSnowflake1v3 taskid=Postgres1 airflowrunid=scheduled2022-05-19T15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {client.py:74} INFO - Constructing openlineage client to send events to https://api.astro-livemaps.datakin.com/ [2022-05-19, 15:24:50 UTC] {taskinstance.py:1394} INFO - Marking task as SUCCESS. dagid=Postgres1toSnowflake1v3, taskid=Postgres1, executiondate=20220519T152349, startdate=20220519T152450, enddate=20220519T152450 [2022-05-19, 15:24:50 UTC] {localtaskjob.py:156} INFO - Task exited with return code 0 [2022-05-19, 15:24:50 UTC] {localtask_job.py:273} INFO - 1 downstream tasks scheduled from follow-on schedule check```
*Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?
*Thread Reply:* @Josh Owens one thing I can think of is that you might have an older openlineage integration version, as the OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694
*Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2
Hi đ, I'm looking at OpenLineage as a solution for fine-grained data lineage tracking. Could I clarify a couple of points?
Where does one specify the version of an input dataset in the RunEvent? In the Marquez seed data I can see that it's recorded, but I'm not sure where it goes from looking at the OpenLineage schema. Or does it just assume the last version?
*Thread Reply:* Currently, it assumes latest version. There's an effort with DatasetVersionDatasetFacet to be able to specify it manually - or extract this information from cases like Iceberg or Delta Lake tables.
*Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?
*Thread Reply:* yes
*Thread Reply:* Thanks, that's very helpful đ
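*Thread Reply:* A small sketch of what pinning a version manually could look like, assuming your client version already ships DatasetVersionDatasetFacet (namespace, name and version below are placeholders):
```
from openlineage.client.run import Dataset
from openlineage.client.facet import DatasetVersionDatasetFacet

# attach the facet under the "version" key of the dataset's facets dict
versioned_input = Dataset(
    namespace="snowflake://my-account",
    name="analytics.public.orders",
    facets={"version": DatasetVersionDatasetFacet(datasetVersion="2022-05-25-001")},
)
```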
Hi all, I was testing https://github.com/MarquezProject/marquez/tree/main/examples/airflow#step-21-create-dag-counter, and the following error was observed in my airflow env:
Anybody know why this is happening? Any comments would be welcomed.
*Thread Reply:* @Howard Yoo What version of airflow?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.
"Airflow 2.3+ Integration automatically registers itself for Airflow 2.3 if it's installed on Airflow worker's python. This means you don't have to do anything besides configuring it, which is described in Configuration section."
*Thread Reply:* Right, configuring I don't see any issues
*Thread Reply:* so you don't need:
from openlineage.airflow import DAG
in your dag files
*Thread Reply:* so if you need to import DAG it would just be:
from airflow import DAG
@channel OpenLineage 0.8.2 is now available! The project now supports credentialing from the Airflow Secrets Backend and for the Azure Databricks Credential Passthrough, detection of datasets wrapped by ExternalRDDs, bug fixes, and more. For the details, see: https://github.com/OpenLineage/OpenLineage/releases/tag/0.8.2
Hi~ everyone, is it possible for openlineage to support Camel pipelines?
*Thread Reply:* What changes do you mean by letting openlineage support? Or, do you mean, to write Apache Camel integration?
*Thread Reply:* @Maciej Obuchowski Yes, let openlineage work the same way it does for airflow
*Thread Reply:* I think this is a very valuable thing. I wish openlineage could support some commonly used pipeline tools, and abstract out some general interfaces so that users can extend it themselves
*Thread Reply:* For Python, we have OL client, common libraries (well, at least beginning of them) and SQL parser
*Thread Reply:* As we support more systems, the general libraries will grow as well.
I see a change in the metadata collected from Airflow jobs which I think was introduced with the combination of Airflow 2.3/OpenLineage 0.8.1. There's an airflow_version facet that contains an operator attribute.
Previously that attribute had values such as: airflow.providers.postgres.operators.postgres.PostgresOperator
but I now see that for the very same task the operator is now tracked as: airflow.models.taskinstance.TaskInstance
(fwiw there's also a taskInfo attribute in there containing a json string which itself has an operator that is still set to PostgresOperator)
Is this an already known issue?
*Thread Reply:* This looks like a bug. we are probably not looking at the right instance in the TaskInstanceListener
*Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this
Would anyone happen to have a link to the Technical Steering Committee meeting recordings?
I have quite a few people interested in seeing the overview of column lineage that Pawel provided during the Technical Steering Committee meeting on Thursday May 19th.
The wiki does not include a link to the recordings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
Are the recordings made public? Thank you for any links and guidance!
That would be @Michael Robinson. Yes, the recordings are made public.
@Will Johnson I'll put this on the wiki (https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting) soon, but here is the link to the recording: https://astronomer.zoom.us/rec/share/xUBW-n6G4u1WS89tCSXStx8BMl99rCfCC6jGdXLnkN6gMGn5G-_BC7pxHKKeELhG.0JFl88isqb64xX-3 PW: 1VJ=K5&X
Is there documentation/examples around creating custom facets?
*Thread Reply:* In Python or Java?
*Thread Reply:* In Python just inherit BaseFacet and add a _get_schema static method that would point to some place where you host the JSON schema of the facet. For example our DbtVersionRunFacet.
In Java you can take a look at Spark's custom facets.
*Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.
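*Thread Reply:* To make it concrete, a minimal sketch of a custom facet in Python - the facet name and schema URL are placeholders, and the pattern just mirrors the built-in facets:
```
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class TeamOwnershipRunFacet(BaseFacet):
    """Hypothetical facet recording which team owns a run."""
    team: str = attr.ib()
    contact: str = attr.ib(default="")

    @staticmethod
    def _get_schema() -> str:
        # point this at a JSON Schema for the facet hosted somewhere reachable
        return "https://example.com/schemas/TeamOwnershipRunFacet.json"
```
It can then be passed in the facets dictionary of a Run, Job, or Dataset like any built-in facet.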
I'm not sure what the disconnect is, but the facets aren't showing up in the inputs and outputs. The Lineage event is sent successfully to my astrocloud.
below is the facet and extractor, any help is appreciated. Thanks!
```import logging
from typing import List, Optional

import attr

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import BaseFacet
from openlineage.client.run import InputDataset, OutputDataset

log = logging.getLogger(__name__)


@attr.s
class ManualLineageFacet(BaseFacet):
    database: Optional[str] = attr.ib(default=None)
    cluster: Optional[str] = attr.ib(default=None)
    connectionUrl: Optional[str] = attr.ib(default=None)
    target: Optional[str] = attr.ib(default=None)
    source: Optional[str] = attr.ib(default=None)
    _producer: str = attr.ib(init=False)
    _schemaURL: str = attr.ib(init=False)
@staticmethod
def _get_schema() -> str:
return {
"$schema": "<http://json-schema.org/schema#>",
"$defs": {
"ManualLineageFacet": {
"allOf": [
{
"type": "object",
"properties": {
"database": {
"type": "string",
"example": "Snowflake",
},
"cluster": {
"type": "string",
"example": "us-west-2",
},
"connectionUrl": {
"type": "string",
"example": "<http://snowflake>",
},
"target": {
"type": "string",
"example": "Postgres",
},
"source": {
"type": "string",
"example": "Stripe",
},
"description": {
"type": "string",
"example": "Description of inlet/outlet",
},
"_producer": {
"type": "string",
},
"_schemaURL": {
"type": "string",
},
},
},
],
"type": "object",
}
},
}
class ManualLineageExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ["PythonOperator", "_PythonDecoratedOperator"]
def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
return TaskMetadata(
f"{task_instance.dag_run.dag_id}.{task_instance.task_id}",
inputs=[
InputDataset(
namespace="default",
name=self.operator.get_inlet_defs()[0]["name"],
inputFacets=ManualLineageFacet(
database=self.operator.get_inlet_defs()[0]["database"],
cluster=self.operator.get_inlet_defs()[0]["cluster"],
connectionUrl=self.operator.get_inlet_defs()[0][
"connectionUrl"
],
target=self.operator.get_inlet_defs()[0]["target"],
source=self.operator.get_inlet_defs()[0]["source"],
),
)
if self.operator.get_inlet_defs()
else {},
],
outputs=[
OutputDataset(
namespace="default",
name=self.operator.get_outlet_defs()[0]["name"],
outputFacets=ManualLineageFacet(
database=self.operator.get_outlet_defs()[0]["database"],
cluster=self.operator.get_outlet_defs()[0]["cluster"],
connectionUrl=self.operator.get_outlet_defs()[0][
"connectionUrl"
],
target=self.operator.get_outlet_defs()[0]["target"],
source=self.operator.get_outlet_defs()[0]["source"],
),
)
if self.operator.get_outlet_defs()
else {},
],
job_facets={},
run_facets={},
)
def extract(self) -> Optional[TaskMetadata]:
pass```
*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object field where the server expects a string field might cause some problems
*Thread Reply:* can you register ManualLineageFacet as facets, not as inputFacets or outputFacets?
*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working! Also great talk today at the airflow summit.
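*Thread Reply:* For anyone else landing here, the fix boiled down to hanging the custom facet off the dataset's facets dictionary rather than inputFacets/outputFacets - a trimmed sketch (the dataset name is a placeholder, ManualLineageFacet as defined above):
```
from openlineage.client.run import InputDataset

input_dataset = InputDataset(
    namespace="default",
    name="my_inlet_table",
    facets={
        "manualLineage": ManualLineageFacet(
            database="Snowflake",
            cluster="us-west-2",
        )
    },
)
```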
Hey guys! I'm pretty new with OL but would like to start using it for a combination of data lineage in Airflow + data quality metrics collection. I was wondering if that was possible, but Ross clarified that in the deeper dive webinar from some weeks ago (great one by the way!).
I'm referencing this comment from Julien to see if you have any updates or more examples apart from the one from great expectations. We have some custom operators and would like to push lineage and data quality metrics to Marquez using custom extractors. Any reference will be highly appreciated. Thanks in advance!
*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L399
*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to tested dataset
*Thread Reply:* Thanks @Maciej Obuchowski!!!
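*Thread Reply:* A rough sketch of attaching those facets to a tested dataset from a custom integration, assuming your client version exposes these facet classes (the values are made up):
```
from openlineage.client.run import Dataset
from openlineage.client.facet import (
    ColumnMetric,
    DataQualityMetricsInputDatasetFacet,
)

tested_dataset = Dataset(
    namespace="postgres://my-db",
    name="public.orders",
    facets={
        "dataQualityMetrics": DataQualityMetricsInputDatasetFacet(
            rowCount=1500,
            columnMetrics={"order_id": ColumnMetric(nullCount=0, distinctCount=1500)},
        )
    },
)
```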
Hi all, https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development <-- does this still work? I did follow the instructions, but running pytest failed with error messages like
________________________________________________ ERROR collecting tests/extractors/test_bigquery_extractor.py ________________________________________________
ImportError while importing test module '/Users/howardyoo/git/OpenLineage/integration/airflow/tests/extractors/test_bigquery_extractor.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
openlineage/airflow/utils.py:251: in import_from_string
module = importlib.import_module(module_path)
/opt/homebrew/Caskroom/miniconda/base/envs/airflow/lib/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1030: in _gcd_import
???
<frozen importlib._bootstrap>:1007: in _find_and_load
???
<frozen importlib._bootstrap>:986: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:680: in _load_unlocked
???
<frozen importlib._bootstrap_external>:850: in exec_module
???
<frozen importlib._bootstrap>:228: in _call_with_frames_removed
???
../../../airflow.master/airflow/providers/google/cloud/operators/bigquery.py:39: in <module>
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook, BigQueryJob
../../../airflow.master/airflow/providers/google/cloud/hooks/bigquery.py:46: in <module>
from googleapiclient.discovery import Resource, build
E ModuleNotFoundError: No module named 'googleapiclient'
looks like just running pytest wouldn't be able to run all the tests - some of these dag tests seem to require connectivity to google's bigquery, databases, etc.
đ Hi everyone! I didn't find this in the documentation. Can open lineage show me which source columns the final DataFrame column came from? (Spark)
*Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side
*Thread Reply:* Thanks! I will keep an eye on updates.
Hi all, showcase time:
We have implemented a native OpenLineage endpoint and metadata writer in our Keboola all-in-one data platform.
The reason was that for more complex data pipeline scenarios it is beneficial to display the lineage in more detail. Additionally, we hope that OpenLineage as a standard will catch up and open up the ability to push lineage data into other data governance tools than Marquez.
The implementation started as an internal POC of tweaking our metadata into the OpenLineage /lineage format and resulted in a native API endpoint and later on an app within the Keboola platform ecosystem - feeding platform job metadata in a regular cadence.
We furthermore use a namespace for each Keboola project so users can observe the data through their whole data mesh setup (multi-project architecture).
Please reach out to me if you have any questions!
*Thread Reply:* Looks great! Thanks for sharing!
Hi OpenLineage team,
I am Gopi Krishnan Rajbahadur, one of the core members of OpenDatalogy project (a project that we are currently trying to sandbox as a part of LF-AI). Our OpenDatalogy project focuses on providing a process that allows users of publicly available datasets (e.g., CIFAR-10) to ensure license compliance. In addition, we also aim to provide a public repo that documents the final rights and obligations associated with common publicly available datasets, so that users of these datasets can use them compliantly in their AI models and software.
One of the key aspects of conducting dataset license compliance analysis involves tracking the lineage and provenance of the dataset (as we highlight in this paper here: https://arxiv.org/abs/2111.02374). We think that in this regard, our projects (i.e., OpenLineage and OpenDatalogy) could work together to use the existing OpenLineage standard and also collaborate to adopt/modify/enhance and use OpenLineage to track and document the lineage of a publicly available dataset. On that note, we are also working with the SPDX community to make the lineage and provenance of a dataset be tracked as a part of the SPDX BOM that is in the works for representing AI software (AI SBOM).
We think our projects could mutually benefit from collaborating with each other. Our project's Github could be found here: https://github.com/OpenDataology/OpenDataology. Any feedback that you have about our project would be greatly appreciated. Also, as we are trying to sandbox our project, if you could also show us your support we would greatly appreciate it!
Look forward to hearing back from you
Sincerely, Gopi
Hi guys, sorry for the basics. I did a PoC of using OpenLineage for gathering metrics on Spark jobs, especially for table creation, alter and drop. I noticed that drop/alter table statements do not trigger the listener to post lineage data. Is this normal behaviour?
*Thread Reply:* Might be that case if you're using Spark 3.2
*Thread Reply:* There were some changes to those operators
*Thread Reply:* If you're not using 3.2, please share more details đ
*Thread Reply:* Yeap, im using spark version 3.2.1
*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?)
*Thread Reply:* btw thank you for quick response @Maciej Obuchowski
*Thread Reply:* Yes, we have issue for AlterTable at least
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that's the issue for altering tables in Spark 3.2.
@Ilqar Memmedov Did you mean drop table or drop columns? I am not aware of any drop table issue.
*Thread Reply:* @PaweĆ LeszczyĆski drop table statement.
*Thread Reply:* To reproduce it, I just created a simple Spark job: create a table as select from another table, select data from the table, and then drop the entire table.
Lineage data was posted only for the "create table as select" part
*Thread Reply:* hi xiang đ lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.
*Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail
Another problem is that if I declare a skipped task (e.g. DummyOperator) in the DAG, it will never appear in the job list. I think this is a problem, because even if it cannot run, it should be possible to see it as a metadata object.
@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, June 9 at 10 am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome! Agenda:
@channel OpenLineage 0.9.0 is now available, featuring column-level lineage in the Spark integration, bug fixes and more! For the details, see: https://github.com/OpenLineage/OpenLineage/releases/tag/0.9.0 and https://github.com/OpenLineage/OpenLineage/compare/0.8.2...0.9.0. Thanks to all the contributors who made this release possible, including @PaweĆ LeszczyĆski for authoring the column-level lineage PRs and new contributor @JDarDagran!
Hey, all. Working on a PR to OpenLineage. I'm curious about file naming conventions for facets. I'm noticing that there are two conventions being used:
• In OpenLineage.spec.facets; ex. ExampleFacet.json
• In OpenLineage.integration.common.openlineage.common.schema; ex. example-facet.json.
Thanks
*Thread Reply:* I think internal naming is more important đ
I guess, for now, try to match what the local directory has.
Hi Team, we are seeing DatasetName set to the custom query when we run a Spark job which queries an Oracle DB over JDBC with a custom query, and the custom query has newlines in it, which causes the NodeId ID_PATTERN match to fail. How do we give a custom dataset name when we use custom queries?
Marquez API regex ref: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/models/NodeId.java#L44
ERROR [2022-06-07 06:11:49,592] io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 3648e87216d7815b
! java.lang.IllegalArgumentException: node ID (dataset:oracle:thin:_//<host-name>:1521:(
! SELECT
! RULE.RULE_ID,
! ASSG.ASSIGNED_OBJECT_ID, ASSG.ORG_ID, ASSG.SPLIT_PCT,
! PRTCP.PARTICIPANT_NAME, PRTCP.START_DATE, PRTCP.END_DATE
! FROM RULE RULE,
! ASSG ASSG,
! PRTCP PRTCP
! WHERE
! RULE.RULE_ID = ASSG.RULE_ID(+)
! --AND RULE.RULE_ID = 300100207891651
! AND PRTCP.PARTICIPANT_ID = ASSG.ASSIGNED_OBJECT_ID
! -- and RULE.created_by = ' 1=1 '
! and 1=1
! )) must start with 'dataset', 'job', or 'run'
Hi Team,
We have a spark job xyz that uses the OpenLineageListener, which posts lineage events to the Marquez server. But we are seeing some unknown jobs in the Marquez UI:
• xyz.collect_limit
• xyz.execute_insert_into_hadoop_fs_relation_command
What jobs are these (collect_limit, execute_insert_into_hadoop_fs_relation_command)?
How do we get the lineage listener to post only our job (xyz)?
*Thread Reply:* Those jobs are actually what Spark does underneath đ
*Thread Reply:* Are you using Delta Lake btw?
*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.
*Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200
*Thread Reply:* I agree that it looks bad in the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.
If anything, we should filter some 'useless' nodes like collect_limit since they add nothing.
We have an issue for doing this to specifically delta lake operations, as they are the biggest offenders: https://github.com/OpenLineage/OpenLineage/issues/628
*Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?
*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have different set of API's before calling the spark job (already integrated with OpenLineageSparkListener), we want to see how the different set of params pass thru these components before landing into the spark job. If we use openlineage client to post the lineage events into the Marquez, do we need to mention the same Run UUID across the lineage events for the run or is there any other way to do this? Can you pls advise?
*Thread Reply:* I think I understand what you are asking -
The runID is used to correlate different state updates (i.e., start, fail, complete, abort) across the lifespan of a run. So if you are trying to add additional metadata to the same job run, you'd use the same runID.
So you'd generate a runID and send a START event, then in the various components you could send OTHER events containing the same runID + params you want to study in facets, then at the end you would send a COMPLETE.
(I think there should be an UPDATE event type in the spec for this sort of thing.)
*Thread Reply:* thanks @Ross Turk but what i am looking for is lets say for example, if we have 4 components in the system then we want to show the 4 components as job icons in the graph and the datasets between them would show the input/output parameters that these components use. A(job) --> DS1(dataset) --> B(job) --> DS2(dataset) --> C(job) --> DS3(dataset) --> D(job)
*Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined
*Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output
*Thread Reply:* (fyi: I know openlineage but my understanding stops at spark đ)
*Thread Reply:* > The eventual "aggregation" should be done by event consumer. @Maciej Obuchowski Are there any known client side libraries that support this aggregation already? In case of spark applications running as part of ETL pipelines, most of the time our end user is interested in seeing only the aggregated view where all jobs spawned as part of a single application are rolled up into 1 job.
*Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.
We'd love to have something like it, but AFAIK it affects only some percentage of Spark jobs and we can only do so much.
With exception of Delta Lake/Databricks, where it affects every job, and we know some nodes that could be safely filtered client side.
*Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!
Apache Atlas doesn't have the same model as Marquez. It only knows of effectively one entity that represents the complete asset.
@Mark Taylor designed this solution available now on Github to consolidate OpenLineage messages
In addition, we do some filtering only based on inputs and outputs to limit the messages AFTER it has been emitted.
@channel The next OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1654093173961669
*Thread Reply:* Hi, is the link correct? The meeting room is empty
*Thread Reply:* sorry about that, thanks for letting us know
Hello all, after sending dbt openlineage events to Marquez, I am now looking to use the Marquez API to extract the lineage information. I am able to use python requests to call the Marquez API to get other information such as namespaces, datasets, etc., but I am a little bit confused about what I need to enter to get the lineage. I included screenshots for what the API reference shows regarding retrieving the lineage where it shows that a nodeId is required. However, this is where I seem to be having problems. It is not exactly clear where the nodeId needs to be set or what the nodeId needs to include. I would really appreciate any insights. Thank you!
*Thread Reply:* Hey @Mark Beebe!
In this case, nodeId is going to be either a dataset or a job. You need to tell Marquez where to start since there is likely to be more than one graph. So you need to get your hands on an identifier for that starting node.
*Thread Reply:* aaaaannnnd that's actually all the ways I can think of.
*Thread Reply:* That worked, thank you so much!
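*Thread Reply:* For reference, a small sketch of the call that ends up working - the nodeId is just dataset:&lt;namespace&gt;:&lt;name&gt; (or job:&lt;namespace&gt;:&lt;name&gt;); the host and names below are placeholders:
```
import requests

MARQUEZ_URL = "http://localhost:5000"
node_id = "dataset:from-airflow:Postgres_1_to_Snowflake_2.extract"

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/lineage",
    params={"nodeId": node_id, "depth": 2},  # depth is optional
)
resp.raise_for_status()
print(resp.json()["graph"])
```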
Hi all, I need to send the lineage information from spark integration directly to a kafka topic. Java client seems to have a KafkaTransport, is it planned to have this support from inside the spark integration as well?
Hi all, I'm working on a blog post about the Spark integration and would like to credit @tnazarew and @Sbargaoui for their contributions. Anyone know these contributors' names? Are you on here? Thanks for any leads.
*Thread Reply:* tnazarew - Tomasz Nazarewicz
Has anyone tried getting the OpenLineage Spark integration working with GCP Dataproc ?
Hi Folks, DataEngBytes is a community data engineering conference here in Australia and will be hosted on the 27th and 29th of September. Our CFP is open for just under a month and tickets are on sale now: Call for paper: https://sessionize.com/dataengbytes-2022/ Tickets: https://www.tickettailor.com/events/dataengbytes/713307 Promo video https://youtu.be/1HE_XNLvHss
A release of OpenLineage has been requested pending the merging of #856. Three +1s will authorize a release today. @Willy Lulciuc @Michael Collado @Ross Turk @Maciej Obuchowski @PaweĆ LeszczyĆski @Mandy Chessell @Daniel Henneberger @Drew Banin @Julien Le Dem @Ryan Blue @Will Johnson @Zhamak Dehghani
đ Hi everyone!
hi
@channel OpenLineage 0.10.0 is now available! We added SnowflakeOperatorAsync extractor support to the Airflow integration, an InMemoryRelationInputDatasetBuilder for InMemory datasets to the Spark integration, a static code analysis tool to run in CircleCI on Python modules, a copyright to all source files, and the PMD static analysis tool to the build process.
Changes we made include skipping FunctionRegistry.class serialization in the Spark integration, installing the new rust-based SQL parser by default in the Airflow integration, improving the integration tests for the Airflow integration, reducing event payload size by excluding local data and including an output node in start events, and splitting the Spark integration into submodules.
Thanks to all the contributors who made this release possible!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.10.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.9.0...0.10.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Why has put dataset been deprecated? How do I add an initial data set via api?
*Thread Reply:* I think you're referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won't be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323
*Thread Reply:* Once the spec supports dataset metadata, we'll outline steps in the Marquez project to switch to using the new dataset event type
*Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use
So how would you propose I create the initial node if I am trying to do a POC?
*Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)
*Thread Reply:* Sorry didn't notice you over here ! lol
*Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws
*Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?
*Thread Reply:* no, just visualize the current migration flow.
*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS đ
*Thread Reply:* really AWS is irrelevant. Source sink -> migration scriipts -> s3 -> additional processing -> final sink
*Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)
*Thread Reply:* which I think I can do once the first nodes exist
*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, but that didn't do it. I also can't get the sql context that is in these examples.
*Thread Reply:* really just want to re-create food_delivery using my own biz context
*Thread Reply:* Have you looked over our workshops and this example? (assuming you're using python?)
*Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!
*Thread Reply:* Don't forget to configure the transport for the client
*Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice đ
*Thread Reply:* thanks! … but we're now part of astronomer.io đ
*Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time
Hello, Is OpenLineage planning to add support for inlets and outlets for Airflow integration? I am working on a project that relies on it and was hoping to contribute to this feature if its something that is in the talks. I saw an open issue here
I am willing to work on it. My plan was to just support Files and Tables entities (for inlets and outlets).
Pass the inlets and outlets info into the extract_metadata function here and then convert Airflow entities into TaskMetaData entities here.
Does this sound reasonable?
*Thread Reply:* Honestly, I've been a huge fan of using / falling back on inlets and outlets since day 1. AND if you're willing to contribute this support, you get a +1 from me (I'll add some minor comments to the issue) /cc @Julien Le Dem
*Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well
*Thread Reply:* I have created a draft PR for this here. Please let me know if the changes make sense.
*Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too
*Thread Reply:* I have made the changes in-line to the mentioned comments here. Does this look good?
*Thread Reply:* I think it looks good! Would be great to have tests for this feature though.
*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done. Thank you for the support! đ
*Thread Reply:* One change and I think it will be good for now.
*Thread Reply:* Have you tested it manually?
*Thread Reply:* Thanks a lot for the review! Appreciate it đ Yes, I tested it manually (for Airflow versions 2.1.4 and 2.3.3) and it works đ
*Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )
*Thread Reply:* Yes, Sure! I will add it in the PR description
*Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag
*Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there đ đ
*Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?
*Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?
*Thread Reply:* It is easy đ it's just markdown: https://github.com/openlineage/docs/
*Thread Reply:* @Fenil Doshi feel free to create new page here and don't sweat where to put it, we'll still figuring the structure of it out and will move it then
*Thread Reply:* exactly, yes - don't be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will likely be edited or moved before it becomes official đ
*Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16
Also, have added an example for it. đ Let me know if something is unclear and needs to be updated.
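*Thread Reply:* Not authoritative, but this is the kind of example DAG I had in mind (operator and table names are placeholders; it assumes the manual inlets/outlets support described in the docs PR above):
```
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import Table
from airflow.operators.python import PythonOperator


def transfer():
    pass  # the actual copy logic would live here


with DAG("manual_lineage_example", start_date=datetime(2022, 7, 1), schedule_interval=None) as dag:
    PythonOperator(
        task_id="copy_orders",
        python_callable=transfer,
        inlets=[Table(database="analytics", cluster="postgres://prod", name="public.orders")],
        outlets=[Table(database="warehouse", cluster="snowflake://acct", name="analytics.orders")],
    )
```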
*Thread Reply:* Does Airflow check the types of the inlets/outlets btw?
Like I wonder if a user could directly define an OpenLineage DataSet ( which might even have various other facets included on it ) and specify it in the inlets/outlets ?
*Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.
*Thread Reply:* I am accustomed to creating OpenLineage entities like this:
taxes = Dataset(namespace="<postgres://foobar>", name="schema.table")
*Thread Reply:* I don't dislike the airflow.lineage.entities models especially, but if we only support one of them…
*Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.
Like we would suggest users to use openlineage.client.run.Dataset but if a user already has DAGs that use Table then they'd still work in a best-efforts way.
*Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones
*Thread Reply:* hm, not sure I understand the dependency issue. isn't this extractor living in openlineage-airflow?
*Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015
( I left the support for converting the Airflow Table to Dataset because I think that's nice to have also )
food_delivery example example.etl_categories node
*Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)
*Thread Reply:* Give me a sec to send you over the diff…
*Thread Reply:* ⊠continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR
I'm very new to DBT but wanted to give it a try with OL. I had a couple of questions when going through the DBT tutorial here: https://docs.getdbt.com/guides/getting-started/learning-more/getting-started-dbt-core
After running dbt-ol I got a lineage graph like this; then a later part of the tutorial has you split that same example into multiple models, and when I run it again I get a graph like this:
^ I'm just kind of curious if it's working as expected? And/or could it be possible to parse the dbt .sql so that the lineage in the first case would still show those staging tables?
*Thread Reply:* I think you should declare those as sources? Or do you need something different?
*Thread Reply:* I'll try to experiment with this.
*Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build
Hi everyone, I am trying openlineage-dbt. It works perfectly locally when I publish the events to Marquez, but when I run the same commands from MWAA I don't see those events triggered, and I am not able to view any logs to see if there is an error. How do I debug this issue?
*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available (environment variables or conf file).
*Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)
*Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are
Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on July 14. Reply in thread or ping me with your item(s)!
*Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?
A few months ago, Flink was being introduced and it was said that more thought was needed around supporting streaming services in OpenLineage.
It would be very helpful to know where the community stands on how streaming data sources should work in OpenLineage.
*Thread Reply:* @Will Johnson added your item
Request for Creating a New OpenLineage Release
Hello #general, as per the Governance guide (https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md#openlineage-project-releases), I am asking that we generate a new release based on the latest commit by @Maciej Obuchowski (c92a93cdf3df636a02984188563d019474904b2b) which fixes a critical issue running OpenLineage on Azure Databricks.
Having this release made available to the general public on Maven would allow us to enable the hundred+ users of the solution to run OpenLineage on the latest LTS versions of Databricks. In addition, it would enable the Microsoft team to integrate the amazing column level lineage feature contributed by @PaweĆ LeszczyĆski with our solution for Microsoft Purview.
@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, July 14 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:
*Thread Reply:* would appreciate a TSC discussion on OL philosophy for Streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, moreso just validating how OL is being positioned from a vision perspective. as we consider aligning enterprise lineage solution around OL want to make sure we're not making bad assumptions. neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".
*Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?
*Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!
*Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.
*Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022
I'm happy to help craft the abstract, look over slides, etc. (I could help present, but all I've done with OpenLineage is one tutorial, so I'm hardly an expert).
CfP closes 31 Aug so there's plenty of time, but if you want a 2nd set of eyes on things, we can't just wait until the last minute to submit đ
How to create custom facets without recompiling OpenLineage?
I have a customer who is interested in using OpenLineage but wants to extend the facets WITHOUT recompiling OL / maintaining a clone of OL with their changes.
Do we have any examples of how someone might create their own jar but using the OpenLineage CustomFacetBuilder and then have that jar's classes be injected into OpenLineage?
*Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?
*Thread Reply:* This is described here. Notably:
> Custom implementations are registered by following Java's ServiceLoader conventions. A file called io.openlineage.spark.api.OpenLineageEventHandlerFactory must exist in the application or jar's META-INF/services directory. Each line of that file must be the fully qualified class name of a concrete implementation of OpenLineageEventHandlerFactory. More than one implementation can be present in a single file. This might be useful to separate extensions that are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, while another factory may contain GCP extensions.
*Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[âŠ]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory
*Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!
@channel @Will Johnson
OpenLineage 0.11.0 is now available!
We added:
• an HTTP option to override timeout and properly close connections in the openlineage-java lib,
• dynamic mapped tasks support to the Airflow integration,
• a SqlExtractor to the Airflow integration,
• PMD to Java and Spark builds in CI.
We changed:
• when testing extractors in the Airflow integration, the extractor list length assertion is now dynamic,
• templates are rendered at the start of integration tests for the TaskListener in the Airflow integration.
Thanks to all the contributors who made this release possible!
For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.11.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.10.0...0.11.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi all, I am using openlineage-spark in my project where I lock the dependency versions in gradle.lockfile. After release 0.10.0, this is not working. Is this a known limitation of switching to splitting the integration into submodules?
*Thread Reply:* Can you expand on what's not working exactly?
This is not something we're aware of.
*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the open lineage library into the new uber jar. This worked fine till 0.9.0 but now building the shadowJar gives this error
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
> Could not find spark:app:0.10.0.
Searched in the following locations:
- <https://repo.maven.apache.org/maven2/spark/app/0.10.0/app-0.10.0.pom>
If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.
Required by:
project : > io.openlineage:openlineage_spark:0.10.0
> Could not find spark:shared:0.10.0.
Searched in the following locations:
- <https://repo.maven.apache.org/maven2/spark/shared/0.10.0/shared-0.10.0.pom>
If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.
Required by:
project : > io.openlineage:openlineage_spark:0.10.0
> Could not find spark:spark2:0.10.0.
Searched in the following locations:
- <https://repo.maven.apache.org/maven2/spark/spark2/0.10.0/spark2-0.10.0.pom>
If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.
Required by:
project : > io.openlineage:openlineage_spark:0.10.0
> Could not find spark:spark3:0.10.0.
Searched in the following locations:
- <https://repo.maven.apache.org/maven2/spark/spark3/0.10.0/spark3-0.10.0.pom>
If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.
Required by:
project : > io.openlineage:openlineage_spark:0.10.0
*Thread Reply:* Can you try 0.11? I think we might have already fixed that.
*Thread Reply:* Tried with that as well. Doesn't work
*Thread Reply:* Same error with 0.11.0 as well
*Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml
but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module
*Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required
*Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources
*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^
*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:
repositories {
mavenCentral() {
metadataSources {
mavenPom()
ignoreGradleMetadataRedirection()
}
}
}
I am able to build the jar now. I am not proficient in gradle so don't know if this is the right way to do this. Please correct me if I am wrong.
*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in subproject's build file). But this is a different problem ig
*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar.
Java Question: Why Can't I Find a Class on the Class Path? / How the heck does the ClassLoader know where to find a class?
Are there any java pros that would be willing to share alternatives to searching if a given class exists or help explain what should change in the Kusto package to make it work for the behaviors as seen in Kafka and SQL DW relation visitors? --- Details --- @Hanna Moazam and I are trying to introduce two new Azure data sources into OpenLineage's Spark integration. The https://github.com/Azure/azure-kusto-spark package is nearly done but we're getting tripped up on some Java concepts. In order to know if we should add the KustoRelationVisitor to the input dataset visitors, we need to see if the Kusto jar is installed on the spark / databricks cluster. In this case, the com.microsoft.kusto.spark.datasource.DefaultSource is a public class but it cannot be found using the KustoRelationVisitor.class.getClassLoader().loadClass("class name") methods as seen in:
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]nlineage/spark/agent/lifecycle/plan/SqlDWDatabricksVisitor.java
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]penlineage/spark/agent/lifecycle/plan/KafkaRelationVisitor.java
At first I thought it was the Azure packages, but then I tried the same approach with a simple java library
I instantiate a spark-shell like this
spark-shell --master local[4] \
--conf spark.driver.extraClassPath=/mnt/repos/SparkListener-Basic/lib/build/libs/custom-listener.jar \
--conf spark.extraListeners=listener.MyListener \
--jars /mnt/repos/wjtestlib/lib/build/libs/lib.jar
With lib.jar containing a class that looks like this:
```package wjtestlib;
public class WillLibrary {
    public boolean someLibraryMethod() {
        return true;
    }
}```
And the custom listener is very simple.
```public class MyListener extends org.apache.spark.scheduler.SparkListener {
    private static final Logger log = LoggerFactory.getLogger("MyLogger");

    public MyListener() {
        log.info("INITIALIZING");
    }

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        log.info("MYLISTENER: ON JOB START");
        try {
            log.info("Trying wjtestlib.WillLibrary");
            MyListener.class.getClassLoader().loadClass("wjtestlib.WillLibrary");
            log.info("Got wjtestlib.WillLibrary");
        } catch (ClassNotFoundException e) {
            log.info("Could not get wjtestlib.WillLibrary");
        }
        try {
            log.info("Trying wjtestlib.WillLibrary using Class.forName");
            Class.forName("wjtestlib.WillLibrary", false, this.getClass().getClassLoader());
            log.info("Got wjtestlib.WillLibrary using Class.forName");
        } catch (ClassNotFoundException e) {
            log.info("Could not get wjtestlib.WillLibrary using Class.forName");
        }
    }
}```
And I still get a result indicating it cannot find the class.
```2022-07-12 23:58:22,048 INFO MyLogger: MYLISTENER: ON JOB START
2022-07-12 23:58:22,048 INFO MyLogger: Trying wjtestlib.WillLibrary
2022-07-12 23:58:22,057 INFO MyLogger: Could not get wjtestlib.WillLibrary
2022-07-12 23:58:22,058 INFO MyLogger: Trying wjtestlib.WillLibrary using Class.forName
2022-07-12 23:58:22,065 INFO MyLogger: Could not get wjtestlib.WillLibrary using Class.forName```
Thank you for any guidance!
*Thread Reply:* Could you unzip the created jar and verify that the classes you're trying to use are present? Perhaps there's some relocate in the shadowJar plugin, which renames the classes. Making sure the classes are present in the jar is a good place to start.
Then you can try doing Class.forName just from the spark-shell without any listeners added. The classes should be available there.
*Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.
It looks like Databricks AND open source Spark do some magic when you install a library OR use --jars on the spark-shell. In both Databricks and Apache Spark, the thread running the SparkListener cannot see the additional libraries installed unless they're on the original / main class path.
• Confirmed the uploaded jars are NOT shaded / renamed.
• The databricks class path ($CLASSPATH) is focused on /databricks/jars
• The added libraries are in /local_disk0/tmp and are not found in $CLASSPATH.
• The sparklistener only recognizes $CLASSPATH.
• Using a classloader with an object like spark does not find our installed class: spark.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
• When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
@Michael Collado and @Maciej Obuchowski have you seen any challenges with using --jars on the spark-shell and detecting if the class is installed?
*Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars, but getting them from Maven Central, not local disk, and it works, like in KafkaRelationVisitor.
What if you did it like that? By that I mean adding it to your code with compileOnly in gradle or provided in maven, compiling with it, then using a static method to check if it loads?
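For reference, a rough sketch of the pattern Maciej describes here (compile against the connector as a compileOnly/provided dependency, then probe at runtime whether it is actually present - mirroring what the Kafka and SQL DW visitors do). The class name comes from the Kusto connector mentioned above; treat this as an illustration rather than the actual OpenLineage code:
```
public class KustoClassPresenceCheck {
    // The connector is only a compileOnly/provided dependency, so it may or may
    // not be on the runtime classpath - probe for it instead of importing it.
    public static boolean hasKustoClasses() {
        try {
            KustoClassPresenceCheck.class.getClassLoader()
                .loadClass("com.microsoft.kusto.spark.datasource.DefaultSource");
            return true;
        } catch (Throwable e) {
            // ClassNotFoundException, NoClassDefFoundError, etc. all mean "not installed"
            return false;
        }
    }
}
```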
*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
Isn't that this exact scenario?
*Thread Reply:* Thank you for the reply, Maciej!
I will try the compileOnly route tonight!
Re: myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
I failed to mention that this was only achieved in the interactive shell / Databricks notebook. It never worked inside the SparkListener UNLESS we installed the Kusto jar on the databricks class path.
*Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.
More doc can be found here: (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management)
When starting a SparkContext, all the jars available on the classpath should be listed and put into Spark logs. So that's the place one can check if the jar is loaded or not.
If --conf spark.driver.extraClassPath is working, you can add multiple jar files there (they must be separated by commas).
Other examples of adding multiple jars to spark classpath can be found here -> https://sparkbyexamples.com/spark/add-multiple-jars-to-spark-submit-classpath/
*Thread Reply:* @PaweĆ LeszczyĆski thank you for the reply! Hanna and I experimented with jars vs extraClassPath.
When using jars, the spark listener does NOT find the class using a classloader.
When using extraClassPath, the spark listener DOES find the class using a classloader.
When using --jars, we can see in the spark logs that after spark starts (and after the spark listener is already established?) there are Spark.AddJar commands being executed.
@Maciej Obuchowski we also experimented with doing a compileOnly on OpenLineage's spark listener, it did not change the behavior. OpenLineage still failed to identify that I had the kusto-spark-connector.
I'm going to reach out to Databricks to see if there is any guidance on letting the SparkListener be aware of classes added via their libraries / --jar method on the spark-shell.
*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta
*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado have something to add?
*Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?
*Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.
*Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0
*Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:
Spark Shell command
spark-shell --master local[4] \
--conf spark.driver.extraClassPath=/customListener-1.0-SNAPSHOT.jar \
--conf spark.extraListeners=listener.MyListener \
--jars /WillLibrary.jar \
--packages com.microsoft.azure.kusto:kusto_spark_3.0_2.12:3.0.0
Code inside customListener:
try {
    log.info("Trying Kusto DefaultSource");
    MyListener.class.getClassLoader().loadClass("com.microsoft.kusto.spark.datasource.DefaultSource");
    log.info("Got Kusto DefaultSource!!!!");
} catch (ClassNotFoundException e) {
    log.info("Could not get Kusto DefaultSource");
}
Logs indicating it still can't find the class when using --packages.
2022-07-14 10:47:35,997 INFO MyLogger: MYLISTENER: ON JOB START
2022-07-14 10:47:35,997 INFO MyLogger: Trying wjtestlib.WillLibrary
2022-07-14 10:47:36,000 INFO 2022-07-14 10:47:36,052 INFO MyLogger: Trying LogicalRelation
2022-07-14 10:47:36,053 INFO MyLogger: Got logical relation
2022-07-14 10:47:36,053 INFO MyLogger: Trying Kusto DefaultSource
2022-07-14 10:47:36,064 INFO MyLogger: Could not get Kusto DefaultSource
đą
*Thread Reply:* what if you load your listener using also packages?
*Thread Reply:* That's how I'm doing it locally using spark.conf:
spark.jars.packages com.google.cloud.bigdataoss:gcs_connector:hadoop3-2.2.2,io.delta:delta_core_2.12:1.0.0,org.apache.iceberg:iceberg_spark3_runtime:0.12.1,io.openlineage:openlineage_spark:0.9.0
*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man!
đ
2022-07-14 11:14:21,266 INFO MyLogger: Trying LogicalRelation
2022-07-14 11:14:21,266 INFO MyLogger: Got logical relation
2022-07-14 11:14:21,266 INFO MyLogger: Trying org.apache.iceberg.catalog.Catalog
2022-07-14 11:14:21,295 INFO MyLogger: Got org.apache.iceberg.catalog.Catalog!!!!
2022-07-14 11:14:21,295 INFO MyLogger: Trying Kusto DefaultSource
2022-07-14 11:14:21,361 INFO MyLogger: Got Kusto DefaultSource!!!!
I ended up setting my spark-shell like this (and used --jars for my custom spark listener since it's not on Maven).
spark-shell --master local[4] \
--conf spark.extraListeners=listener.MyListener \
--packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0,com.microsoft.azure.kusto:kusto_spark_3.0_2.12:3.0.0 \
--jars customListener-1.0-SNAPSHOT.jar
So, now I just need to figure out how Databricks differs from this approach đą
*Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages
Remember Java's ClassLoaders are hierarchical - there are parent ClassLoaders and child ClassLoaders. Parents can't see their children's classes, but children can see their parent's classes.
When you use --spark.driver.extraClassPath, you're adding a jar to the main application ClassLoader. But when you use --jars or --packages, you're instructing the Spark application itself to load the extra jars into its own ClassLoader - a child of the main application ClassLoader that the Spark code creates and manages separately. Since your listener class is loaded by the main application ClassLoader, it can't see any classes that are loaded by the Spark child ClassLoader. Either both jars need to be on the driver classpath or both jars need to be loaded by the --jars or --packages configuration parameter
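To make the parent/child visibility rule above concrete, here is a small, self-contained sketch in plain Java - nothing Spark- or OpenLineage-specific; the jar path and com.example.ExtraClass are placeholders:
```
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderVisibilityDemo {
    public static void main(String[] args) throws Exception {
        // Parent: the class loader that loaded this class (the "application" loader).
        ClassLoader parent = ClassLoaderVisibilityDemo.class.getClassLoader();

        // Child: loads classes from an extra jar, delegating to the parent first.
        // (the path is a stand-in; in Spark, the session's loader plays this role for --jars/--packages)
        URLClassLoader child = new URLClassLoader(
                new URL[] {new URL("file:///tmp/extra-lib.jar")}, parent);

        // Works: a child can always see classes its parent already knows about.
        System.out.println(child.loadClass("java.util.ArrayList"));

        // Fails if com.example.ExtraClass only exists inside extra-lib.jar:
        // a parent has no visibility into classes loaded by its children.
        try {
            parent.loadClass("com.example.ExtraClass");
        } catch (ClassNotFoundException e) {
            System.out.println("parent cannot see the child's classes");
        }

        // Asking the child works (assuming the class really is in extra-lib.jar):
        // child.loadClass("com.example.ExtraClass");
        child.close();
    }
}
```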
*Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other
*Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)
*Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.
https://docs.microsoft.com/en-us/azure/databricks/libraries/
These packages are installed on databricks AFTER spark is started and the physical files are located in a folder that is different than the main classpath.
I'm going to reach out to Databricks and see if we can get any guidance on this đą
*Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.
*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷
*Thread Reply:* Quick update:
• Turns out using a class loader from a Scala spark listener does not have this problem.
• https://stackoverflow.com/questions/7671888/scala-classloaders-confusion
• I'm trying to use URLClassLoader as recommended by a few MSFT folks and point it at the /local_disk0/tmp folder.
• https://stackoverflow.com/questions/17724481/set-classloader-different-directory
• I'm not having luck so far but hoping I can reason about it tomorrow and Monday.
This is blocking us from adding additional data sources that are not pre-installed on databricks.
*Thread Reply:* Can't help you now, but I'd love it if you dumped the knowledge you've gained through this process into some doc on the new OpenLineage doc site.
*Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too
@channel The next OpenLineage TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1657204421157959
check this out folks - marklogic datahub flow lineage into OL/marquez with jobs and runs and more. i would guess this is a pretty narrow use case but it went together really smoothly and thought i'd share - sometimes it's just cool to see what people are working on
*Thread Reply:* Soo cool, @David Cecchi. I'm not familiar with marklogic, but pretty awesome ETL platform and the lineage graph looks great! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)
*Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.
*Thread Reply:* Ah, totally makes sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high level)
*Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox đ.
*Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson, who can follow up). Even if it's in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!
*Thread Reply:* chiming in as well to say this is really cool đ
*Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?
*Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.
@Ross Turk, in the OL community meeting today, you presented the new doc site (awesome!) that isn't up (yet!), but I've been talking with @Julien Le Dem about the usage of _producer and would like to add a section on the use / function of _producer in OL events. Feel like the new doc site would be a great place to add this! Let me know when's a good time to start crowd sourcing content for the site
*Thread Reply:* That sounds like a good idea to me. It'd be good to have some guidance on that.
The repo is open for business! Feel free to add the page where you think it fits.
*Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!
Hey everyone! As Willy says, there is a new documentation site for OpenLineage in the works.
It's not quite ready to be, uh, a proper reference yet. But it's not too far away. Help us get there by submitting issues, making page stubs, and adding sections via PR.
https://github.com/openlineage/docs/
*Thread Reply:* Thanks, @Ross Turk, for finding a home for more technical / how-to docs… long overdue!
*Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.
*Thread Reply:* great, was using docs.openlineage.io … we'll eventually want the docs to live under the docs subdomain though?
*Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website
*Thread Reply:* and it came live at openlineage.io/docs đ
*Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo
Hi #general, how do i link the tasks of airflow which may not have any input or output datasets, as they are just running some conditions? the dataset is generated only in the last task
In the lineage, though there is an option to link the parent, it doesn't show the lineage of job -> job
*Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data
*Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces
i am looking for some information regarding openlineage integration with AWS Glue jobs/workflows
i am wondering if it is possible and whether someone has already given it a try and maybe documented it?
*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?threadts=1637605977.118000&cid=C01CK9T7HKR
*Thread Reply:* TL;Dr: you can use the spark integration to capture some lineage, but it's not comprehensive
*Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.
*Thread Reply:* There are a couple of aws people here (including me) following.
Hi all, I have been playing around with Marquez for a hackday. I have been able to get some lineage information loaded in (using the local docker version for now). I have been trying to set the location (for the link) and description information for a job (the text saying "Nothing to show here") but I haven't been able to figure out how to do this using the /lineage api. Any help would be appreciated.
*Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.
*Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217
*Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?
Are these documented anywhere?
*Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.
*Thread Reply:* I couldn't find a curl example with a description field, but I did generate this one with a sql field:
{
"job": {
"name": "order_analysis.find_popular_products",
"facets": {
"sql": {
"query": "DROP TABLE IF EXISTS top_products;\n\nCREATE TABLE top_products AS\nSELECT\n product,\n COUNT(order_id) AS num_orders,\n SUM(quantity) AS total_quantity,\n SUM(price ** quantity) AS total_value\nFROM\n orders\nGROUP BY\n product\nORDER BY\n total_value desc,\n num_orders desc;",
"_producer": "https: //github.com/OpenLineage/OpenLineage/tree/0.11.0/integration/airflow",
"_schemaURL": "<https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet>"
}
},
"namespace": "workshop"
},
"run": {
"runId": "13460e52-a829-4244-8c45-587192cfa009",
"facets": {}
},
"inputs": [
...
],
"outputs": [
...
],
"producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.11.0/integration/airflow>",
"eventTime": "2022-07-20T00: 23: 06.986998Z",
"eventType": "COMPLETE"
}
*Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets
*Thread Reply:* itâs designed so that facets can exist outside the core, in other repos, as well
*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:
{
"facets": {
"description": "test-description-job",
"sql": {
"query": "SELECT QUERY",
"_schema": "<https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet>"
},
"documentation": {
"documentation": "Test docs?",
"_schema": "<https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DocumentationJobFacet>"
},
"link": {
"type": "",
"url": "<a href="http://www.google.com/test_url">www.google.com/test_url</a>",
"_schema": "<https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet>"
}
}
}
*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work
Hey all. I've been doing a cleanup of issues on GitHub. If I've closed your issue that you think is still relevant, please reopen it and let us know.
Is https://databricks.com/blog/2022/06/08/announcing-the-availability-of-data-lineage-with-unity-catalog.html using OpenLineage? I know there's been a lot of work to make sure OpenLineage integrates with Databricks, even earlier this year.
*Thread Reply:* There's a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there's not currently a connection between OL and the Unity Catalog.
I think it would be cool to see some discussions start to develop around it.
*Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.
*Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ - I wasn't sure about Unity Catalog)
*Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.
*Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well
Hi all, I am trying to find the state of columnLineage in OL. I see a proposal and some examples in https://github.com/OpenLineage/OpenLineage/search?q=columnLineage&type= but I can't find it in the spec. Can anyone shed any light on why this would be the case?
*Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec: https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=
*Thread Reply:* It is supported in the Spark integration
*Thread Reply:* @PaweĆ LeszczyĆski could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
Putting together some internal training for OpenLineage and highlighting some of the areas that have been useful to me on my journey with OpenLineage. Many thanks to @Michael Collado, @Maciej Obuchowski, and @PaweĆ LeszczyĆski for the continued technical support and guidance.
*Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind đ
*Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need
*Thread Reply:* This looks nice by the way.
hi all, really appreciate if anyone could help. I have been trying to create a poc project with openlineage with dbt. attached will be the pip list of the openlineage packages that i have. However, when i run the "dbt-ol" command, it prompted to open as a file, instead of running as a command. the regular dbt run can be executed without issue. i wonder what i had done wrong or if there is any configuration that i have missed. Thanks a lot
*Thread Reply:* do you have proper execute permissions?
*Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable
*Thread Reply:* yes i have admin rights. how to make this as executable?
*Thread Reply:* btw do we have a sample docker image where dbt-ol can run?
*Thread Reply:* I have also never tried on Windows, but you might try python3 dbt-ol run?
Running a single unit test on the Spark Integration - How it works with the different modules?
Prior to splitting up the OpenLineage spark integration, I could run a command like the one below to test a single test or even a single test method. Now I get a failure and it's pointing to the app: module. Can anyone share the right syntax for running a unit test with the current package structure? Thank you!!
```wj@DESKTOP-ECF9QME:~/repos/OpenLineageWill/integration/spark$ ./gradlew test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest
> Task :app:test FAILED
SUCCESS: Executed 0 tests in 872ms
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':app:test'.
> No tests found for given includes: io.openlineage.spark.agent.OpenLineageSparkListenerTest

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
See https://docs.gradle.org/7.4/userguide/command_line_interface.html#sec:command_line_warnings
BUILD FAILED in 2s
18 actionable tasks: 4 executed, 14 up-to-date```
*Thread Reply:* This may be a result of splitting the Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from the shared submodule (this one looks like that), you could try running:
./gradlew :shared:test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest
*Thread Reply:* @PaweĆ LeszczyĆski, I tried running that command, and I get the following error:
```> Task :shared:test FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':shared:test'.
> No tests found for given includes: io.openlineage.spark.agent.OpenLineageSparkListenerTest

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
See https://docs.gradle.org/7.4/userguide/command_line_interface.html#sec:command_line_warnings
BUILD FAILED in 971ms
6 actionable tasks: 2 executed, 4 up-to-date```
*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in
OpenLineage/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/plan
are being run at all.
*Thread Reply:* That's interesting. Let's ask @Tomasz Nazarewicz about that.
*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following:
./gradlew :shared:spotlessApply && ./gradlew :app:spotlessApply && ./gradlew clean build test
*Thread Reply:* I'll look into it
*Thread Reply:* Update: some tests appeared to not be visible after the split; that's fixed, but now I have to solve some dependency issues
*Thread Reply:* That's great, thank you!
*Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?
*Thread Reply:* I'm still testing it, should've changed it to draft, sorry
*Thread Reply:* No worries! If I can help with testing or anything please let me know!
*Thread Reply:* Will do! Thanks :)
*Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.
*Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test, but one is failing with ./gradlew app:build. But if it fixes your problem I can disable this test for now and make a PR without it; then you can maybe unblock your stuff and I will have more time to investigate the issue.
*Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.
*Thread Reply:* Thank you for your help Tomasz!
*Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes
*Thread Reply:* its waiting for review currently
Is there any doc yet about column level lineage? I see a spec for the facet here: https://github.com/openlineage/openlineage/issues/148
*Thread Reply:* The doc site would benefit from a page about it. Maybe @PaweĆ LeszczyĆski?
*Thread Reply:* Sure, it's already on my list, will do
*Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage
maybe another question for @PaweĆ LeszczyĆski: I was watching the Airflow summit talk that you and @Maciej Obuchowski did ( very nice! ). How is this exposed? I'm wondering if it shows up as an edge on the graph in Marquez? ( I guess it may be tracked as a parent run and if so probably does not show on the graph directly at this time? )
*Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.
@Michael Collado or @Maciej Obuchowski: are you able to create some doc? I think one of you was working on that.
*Thread Reply:* Yes, parent run
Hi #general, there has been an issue with airflow+dbt+openlineage. This was working fine with openlineage-dbt v0.11.0, but there has been some change to typing extensions due to which i had to upgrade to the latest dbt (from 1.0.0 to 1.1.0), and now dbt-ol is failing with schema version support (the version generated is v5 vs dbt-ol supports only v4). Has anyone else been able to fix this?
*Thread Reply:* Will take a look
*Thread Reply:* But generally this support message is just a warning
*Thread Reply:* @shweta p any actual error you've found? I've tested it with dbt-bigquery on 1.1.0 and it works despite warning:
â small OPENLINEAGE_URL=<http://localhost:5050> dbt-ol build
Running OpenLineage dbt wrapper version 0.11.0
This wrapper will send OpenLineage events at the end of dbt execution.
14:03:16 Running with dbt=1.1.0
14:03:17 Found 2 models, 3 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
14:03:17
14:03:17 Concurrency: 2 threads (target='dev')
14:03:17
14:03:17 1 of 5 START table model dbt_test1.my_first_dbt_model .......................... [RUN]
14:03:21 1 of 5 OK created table model dbt_test1.my_first_dbt_model ..................... [CREATE TABLE (2.0 rows, 0 processed) in 3.31s]
14:03:21 2 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]
14:03:22 2 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 1.55s]
14:03:22 3 of 5 START view model dbt_test1.my_second_dbt_model .......................... [RUN]
14:03:24 3 of 5 OK created view model dbt_test1.my_second_dbt_model ..................... [OK in 1.38s]
14:03:24 4 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]
14:03:24 5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]
14:03:25 5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 1.38s]
14:03:25 4 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 1.42s]
14:03:25
14:03:25 Finished running 1 table model, 3 tests, 1 view model in 8.44s.
14:03:25
14:03:25 Completed successfully
14:03:25
14:03:25 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Artifact schema version: <https://schemas.getdbt.com/dbt/manifest/v5.json> is above dbt-ol supported version 4. This might cause errors.
Emitting OpenLineage events: 100%|██████████| 8/8 [00:00<00:00, 274.42it/s]
Emitted 10 openlineage events
When will the next version of OpenLineage be available tentatively?
*Thread Reply:* I think it's safe to say we'll see a release by the end of next week
👋 Hi everyone! Yesterday was a great presentation by @Julien Le Dem that talked about OpenLineage and did a great comparison between OL and OpenTelemetry (i wrote a small summary here: https://bit.ly/3z5caOI )
Julien's charm sparked curiosity in me, especially regarding OL in streaming. Having seen the design/architecture of OL, I have some questions/discussion points that I would like to understand better.
In the context of streaming jobs, reporting "start job" - "end job" might be more relevant to batch mode. Or do you mean reporting start job/end job should be processed for each event?
Thank you in advance
*Thread Reply:* Welcome to the community!
We talked about this exact topic in the most recent community call. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nextmeeting:Nov10th2021(9amPT)
Discussion: streaming in Flink integration
• Has there been any evolution in the thinking on support for streaming?
  ◦ Julien: start event, complete event, snapshots in between limited to a certain number per time interval
  ◦ Paweł: we can make the snapshot volume configurable
• Does Flink support sending data to multiple tables like Spark?
  ◦ Yes, multiple outputs supported by the OpenLineage model
  ◦ Marquez, the reference implementation of OL, combines the outputs
*Thread Reply:* > or do you mean reporting start job/end job should be processed for each event?
We definitely want to avoid tracking every single event.
One thing worth mentioning is that OpenLineage events are meant to be cumulative - the streaming jobs start, run, and eventually finish or restart. In the meantime, we capture additional events "in the middle" - for example, on Apache Flink checkpoint, or every few minutes - where we can emit additional information connected to the state of the job.
*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer
> jobs start, run, and eventually finish or restart
This is the perspective that I have a hard time understanding in the context of streaming.
A classic streaming job should always be on; it should not have a "finish" event (except on failure). Usually, streaming data is "dripping" in.
It is possible to understand the job's start/end at the resolution of the running application, representing when the application began and when it failed.
If you emit start/stop events from the checkpoints on Flink, it might be the wrong representation; instead, use an event-driven concept, for example reporting state.
What do you think?
*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense because if you, for example, changed the SQL of your Flink SQL job, you probably would want this to be captured - from X to Y the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.
> if you do start/stop events from the checkpoints on Flink it might be the wrong representation instead use the concept of event-driven for example reporting state.
But this is a misunderstanding - the information exposed from checkpoints is in addition to the start and stop events.
We want to get information from a running job - I just argue that sometimes the end of a streaming job is also relevant.
*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - am I missing something about why you want to add a StateFacet?
*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode. I agree that if you already have a "job" you want to know more information about it.
Should each event entering the sub-process (job) do a REST call for "Start job" and "End job"?
Nope, I just presented the two possible ways I thought of: either a StateFacet or adding a new event type, e.g. RUNNING
Hi everyone, I'd like to request a release to publish the new Flink integration (thanks, @Maciej Obuchowski) and an important fix to the Spark integration (thanks, @Paweł Leszczyński). As per our policy here, 3 +1s from committers will authorize an immediate release. Thanks!
*Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.
Static code annotations for OpenLineage: hi everyone, i heard yesterday a great lecture by @Julien Le Dem on OpenLineage, and as i'm very interested in this area, i wanted to raise a question: are there any plans to have OpenLineage-like annotations on actual code (e.g. Spark, AirFlow, arbitrary code) to allow deducing some of the lineage information from static code analysis?
The reason i'm asking this is because while OpenLineage does a great job of integrating with multiple platforms (AirFlow, Dbt, Spark), some companies still have a lot of legacy-related data processing stack that will probably not get full OpenLineage (as it's a one-off, and the companies themselves probably won't implement OpenLineage support for their custom frameworks). Having some standard way to annotate code with information like: "reads from X; writes to Y; Job name regexp: Z", may allow writing a "generic" OpenLineage collector that can go over the source code, collect this configuration information and then use it when constructing the lineage graph (even though it won't be as complete and full as the full OpenLineage info).
*Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.
We're doing something similar within Airflow now, but as a fallback mechanism: https://github.com/OpenLineage/OpenLineage/pull/914
You can manually annotate DAG with information instead of writing extractor for your operator. This still gives you runtime information. Similar features might get added to other integrations, especially with such a vast scope as Airflow has - but I think it's unlikely we'd work on a feature for just statically traversing code without runtime context.
*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to AirFlow, and i wonder why wouldn't we generalize this outside of just AirFlow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do. If i understand your response correctly, then you assume that OpenLineage can get wide enough "native" support across the stack without resorting to a fallback like 'static code analysis'. Is that your base assumption?
Hi all, does anybody have any experience extracting Airflow lineage using Marquez as documented here: https://www.astronomer.io/guides/airflow-openlineage/#generating-and-viewing-lineage-data ? We tested it on our Airflow instance with Marquez hoping to get the standard .json files describing lineage in accord with the open-lineage model as described in https://json-schema.org/draft/2020-12/schema. But there seems to be only one GET method related to lineage export in the Marquez API library, called "Get a lineage graph". This produces quite a different .json structure than what we know from open-lineage. Could anybody help if there is a chance to get the open-lineage .json structure from Marquez?
*Thread Reply:* The query API has a different spec than the reporting API, so what you'd get from Marquez would look different from what Marquez receives.
Few ideas:
• the lineage table in Marquez's postgres
*Thread Reply:* ok, now I understand, thank you
*Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927
But if you need just the raw events endpoint, without UI, then Marquez might be overkill for your needs
Hi @everyone, we are trying to extract lineage information and import it into Amundsen. Please point us in the right direction - based on the documentation -> Databricks + Marquez + Amundsen - is this the only way to move on?
*Thread Reply:* Short of implementing an open lineage endpoint in Amundsen, yes that's the right approach.
The Lineage endpoint in Marquez can output the whole graph centered on a node ID, and you can use the jobs/datasets apis to grab lists of each for reference
*Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py
Where is your lineage coming from?
*Thread Reply:* yes @Barak F we are using open lineage
*Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)
*Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor
Not sure it solves your issue though đ
@channel OpenLineage 0.12.0 is now available!
We added:
• an Apache Flink integration,
• support for Spark 3.3.0,
• the ability to extend the column level lineage mechanism,
• an ErrorMessageRunFacet to the OpenLineage spec,
• SQLCheckExtractors, a RedshiftSQLExtractor & RedshiftDataExtractor to the Airflow integration,
• a dataset builder to the AlterTableCommand class in the Spark integration.
We changed:
• the filtering of Delta events to reduce noise,
• the flow of metadata in the Airflow integration to allow metadata from Airflow through inlets and outlets.
Thanks to all the contributors who made this release possible!
For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.12.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.11.0...0.12.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
What is the right way of handling/parsing facets on the server side?
I see the generated server side stubs are generic: https://github.com/OpenLineage/OpenLineage/blob/main/client/java/generator/src/main/java/io/openlineage/client/Generator.java#L131 and don't have any resolved facet information. Marquez seems to have duplicated the OL model with https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/models/LineageEvent.java#L71 and converts the incoming OL events to a "LineageEvent" for appropriate handling. Is there a cleaner approach wherein the known facets can be generated in io.openlineage.server?
*Thread Reply:* I think the reason for the server model being very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and get the facet information that it can actually use, rather than reject an event because it has an unknown field.
Server model was added here after some discussion in Marquez which is relevant - I think @Michael Collado @Willy Lulciuc can add to that
*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions ?
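One common pattern for this (not necessarily what Marquez does) is to keep the server model generic and only map the handful of facets you care about out of the JSON tree, e.g. with Jackson. A rough sketch under that assumption - the facet and field names below mirror the spec's SqlJobFacet, but treat it as illustrative rather than prescribed:
```
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class KnownFacetExtractor {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Pull the SQL query out of the job's "sql" facet if present; unknown facets are simply ignored.
    static String sqlQuery(String runEventJson) throws Exception {
        JsonNode sqlFacet = MAPPER.readTree(runEventJson).at("/job/facets/sql");
        return sqlFacet.isMissingNode() ? null : sqlFacet.path("query").asText(null);
    }

    public static void main(String[] args) throws Exception {
        String event = "{\"job\":{\"facets\":{\"sql\":{\"query\":\"SELECT 1\"}}}}";
        System.out.println(sqlQuery(event)); // prints: SELECT 1
    }
}
```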
Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on August 11 at 10am PT. Reply in thread or ping me with your item(s)!
Hi all, I am trying out the openlineage spark integration and can't find any column lineage information included with the events. I tried it out with an input dataset where I renamed one of the columns but the columnLineage facet was not present. Can anyone suggest some other examples where it might show up?
Thanks!
*Thread Reply:* @PaweĆ LeszczyĆski do we collect column level lineage on renames?
*Thread Reply:* @Maciej Obuchowski no, we don't. @Varun Singh create table as select may suit you well. Other examples are within tests like:
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]lifecycle/plan/column/ColumnLevelLineageUtilsV2CatalogTest.java
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]ecycle/plan/column/ColumnLevelLineageUtilsNonV2CatalogTest.java
*Thread Reply:* I've created an issue for column lineage in case of renaming: https://github.com/OpenLineage/OpenLineage/issues/993
*Thread Reply:* Thanks @PaweĆ LeszczyĆski!
Hey everyone! I am looking into Fivetran a bit, and it occurs to me that the NAMING.md document does not have an opinion about how to deal with entire systems as datasets. More in the thread.
*Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.
This copying results in thousands of SQL queries run against the target database for each sync. I don't think each of these queries should map to an OpenLineage job; I think the entire synchronization should. Maybe I'm wrong here.
*Thread Reply:* But if I'm right, that means that there needs to be a way to specify "SalesForce Account #45123452233" as a dataset.
*Thread Reply:* or it ends up just being a job with outputs and no inputs… but that's not very illuminating
*Thread Reply:* You are looking at a pretty big topic here đ
Basically you're asking what is a job in OpenLineage - and it's not fully answered yet.
I think the discussion is kinda relevant to this proposed facet and I kinda replied there: https://github.com/OpenLineage/OpenLineage/issues/812#issuecomment-1205337556
*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset. and so maybe different objects within a salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic
*Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like "copy salesforce to snowflake".
I can see it being a nuisance having all of that on a lineage graph. OTOH, I can see it being useful to know that a datum can be traced back to a specific endpoint at SFDC.
@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, August 11 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:
👋 Hi everyone!
@channel The next OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1659627000308969
*Thread Reply:* I am so sad I'm going to miss this month's meeting đ° Looking forward to the recording!
*Thread Reply:* We missed you too @Will Johnson đ
Hi everyone! I have a REST endpoint that I use for other pipelines that can POST their RunEvent, and I forward that to Marquez. I'm expecting a JSON which has the RunEvent details, which also has the input or output dataset depending upon the EventType. I can see the Run details always show up on the Marquez UI, but the dataset has issues. I can see the dataset listed, but when I click on it, it just shows "something went wrong." I don't see any details of that dataset.
{
"eventType": "START",
"eventTime": "2022-08-09T19:49:24.201361Z",
"run": {
"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
},
"job": {
"namespace": "TEST-NAMESPACE",
"name": "test-job"
},
"inputs": [
{
"namespace": "TEST-NAMESPACE",
"name": "my-test-input",
"facets": {
"schema": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>",
"_schemaURL": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet>",
"fields": [
{
"name": "a",
"type": "INTEGER"
},
{
"name": "b",
"type": "TIMESTAMP"
},
{
"name": "c",
"type": "INTEGER"
},
{
"name": "d",
"type": "INTEGER"
}
]
}
}
}
],
"producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
}
In the above payload, the input dataset is never created in Marquez. I can only see the Run details, but the input dataset is just empty. Does the input dataset need to be created first, and only then can the RunEvent be created?
*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something
*Thread Reply:* If not, then Marquez logs might help to see something
*Thread Reply:* Does the START event needs to have an output?
*Thread Reply:* It can have empty output đ
*Thread Reply:* well, in your case you need to send a COMPLETE event
*Thread Reply:* Internally, Marquez does not create a dataset version until you send the COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until it's finished writing.
*Thread Reply:* Thanks for the explanation @Maciej Obuchowski. So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?
*Thread Reply:* @Raj Mishra Yes and no 🙂
Basically your COMPLETE event does not need to contain any input and output datasets at all - the OpenLineage model is cumulative, so it's enough to have datasets on either start or complete.
That also means you can add different datasets at different moments of a run's lifecycle - for example, you know inputs, but not outputs, so you emit inputs on START, but not COMPLETE.
Or, the job is modifying the same dataset it reads from (which happens surprisingly often). Then you want to collect various input metadata from the dataset before modifying it - most likely you won't have them on COMPLETE 🙂
In this example I've added my-test-input on START and my-test-input2 on COMPLETE:
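For illustration (not from the original thread), here is a rough sketch of those two events using the openlineage-python client; the Marquez URL, producer URI, and exact constructor signatures are assumptions and may differ slightly between client versions:

from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, InputDataset

client = OpenLineageClient(url="http://localhost:5000")  # assumed local Marquez
producer = "https://example.com/my-producer"             # hypothetical producer URI
run = Run(runId="d46e465b-d358-4d32-83d4-df660ff614dd")
job = Job(namespace="TEST-NAMESPACE", name="test-job")

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

# START carries my-test-input ...
client.emit(RunEvent(RunState.START, now(), run, job, producer,
                     inputs=[InputDataset("TEST-NAMESPACE", "my-test-input")]))

# ... COMPLETE carries my-test-input2; the backend merges both under the same runId.
client.emit(RunEvent(RunState.COMPLETE, now(), run, job, producer,
                     inputs=[InputDataset("TEST-NAMESPACE", "my-test-input2")]))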
*Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.
Effectively handling file datasets on server side. We have a common usecase where dataset of type
*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e., the OL client could pass down an Alias/Grouping facet. Or should this be treated purely as a server-side feature?
*Thread Reply:* Agreed đ
How do you produce this dataset? Spark integration? Are you using any system like Apache Iceberg/Delta Lake or just writing raw files?
*Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables
*Thread Reply:* written using the Spark dataframe API, like df.write.format("parquet").save("/tmp/spark_output/parquet") , or RDD?
*Thread Reply:* the actual API used matters, because we're handling different cases separately
*Thread Reply:* I see. Let me look that up to be absolutely sure
*Thread Reply:* It is like this: df.write.format("parquet").save("/tmp/spark_output/parquet")
*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also what if we cannot integrate OL with the frameworks that produce this dataset , but only those that consume from the already produced datasets. Is there a way we could still capture the dataset appropriately ?
*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on OL GitHub so someone can pick it up? I'm on vacation now.
*Thread Reply:* Sounds good , Ty !
Hi, Minor Suggestion: This line https://github.com/OpenLineage/OpenLineage/blob/46efab1e7c2a0aa5ebe8d11185fe8d5225[…]/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java is printing variables like api key and other parameters in the logs. Wouldn't it be more appropriate to use log.debug instead? I'll create an issue if others agree
*Thread Reply:* please do create đ
dumb question but, is it easy to run all the OpenLineage tests locally? ( and if so how? đ )
*Thread Reply:* it's per project. java based: ./gradlew test python based: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development
Spark Integration: The Order of Processing Events in the Async Event Queue
Hey, OpenLineage team, I'm working on a PR (https://github.com/OpenLineage/OpenLineage/pull/849/) that is going to store information given in different spark events (e.g. SparkListenerSQLExecutionStart, SparkListenerJobStart).
However, I want to avoid holding all this data once the execution of the job is complete. As a result, I want to remove the data once I receive a SparkListenerSQLExecutionEnd.
However, can I be guaranteed that the ExecutionEnd event will be processed AFTER the JobStart event? Is it possible that I take too long to process the JobStart event, such that the ExecutionEnd executes prior to the JobStart finishing?
I know we do something similar to this with sparkSqlExecutionRegistry (https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/mai[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java) but do we have any docs to help explain how the AsyncEventQueue orders and consumes events for a listener?
Thank you so much for any insights
*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay. I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.
*Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us đ
*Thread Reply:* @Paweł Leszczyński might want to look at that
Hi, I'm trying to find out if the OpenLineage Spark integration supports PySpark (non-SQL) use cases. Is there any doc where I could get more details about non-SQL OpenLineage support? Thanks a lot
*Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.
*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage. My company wants to integrate with openlineage-spark; I am working on figuring out what info OpenLineage makes available for non-SQL jobs, and whether it at least has support for logging the logical plan.
*Thread Reply:* Yes, it does send the logical plan as part of the event
*Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/
*Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
*Thread Reply:* you need to add the jar, set the listener and pass your OL config
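For example, a minimal PySpark setup could look roughly like this (a sketch only - the package version, the Marquez URL and the exact spark.openlineage.* config keys are assumptions; the docs linked above have the authoritative list):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage_pyspark_example")
    # pull the OpenLineage Spark integration jar (version is just an example)
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.13.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # assumed: a Marquez instance listening locally on port 5000
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)

# any regular DataFrame job will now emit lineage events through the listener
df = spark.read.option("header", "true").csv("data/input/batch/wikidata.csv")
df.write.mode("overwrite").parquet("/tmp/spark_output/parquet")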
*Thread Reply:* Actually I'm demoing this at 27:10 right here đ https://pretalx.com/bbuzz22/talk/FHEHAL/
*Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video
*Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.
*Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration
Hi everyone, a release has been requested by @Harel Shein. As per our policy here, 3 +1s from committers will authorize an immediate release. Thanks! Unreleased commits: https://github.com/OpenLineage/OpenLineage/compare/0.12.0...HEAD
*Thread Reply:* @Michael Robinson can we start posting the "Unreleased" section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release
*Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein
@channel
OpenLineage 0.13.0 is now available!
We added:
• BigQuery check support
• RUNNING EventType in the spec and Python client
• databases and schemas to SQL extractors
• an event forwarding feature via HTTP
• Azure Cosmos Handler to the Spark integration
• support for OL datasets in manual lineage inputs/outputs
• ownership facets.
We changed:
• use RUNNING EventType in the Flink integration for currently running jobs
• convert task object into JSON encodable when creating the Airflow version facet.
Thanks to all the contributors who made this release possible!
For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.13.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.12.0...0.13.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/ (edited)
*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration ?
Hi everyone, excited to work with OpenLineage. I am new to both OpenLineage and data lineage in general. Are there working examples/blog posts around actually integrating OpenLineage with existing graph DBs like Neo4j, Neptune, etc.? (I understand the service layer in between.) I understand we have Amundsen with sample OpenLineage data - databuilder/example/sample_data/openlineage/sample_openlineage_events.ndjson. Thanks in advance.
*Thread Reply:* Not that I know of, besides the Amundsen integration example you pointed at. A basic idea to do such a thing would be to implement an OpenLineage endpoint (receive the lineage events through HTTP posts) and convert them to a format the graph DB understands. If others in the community have ideas, please chime in
*Thread Reply:* Understood, thanks a lot Julien. Make sense.
Hey all, can I ask for a release for OpenLineage?
*Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.
*Thread Reply:* Thanks, all. The release is authorized
*Thread Reply:* can you also state the main purpose for this release?
*Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality
*Thread Reply:* The ParentRunFacet from the Airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that so that Marquez can handle parent run/job information.
@channel
OpenLineage 0.13.1 is now available!
We fixed:
• Rename all parentRun occurrences to parent in the Airflow integration #1037 @fm100
• Do not change task instance during on_running event #1028 @JDarDagran
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.13.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.13.0...0.13.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi, I am new to OpenLineage. Does anyone know how to enable Spark column-level lineage? I saw a code comment saying the default is disabled. Thanks
*Thread Reply:* What version of Spark are you using? it should be enabled by default for Spark 3 https://openlineage.io/docs/integrations/spark/spark_column_lineage
*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again
*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal 148 - it should be supported from 0.9.+. Am I missing something?
*Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?
*Thread Reply:* I tried reading a Hive metastore on S3 and a csv file locally. Both are missing the columnLineage facet
*Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
*Thread Reply:* spark.read \
  .option("header", "true") \
  .option("inferschema", "true") \
  .csv("data/input/batch/wikidata.csv") \
  .write \
  .mode('overwrite') \
  .csv("data/output/batch/python-sample.csv")
*Thread Reply:* This is simple code run on my local for testing
*Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.
*Thread Reply:* Specifically this commit is what implemented it. https://github.com/OpenLineage/OpenLineage/commit/ce30178cc81b63b9930be11ac7500ed34808edd3
*Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow
Hi! I have run into some issues and wanted to clarify my doubts.
• Why do input schema changes (column deletes, new columns) not show up on the UI? I have changed the input schema for the same job, but I'm not seeing it updated on the UI.
• Why is there only ever 1 input schema version? For every change I make to the input schema, I see the output schema gets multiple versions, but only 1 version for the input schema.
• Is there a reason why we can't see the input schema till the COMPLETE event is posted?
I have used the examples from here. https://openlineage.io/getting-started/
curl -X POST <http://localhost:5000/api/v1/lineage> \
-H 'Content-Type: application/json' \
-d '{
"eventType": "START",
"eventTime": "2020-12-28T19:52:00.001+10:00",
"run": {
"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
},
"job": {
"namespace": "my-namespace",
"name": "my-job"
},
"inputs": [{
"namespace": "my-namespace",
"name": "my-input"
}],
"producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
}'
curl -X POST <http://localhost:5000/api/v1/lineage> \
-H 'Content-Type: application/json' \
-d '{
"eventType": "COMPLETE",
"eventTime": "2020-12-28T20:52:00.001+10:00",
"run": {
"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
},
"job": {
"namespace": "my-namespace",
"name": "my-job"
},
"outputs": [{
"namespace": "my-namespace",
"name": "my-output",
"facets": {
"schema": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>",
"_schemaURL": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet>",
"fields": [
{ "name": "a", "type": "VARCHAR"},
{ "name": "b", "type": "VARCHAR"}
]
}
}
}],
"producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
}'
Changing the inputs schema for START doesn't change the schema input version and doesn't update the UI.
Thanks!
*Thread Reply:* Reading a dataset - which input dataset implies - does not mutate the dataset 🙂
*Thread Reply:* If you change the dataset, it would be represented as some other job with this dataset in its outputs list
*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?
*Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output
Hi Everyone, new here. I went through the docs and examples. I can't seem to understand how I can model views on top of base tables if not from a data processing job, but rather by modeling something static that comes from some software internals. i.e. I want to issue the lineage myself rather than have it learned dynamically from some Airflow DAG or Spark DAG
*Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python
*Thread Reply:* (docs in progress đ)
*Thread Reply:* can you give a hint what I should look for to model a dataset on top of another dataset? potentially also map columns?
*Thread Reply:* i can only see that i can have a dataset as input to a job run and not for another dataset
*Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.
*Thread Reply:* so OpenLineage forces me to put a job between datasets? that does not fit our use case
*Thread Reply:* unless we can some how easily hide the process that does that on the graph.
QQ, I saw that Spark column-level lineage starts with OpenLineage 0.9.+ and Spark 3.+. Does that mean we need to run a version lower than OpenLineage 0.9 if our Spark is 2.3 or 2.4?
*Thread Reply:* I don't think it will work for Spark 2.X.
*Thread Reply:* Is there a plan to support Spark 2.x?
*Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.
*Thread Reply:* I see. Thanks. Amazon Emr still support spark 2.x
Spark Integration: Handling Data Source V2 API datasets
Is it expected that a DataSourceV2 relation has a start event with inputs and outputs but a complete event with only outputs? Based on @Michael Collado's previous comments, I think it's fair to say YES this is expected and we just need to handle it. https://openlineage.slack.com/archives/C01CK9T7HKR/p1645037070719159?thread_ts=1645036515.163189&cid=C01CK9T7HKR
@Hanna Moazam and I noticed this behavior when we looked at the Cosmos DB visitor and then reproduced it for the Iceberg visitor. We traced it down to the fact that the AbstractQueryPlanInputDatasetBuilder (which is the parent of DataSourceV2RelationInputDatasetBuilder) has an isDefinedAt that only includes SparkListenerJobStart and SparkListenerSQLExecutionStart.
This means an Iceberg COMPLETE event will NEVER contain inputs because the isDefinedAt will always be false (since COMPLETE only fires for JobEnd and ExecutionEnd events). Does that sound correct (@Paweł Leszczyński)?
It seems that Delta tables (or at least Delta on Databricks) does not follow this same code path and as a result our complete events includes outputs AND inputs.
*Thread Reply:* At least for Iceberg I've done it, since I want to emit DatasetVersionDatasetFacet for the input dataset only at START - and after I finish writing, the dataset might have a different version than before writing.
*Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.
*Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?
*Thread Reply:* Yes
*Thread Reply:* As usual, thank you Maciej for the responses and insights!
QQ team, I use spark sql with openlineage namespace weblog: spark.sql("select * from weblog where dt=")
*Thread Reply:* Can anyone help with it? Did I miss something?
Hi everyone, I'm opening up a vote on this month's OpenLineage release. 3 +1s from committers will authorize. Additions include support for KustoRelationHandler in Kusto (Azure Data Explorer) and for ABFSS and Hadoop Logical Relation, both in the Spark integration. All commits can be found here: https://github.com/OpenLineage/OpenLineage/compare/0.13.1...HEAD. Thanks in advance!
*Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.
Is there a reference on how to deploy openlineage on a Non AWS infrastructure ?
*Thread Reply:* Which integration are you looking to implement?
And what environment are you looking to deploy it on? The Cloud? On-Prem?
*Thread Reply:* We are planning to deploy on premise with Kerberos as authentication for postgres
*Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?
https://github.com/OpenLineage/OpenLineage/tree/main/integration
*Thread Reply:* I am looking to deploy Marquez on-prem with onprem postgres as back-end with Kerberos authentication.
*Thread Reply:* Is this the right forum for Marquez as well, or is there a different slack channel for Marquez?
*Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.
@channel
OpenLineage 0.14.0 is now available!
We added:
• Support ABFSS and Hadoop Logical Relation in Column-level lineage #1008 @wjohnson
• Add Kusto relation visitor #939 @hmoazam
• Add ColumnLevelLineage facet doc #1020 @julienledem
• Include symlinks dataset facet #935 @pawel-big-lebowski
• Add support for dbt 1.3 beta's metadata changes #1051 @mobuchowski
• Support Flink 1.15 #1009 @mzareba382
• Add Redshift dialect to the SQL integration #1066 @mobuchowski
We changed:
• Make the timeout configurable in the Spark integration #1050 @tnazarew
We fixed:
• Add a dialect parameter to Great Expectations SQL parser calls #1049 @collado-mike
• Fix Delta 2.1.0 with Spark 3.3.0 #1065 @pawel-big-lebowski
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.14.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.13.1...0.14.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
*Thread Reply:* Thanks for breaking up the changes in the release! Love the new format đŻ
Hello all, I'm requesting a patch release to fix a bug in the Spark integration. Currently, OpenlineageSparkListener fails when no openlineage.timeout is provided. PR #1069 by @Paweł Leszczyński, merged today, will fix it. As per our policy here, 3 +1s from committers will authorize an immediate release.
*Thread Reply:* Is PR #1069 all that's going in 0.14.1?
*Thread Reply:* There's also 1058. 1069 is urgently needed. We can technically wait…
*Thread Reply:* (edited prior message because I'm not sure how accurately I was describing the issue)
*Thread Reply:* Thanks, all. The release is authorized.
*Thread Reply:* 1058 also fixes some bugs
Hello all, question: Views on top of base table is also a use case for lineage and there is no job in between. i dont seem to find a way to have a dataset on top of others to represent a view on top of tables. is there a way to do that without a job in between?
*Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations
Besides that, there is this proposal that did not get enough love yet https://github.com/OpenLineage/OpenLineage/issues/323
*Thread Reply:* but we are not working with dbt. we try to model the lineage of our internal view/table hierarchy, which is related to a proprietary application of ours. so we like that OpenLineage lets me explicitly model stuff and not only via scanning some DW. but in that case we don't want a job in between.
*Thread Reply:* this PR does not seem to support lineage between datasets
*Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.
In OpenLineage, something observes the lineage relationship being created.
*Thread Reply:* It's a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.
*Thread Reply:* so in this case, according to openlineage đ, the job would be whatever runs within the pipeline that creates the view. very operational point of view.
*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base table relationships
*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job ?
*Thread Reply:* would you say that because this is my use case i might better choose some other lineage tool?
*Thread Reply:* for the context: i am not talking about view and table definitions in some warehouse e.g. SF, but an internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility
*Thread Reply:* Ah, gotcha. Yeah, I would say it's probably best to create a job in this case. You can send the view definition using a source code facet, so it will be collected as well. You'd want to send START and STOP events for it.
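For illustration, a rough sketch of that approach with the Python client - the namespace, names and URL are made up, and the SqlJobFacet here just stands in for whatever facet you use to carry the view definition:

from datetime import datetime, timezone
from uuid import uuid4
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, InputDataset, OutputDataset
from openlineage.client.facet import SqlJobFacet

client = OpenLineageClient(url="http://localhost:5000")   # assumed OL backend
producer = "https://example.com/view-registry"            # hypothetical producer URI
view_sql = "CREATE VIEW my_view AS SELECT a, b FROM base_table_a JOIN base_table_b USING (id)"

run = Run(runId=str(uuid4()))
# the "job" here is simply "whatever defines/materializes the view"
job = Job(namespace="my-namespace", name="define_view.my_view",
          facets={"sql": SqlJobFacet(query=view_sql)})
inputs = [InputDataset("my-namespace", "base_table_a"),
          InputDataset("my-namespace", "base_table_b")]
outputs = [OutputDataset("my-namespace", "my_view")]

event_time = datetime.now(timezone.utc).isoformat()
client.emit(RunEvent(RunState.START, event_time, run, job, producer, inputs, outputs))
client.emit(RunEvent(RunState.COMPLETE, event_time, run, job, producer, inputs, outputs))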
*Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express "the system was made aware that these datasets are related, but did not observe the relationship being created, so it can't tell you i.e. how long it took or whether it changed over time"
@channel
OpenLineage 0.14.1 is now available!
We fixed:
• Fix Spark integration issues, including an error when no openlineage.timeout is provided #1069 @pawel-big-lebowski
Bug fixes were also included in this release.
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.14.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.14.0...0.14.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hello, any future plans for integrating Airbyte with openlineage?
*Thread Reply:* Hey, @data_fool! Not in the near term, but of course we'd love to see this happen. We're open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?
*Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!
Hi can you create lineage across namespaces? Thanks
*Thread Reply:* Any example or ticket on how to lineage across namespace
Hello, Does OpenLineage support column level lineage?
*Thread Reply:* Yes https://openlineage.io/blog/column-lineage/
*Thread Reply:* • More details on Spark & column-level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage
• Proposal on how to implement column-level lineage in Marquez (implementation is currently work in progress): https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md
@Iftach Schonbaum let us know if you find the information useful.
where can i find docs on just simply using extractors? without marquez. for example, a basic BashOperator on Airflow 1.10.15
*Thread Reply:* or is it automatic for anything that exists in extractors/?
*Thread Reply:* Yes
*Thread Reply:* so anything i add to the extractors directory with the same name as the operator will automatically extract the metadata from the operator, is that correct?
*Thread Reply:* please take a look at the source code of one of the extractors
*Thread Reply:* also, there are docs available at openlineage.io/docs
*Thread Reply:* ok, i'll take a look. i think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos i found were integrated with marquez
*Thread Reply:* I see. Marquez is a openlineage backend that stores the lineage data, so many examples do need them.
*Thread Reply:* If you do not want to run marquez but just test out the openlineage, you can also take a look at OpenLineage Proxy.
*Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to
*Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read
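To make that doc a bit more concrete, a bare-bones custom extractor might look roughly like this (MyOperator and the empty dataset lists are placeholders; the doc above describes the real interface):

from typing import List, Optional
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # the extractor is picked for tasks whose operator class name is listed here
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task's operator instance - pull whatever metadata you need from it
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # OpenLineage Dataset objects the task reads
            outputs=[],  # OpenLineage Dataset objects the task writes
        )

If I remember correctly, you then point the integration at it with an env var along the lines of OPENLINEAGE_EXTRACTOR_MyOperator=full.module.path.MyOperatorExtractor - the doc above has the exact mechanism.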
*Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it đ
Hey team! I'm pretty new to the field in general
In the real world, I would be running pyspark scripts on AWS EMR. Could you explain to me how the metadata is sent to Marquez from my pyspark script, and where it's persisted?
Would I need to set up an S3 bucket to store the lineage data?
I'm also unsure about how I would run the Marquez UI on AWS - would I need to have an EC2 instance running permanently in order to access that UI?
*Thread Reply:* In my head, I have:
Pyspark script -> Store metadata in S3 -> Marquez UI gets data from S3 and displays it
I suspect this is incorrect?
*Thread Reply:* It's more like: you add the OpenLineage jar to the Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez),
• send as an event onto Kafka,
• print it onto the console.
There is no S3 in between Spark & Marquez by default. Marquez serves both as an API where events are sent and a UI to investigate them.
*Thread Reply:* Yeah S3 was just an example for a storage option.
I actually found the answer I was looking for, turns out I had to look at Marquez documentation: https://marquezproject.ai/resources/deployment/
The answer is that Marquez uses a postgres instance to persist the metadata it is given. Thanks for your time though! I appreciate the effort đ
Hello team, for the OpenLineage Spark integration: even when I processed one Spark SQL query (CTAS - Create Table As Select), I received multiple events back (2+ Start events, 2 Complete events). I am trying to understand why OpenLineage needs to send back that many events, and what the primary difference is between the Start vs. Start events and the Start vs. Complete events. Do we have any doc that can help me understand more about it? Thanks
*Thread Reply:* The Spark execution model follows the Spark listener events (e.g. SparkListenerSQLExecutionStart, SparkListenerJobStart, SparkListenerJobEnd, SparkListenerSQLExecutionEnd), and the integration emits an OpenLineage event for each of them.
You should collect all of these events in order to be sure you are receiving all the data since each event may contain a subset of the complete facets that represent what occurred in the job.
*Thread Reply:* Thanks @Will Johnson. Can I get an example of how the proposed plan can be used to distinguish between start and job start events? Because when I compare the 2 start events I got, only the event_time is different; all other information is the same.
*Thread Reply:* One followup question, if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expected for (1). 1 Spark SQL execution start event (2). 3 Spark job start event (Each query has a job start event ) (3). 3 Spark job end event (Each query has a job end event ) (4). 1 Spark SQL execution end event
*Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.
Re: Multiple Queries in One Command: This is where Spark's execution model comes into play. I believe each one of those commands are executed sequentially and as a result, you'd actually get three execution start and three execution end. If you chose DROP + Create Table As Select, that would be only two commands and thus only two execution start events.
*Thread Reply:* Thanks a lot for your help 🙂 @Will Johnson,
For multiple queries in one command, I am still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.
When I test a Drop + Create Table query:
DROP TABLE IF EXISTS shadow_test.test_sparklineage_4; CREATE TABLE IF NOT EXISTS shadow_test.test_sparklineage_4 (val INT, region STRING) PARTITIONED BY ( ds STRING ) STORED AS PARQUET;
I only received 1 start + 1 complete event, and the events only contain DropTableCommandVisitor/DropTableCommand.
I expected we should also receive start and complete events for the CreateTable query with CreateTableCommandVisitor/CreateTableCommand.
But when I test a Drop + Create Table As Select query:
DROP TABLE IF EXISTS shadow_test.test_sparklineage_5; CREATE TABLE IF NOT EXISTS shadow_test.test_sparklineage_5 AS SELECT * from shadow_test.test_sparklineage where ds > '2022-08-24'
I received 1 start + 1 complete event with DropTableCommandVisitor/DropTableCommand, and 2 start + 2 complete events with CreateHiveTableAsSelectCommandVisitor/CreateHiveTableAsSelectCommand
*Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?
I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.
*Thread Reply:* > For multiple queries in one command, I still have a confused place why Drop + CreateTable and Drop + CreateTableAsSelect act different.
@Hanbing Wang That's basically why we capture all the events (SQL Execution, Job) instead of one of them. We're just inconsistently notified of them by Spark.
Some computations emit SQL Execution events, some emit Job events, I think majority emits both. This also differs by spark version.
The solution OpenLineage assumes is having cumulative model of job execution, where your backend deals with possible duplication of information.
> I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well. @Will Johnson would be great if you created issue with some complete examples
*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help. We are not running on Databricks. We implemented the OpenLineage Spark listener and customized the Event Transport, which emits the events to our own events pipeline, with a Hive metastore. We are using Spark version 3.2.1 and OpenLineage version 0.14.1
*Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event đ You may need to run your spark cluster in debug mode to step through the Spark Listener.
*Thread Reply:* @Maciej Obuchowski - I'll add it to my list!
*Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.
Hi team, I find OpenLineage posts a lot of run events to the backend.
eg. I submit a jar to the Spark cluster with computations like
*Thread Reply:* One of the assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages, like sending events for jobs which suddenly fail, sending events immediately, etc.
The events can then be merged at the backend side. The behavior you describe can be achieved by using backends like Marquez and the Marquez API to obtain the combined data.
Currently, we're developing a column-lineage dedicated endpoint in Marquez according to the proposal: https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md This will allow you to request the whole column lineage graph based on multiple jobs.
Is there a provision to include additional MDC properties as part of openlineage ? Or something like sparkSession.sparkContext().setLocalProperties("key","value")
*Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.
*Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[âŠ]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java
You can create a facet that could extract out the properties that you might set from within the spark session.
I don't think OpenLineage / a Spark Listener can affect the SparkSession itself so you wouldn't be able to SET the properties in the listener.
*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the Spark job group id as the openlineage parent runId if there is no parent run id set. sc.setJobGroup("myjobgroupid", "job description goes here") sets the value in Spark as setLocalProperty(SparkContext.SPARK_JOB_GROUP_ID, group_id).
I'd like to use myjobgroupid as the openlineage parent run id
*Thread Reply:* MDC is an ability to add extra key -> value pairs to a log entry, while not doing this within message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?
@srutikanta hota What information would you like to include? There is a great chance we already have some fields for that. If not, it's still worth putting it in the right place: is this info job-specific, run-specific, or does it relate to some of the input / output datasets?
*Thread Reply:* @srutikanta hota sounds like you want to set up spark.openlineage.parentJobName and spark.openlineage.parentRunId - see https://openlineage.io/docs/integrations/spark/
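If the parent is known when the context is created, a minimal sketch would be something like the following (conf keys as in the doc above; the parent values are made-up examples) - though, as noted below, this does not fit a long-running context where the parent changes per job:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.parentJobName", "my_scheduler.my_job")                 # hypothetical parent job
    .config("spark.openlineage.parentRunId", "d46e465b-d358-4d32-83d4-df660ff614dd")  # hypothetical parent run id
    .getOrCreate()
)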
*Thread Reply:* @… we have a long-running spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We are submitting the job with a Spark job group id. I'd like to use the group id as the parentRunId
https://spark.apache.org/docs/1.6.1/api/R/setJobGroup.html
Hi team - I am from Matillion and we would like to build support for openlineage. Who would be best placed to move the conversation with my product team?
*Thread Reply:* Hi Trevor, thank you for reaching out. I'd be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.
Hi Everyone! Would anybody be interested in participating in MANTA OpenLineage connector testing? We are especially looking for an environment with a rich Airflow implementation, but we will be happy to test on any other OL producer technology. Send me a direct message for more information. Thanks, Petr
Question about Apache Airflow that I think folks here would know, because doing a web search has failed me:
Is there a way to interact with Apache Airflow to retrieve the contents of the files in the sql directory, but NOT to run them?
(the APIs all seem to run sql, and when I search I just get "how to use the airflow API to run queries")
*Thread Reply:* Is this in the context of an OpenLineage extractor?
*Thread Reply:* Yes! I was specifically looking at the PostgresOperator
*Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn't need to use Airflow's SnowflakeOperator to get lineage, we'd use the method on the openlineage blog)
*Thread Reply:* The extractor for the SQL operators gets the query like this: https://github.com/OpenLineage/OpenLineage/blob/45fda47d8ef29dd6d25103bb491fb8c443[âŠ]gration/airflow/openlineage/airflow/extractors/sql_extractor.py
*Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...
*Thread Reply:* aha! I'm not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907
*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly: https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/postgres/operators/postgres.py#L58
*Thread Reply:* yeah 🙁 I couldn't find a way to make that work as an end-user.
*Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.
*Thread Reply:* I don't know enough about the airflow internals đ
*Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.
*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow operator and do something with it other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that operators render, and use it to extract additional information like table schema straight from the database
(I'm also ok with a way to get the SQL that has been run - but from Airflow, not the data source - I'm looking for a db-neutral way to do this, otherwise I can just parse query logs on any specific db system)
đ are there any docs on how the listener hooks in and gets run with openlineage-airflow? trying to write some unit tests but no docs seem to exist on the flow.
*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443 https://docs.google.com/document/d/1L3xfdlWVUrdnFXng1Di4nMQYQtzMfhvvWDR9K4wXnDU/edit
*Thread Reply:* amazing thank you I will take a look
@channel Hello everyone, I'm opening up a vote on releasing OpenLineage 0.15.0, including:
• an improved development experience in the Airflow integration
• updated proposal and integration templates
• a change to the BigQuery client in the Airflow integration
• plus bug fixes across the project.
3 +1s from committers will authorize an immediate release. For all the commits, see: https://github.com/OpenLineage/OpenLineage/compare/0.14.0...HEAD. Note: this will be the last release to support Airflow 1.x! Thanks!
*Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0
*Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x
*Thread Reply:* just caught this myself. I'll make the change
*Thread Reply:* we're still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?
*Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?
We want to remove 1.10 integration because for multiple PRs, maintaining compatibility with it takes a lot of time; the code is littered with checks like this.
if parse_version(AIRFLOW_VERSION) >= parse_version("2.0.0"):
*Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.
*Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.
👋 what would be a possible reason for the built-in airflow backend being utilized instead of a custom wrapper over airflow.lineage.Backend? double-checked the [lineage] key in our airflow.cfg. there doesn't seem to be any errors being thrown and the object loads 🤔
*Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1
*Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984
Consensus of Airflow maintainers wasn't positive about changing this interface, so we went with another direction: https://github.com/apache/airflow/pull/20443
*Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91
*Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something
*Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it
*Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.
*Thread Reply:* @Maciej Obuchowski ok, after checking we are emitting events with our custom backend, but an odd thing is an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔
it ends up with requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url immediately after task start. but by the end, on task success/failure, it emits the event with our custom backend, both RunState.COMPLETE and RunState.START, into our own pipeline.
*Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is the OpenLineagePlugin that automatically registers via setup.py entrypoint: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30
*Thread Reply:* I think if you want to extend it with proprietary code there are two good options.
First, if your code only needs to touch the HTTP client side - which I guess is the case due to the 401 error - then you can create a custom Transport.
Second, you fork the OL code and create your own package, without the entrypoint script or with adding your own, if you decide to extend OpenLineagePlugin instead of LineageBackend.
*Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.
we're trying to not fork and instead opt with extending.
*Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70
*Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport, is there a way i can define where openlineage-python should look for the custom transport? e.g. a different path
because from the docs i can't tell, except for the file i'm supposed to copy and implement.
*Thread Reply:* @Paul Lee you should derive from the Transport base class and register the type as the full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2
*Thread Reply:* your custom transport should also define a custom class Config, and this class should implement a from_dict method
*Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47
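Going by the factory code linked above, a custom transport sketch could look roughly like this (module path, field names and the YAML snippet are assumptions, not a documented recipe):

# openlineage.yml (type is the full import path to the transport class):
#   transport:
#     type: my_company.lineage.MyTransport
#     url: https://lineage.example.com
#     api_key: "..."
from openlineage.client.transport.transport import Config, Transport

class MyConfig(Config):
    def __init__(self, url: str, api_key: str = None):
        self.url = url
        self.api_key = api_key

    @classmethod
    def from_dict(cls, params: dict) -> "MyConfig":
        # build the config object from the parsed YAML section
        return cls(url=params["url"], api_key=params.get("api_key"))

class MyTransport(Transport):
    kind = "my_company"   # informational; loading happens via the import path in the YAML
    config = MyConfig

    def __init__(self, config: MyConfig):
        self.url = config.url
        self.api_key = config.api_key

    def emit(self, event):
        # serialize the event (e.g. with openlineage.client.serde.Serde) and ship it
        # to your own pipeline here
        ...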
*Thread Reply:* and I know we need to document this better đ
*Thread Reply:* amazing, thanks for all your help đ +1 to the docs, if i have some time when done i will push up some docs to document what i've done
*Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review đ
@channel Hi everyone, opening a vote on a release (0.15.1) to add #1131 to fix the release process on CI. 3 +1s from committers will authorize an immediate release. Thanks. More details are here: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
*Thread Reply:* Thanks, all. The release is authorized.
@channel OpenLineage 0.15.1 is now available!
We added:
• Airflow: improve development experience #1101 @JDarDagran
• Documentation: update issue templates for proposal & add new integration template #1116 @rossturk
• Spark: add description for URL parameters in readme, change overwriteName to appName #1130 @tnazarew
We changed:
• Airflow: lazy load BigQuery client #1119 @mobuchowski
Many bug fixes were also included in this release.
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.15.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.14.1...0.15.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Is there a topic you think the community should discuss at the next OpenLineage TSC meeting? Reply or DM with your item, and we'll add it to the agenda.
*Thread Reply:* would love to add improvement in docs :) for newcomers
*Thread Reply:* Technical Steering Committee, but it's open to everyone
*Thread Reply:* and we encourage newcomers to attend
has anyone seen their COMPLETE/FAILED listeners not firing on Airflow 2.3.4 but START events do emit? using openlineage-airflow 0.14.1
*Thread Reply:* is there any error/warn message logged maybe?
*Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.
but on SUCCESS nothing fires.
*Thread Reply:* which makes me believe the listeners themselves aren't being utilized? đ€
*Thread Reply:* uhm, any chance you're experiencing this with custom extractors?
*Thread Reply:* I'd be happy to jump on a quick call if you wish
*Thread Reply:* but in more EU friendly hours đ
*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.
@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, October 13 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:
hello all. I am trying to run the airflow example from here
I changed the Marquez web port from 5000 to 15000 but when I start the docker images, it seems to always default to port 5000 and therefore when I go to localhost:3000, the jobs don't load up as they are not able to connect to the marquez app running in 15000. I've overriden the values in docker-compose.yml and in openLineage.env but it seems to be picking up the 5000 value from some other location.
This is what I see in the logs. Any pointers on this or please redirect me to the appropriate channel. Thanks!
INFO [2022-10-07 10:48:58,022] org.eclipse.jetty.server.AbstractConnector: Started application@782fd504{HTTP/1.1, (http/1.1)}{0.0.0.0:5000}
INFO [2022-10-07 10:48:58,034] org.eclipse.jetty.server.AbstractConnector: Started admin@1537c744{HTTP/1.1, (http/1.1)}{0.0.0.0:5001}
Hi #general - @Will Johnson and I are working on adding support for Snowflake to OL, and as we were going to specify the package under the compileOnly dependencies in gradle, we had some doubts looking at the existing dependencies. Taking BigQuery as an example, we see it's included as a dependency in both the shared build.gradle file and in the app build.gradle file. We're a bit confused about the following:
1. We see the bigQueryNodeVisitor, but we couldn't spot where it's being used within shared.
2. The versionsMap in the app build.gradle allows for different combinations of spark and scala versions. Why is this so?
Thank you in advance!
*Thread Reply:* Hi @Hanna Moazam,
Within recent PR https://github.com/OpenLineage/OpenLineage/pull/1111, I removed BigQuery dependencies from the spark2, spark32 and spark3 subprojects. It has to stay in shared because of BigQueryNodeVisitor. The usage of BigQueryNodeVisitor is tricky, as we never know if the bigquery classes are available at runtime or not. The check is done in io.openlineage.spark.agent.lifecycle.BaseVisitorFactory:
if (BigQueryNodeVisitor.hasBigQueryClasses()) {
  list.add(new BigQueryNodeVisitor(context, factory));
}
Regarding point 2, there were some Spark versions which allowed two Scala versions (2.11 and 2.12). Then it makes sense to make it configurable. On the other hand, for Spark 3.2 we only support 2.12, which is hardcoded in build.gradle.
The idea of the app project is: let's create a separate project to aggregate all the dependencies and run integration tests on it. Subprojects spark2, spark3, etc. do depend on shared. Putting integration tests in shared would create an additional opposite-way dependency, which we wanted to avoid.
*Thread Reply:* So, if we wanted to add Snowflake, we would need to:
*Thread Reply:* Yes. Please note that snowflake library will not be included in target OpenLineage jar. So you may test it manually against multiple Snowflake library versions or even adjust code in case of minor differences.
*Thread Reply:* Basically the same pattern you've already done with Kusto đ https://github.com/OpenLineage/OpenLineage/blob/a96ecdabe66567151e7739e25cd9dd03d6[âŠ]va/io/openlineage/spark/agent/lifecycle/BaseVisitorFactory.java
*Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)
Hi Community,
I was going through the code of the dbt integration with OpenLineage. Once the events have been emitted from the client code, I wanted to check the server code where the events are read and the lineage is formed. Where can I find that code?
Thanks
*Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez
This month's OpenLineage TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1665084207602369
Is there anyone in the OpenLineage community in San Diego? I'll be there Nov 1-3 and would love to meet some of y'all in person
đ is there a way to define a base extractor to be defaulted to? for example, i'd like to have all our operators (50+) default to my custom base extractor instead of having a list of 50+ operators in get_operator_classnames
I don't think that's possible yet, as the extractor checks are based on the class name... and it wouldn't check which parent operator has it inherited from.
đą ok, i would contribute upstream but unfortunately we're still on 1.10.15. looking like we might have to hardcode for a bit.
is this the correct assumption? we're still on 0.14.1 ^
If you'll move to 2.x series and OpenLineage 0.16, you could use this feature: https://github.com/OpenLineage/OpenLineage/pull/1162
thanks @Maciej Obuchowski we're working on it. hoping we'll land on 2.3.4 in the coming month.
đ Hi everyone!
*Thread Reply:* Hey @Austin Poulton, welcome! đ
*Thread Reply:* thanks Harel đ
@channel
Hi everyone, I'm opening a vote to release OpenLineage 0.16.0, featuring:
• support for boolean arguments in the DefaultExtractor
• a more efficient get_connection_uri method in the Airflow integration
• a reorganized, Rust-based SQL integration (easing the addition of language interfaces in the future)
• bug fixes and more.
3 +1s from committers will authorize an immediate release. Thanks. More details are here:
https://github.com/OpenLineage/OpenLineage/compare/0.15.1...HEAD
*Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.
Anybody with a success use-case of ingesting column-level lineage into amundsen?
*Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works
*Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?
Hi everyone, you might notice Dependabot opening PRs to update dependencies now that it's been configured and turned on (https://github.com/OpenLineage/OpenLineage/pull/1182). There will probably be a large number of PRs to start with, but this shouldn't always be the case and we can change the tool's behavior, as well. (Some background: this will help us earn the OSSF Silver badge for the project, which will help us advance in the LFAI.)
@channel I'm opening a vote to release OpenLineage 0.16.1 to fix an issue in the SQL integration. This release will also include all the commits announced for 0.16.0. 3 +1s from committers will authorize an immediate release. Thanks.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.
@channel
OpenLineage 0.16.1 is now available, featuring:
Additions:
• Airflow: add dag_run information to Airflow version run facet #1133 @fm100
• Airflow: add LoggingMixin to extractors #1149 @JDarDagran
• Airflow: add default extractor #1162 @mobuchowski
• Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
• SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
Changes:
• Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
• Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
Bug fixes and more!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.16.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.15.1...0.16.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Are there any tutorials or documentation on how to create an OpenLineage connector? For example, what if we use Argo Workflows instead of Apache Airflow for orchestrating ETL jobs? How would we create an OpenLineage Argo Workflows connector? How much effort, roughly? And can people contribute such connectors to the community if they create one?
*Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. We have somewhat of a start of a doc: https://openlineage.io/docs/development/developing/
Here we have an example of using Python OL client to emit OL events: https://openlineage.io/docs/client/python#start-docker-and-marquez
> How much efforts, roughly? I'm not familiar with Argo workflows, but usually the effort needed depends on extensibility of the underlying system. From the first look, Argo looks like it has sufficient mechanisms for that: https://argoproj.github.io/argo-workflows/executor_plugins/#examples-and-community-contributed-plugins
Then, it depends if you can get the information that you need in that plugin. Basic need is to have information from which datasets the workflow/job is reading and to which datasets it's writing.
> And can people contribute such connectors to the community if they create one? Definitely! And if you need help with anything OpenLineage feel free to write here on Slack
Is there a topic you think the community should discuss at the next OpenLineage TSC meeting? Reply or DM with your item, and we'll add it to the agenda.
@channel This month's OpenLineage TSC meeting is next Thursday, November 10th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:
Hi all 👋 I'm Kenton - a Software Engineer and founder of Swiple. I'm looking forward to working with OpenLineage and its community to integrate data lineage and data observability. https://swiple.io
*Thread Reply:* Welcome Kenton! Happy to help đ
Hi everyone, we wanted to pass some dynamic metadata from a Spark job that we can pick up in the OpenLineage event and use for processing. Presently I have seen that we have a few conf parameters (like the openlineage params) that we can only send with the Spark conf. Is there any other option where we can send some information dynamically from the Spark jobs?
*Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration
*Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with rest of openlineage event fields in our system
*Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.
*Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it
*Thread Reply:* @Maciej Obuchowski Do you mean adding something like spark.openlineage.jobFacet.FacetName.Key=Value to the spark conf should add a new job facet like
```
"FacetName": {
  "Key": "Value"
}
```
*Thread Reply:* We can argue about name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets
*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?
@channel This month's TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1667512998061829
Hi #general, quick question: do we plan to disable spark 2 support in the near future?
Longer question: I've recently made a PR (https://github.com/OpenLineage/OpenLineage/pull/1231) to support capturing lineage from Snowflake, but it fails at a specific integration test due to what we think is a dependency mismatch for guava. I've tried to exclude any transient dependencies which may cause the problem but no luck with that so far.
Just wondering if:
```
java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
    at io.openlineage.spark.agent.lifecycle.LibraryTest.testRdd(LibraryTest.java:113)
```
Thanks in advance!
*Thread Reply:* What if we just don't include it in the BaseVisitorFactory, but only in the Spark3 visitor factories?
Quick question: how do I get the <<non-serializable Time... to show in the extraction? Or really any object that gets passed in.
*Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51
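If it helps, one possible workaround is to convert the non-serializable object into a plain string (or dict) yourself before it lands in a facet, so the client's Serde never has to handle the raw object. A minimal sketch, with a made-up facet name and a stand-in class for the opaque object:
```python
from datetime import datetime

import attr
from openlineage.client.facet import BaseFacet


class OpaqueTime:
    """Stand-in for a non-serializable object coming from the task."""
    def __str__(self) -> str:
        return datetime(2022, 11, 14, 10, 0).isoformat()


@attr.s
class TimingFacet(BaseFacet):
    # store a plain string instead of the raw object
    startedAt: str = attr.ib()


facet = TimingFacet(startedAt=str(OpaqueTime()))
```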
Is there a way I can update the dataset description and the column descriptions while generating the OpenLineage Spark events and column lineage?
*Thread Reply:* I don't think this is possible at the moment.
Hey all, I'd like to ask for a release for OpenLineage. #1256 fixes a bug in DefaultExtractor. This blocks people from migrating code from custom extractors to get_openlineage_facets methods.
*Thread Reply:* Thanks, all. The release is authorized.
*Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306
Hi, small question: Is it possible to disable the /api/{version}/lineage suffix that gets added to every URL automatically? Thanks!
*Thread Reply:* I think we had a similar request before, but nothing was implemented.
@channel
OpenLineage 0.17.0 is now available, featuring:
Additions:
• Spark: support latest Spark 3.3.1 #1183 @pawel-big-lebowski
• Spark: add Kinesis Transport and support config Kinesis in Spark integration #1200 @yogyang
• Spark: disable specified facets #1271 @pawel-big-lebowski
• Python: add facets implementation to Python client #1233 @pawel-big-lebowski
• SQL: add Rust parser interface #1172 @StarostaGit @mobuchowski
• Proxy: add helm chart for the proxy backend #1068 @wslulciuc
• Spec: include possible facets usage in spec #1249 @pawel-big-lebowski
• Website: publish YML version of spec to website #1300 @rossturk
• Docs: update language on nominating new committers #1270 @rossturk
Changes:
• Website: publish spec into new website repo location #1295 @rossturk
• Airflow: change how pip installs packages in tox environments #1302 @JDarDagran
Removals:
• Deprecate HttpTransport.Builder in favor of HttpConfig #1287 @collado-mike
Bug fixes and more!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.17.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.16.1...0.17.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi everyone,
I'm trying to get the lineage of a dataset per version. I initially had something like
Dataset A -> Dataset B -> DataSet C (version 1)
then:
Dataset D -> Dataset E -> DataSet C (version 2)
I can get the graph for version 2 without problems, but I'm wondering if there's any way to retrieve the entire graph for DataSet C version 1.
Thanks
*Thread Reply:* It's kind of a hard problem on the UI side. The backend can express that relationship.
*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with a dataset in the node ID, e.g., nodeId=dataset:my_dataset. Where could I specify the version of my dataset?
How do we get the actual values from macros? E.g. a schema name is passed in with {{params.table_name}} and that's what shows in lineage instead of the actual table name.
*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code, or preferably logs?
*Thread Reply:* If you're on 1.10 then I think it won't work
*Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.
cc. @Eli Schachar @Allison Suarez
*Thread Reply:* is there no workaround we can make work?
*Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?
Hey, quick question: I see there is Kafka transport in the java client, but it's not supported in the spark integration, right?
*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.
Either in the URL version https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L48
Or as a separate Spark config entry: https://github.com/OpenLineage/OpenLineage/blob/182d2e5a907e6602f7fe132e07ea569c7e[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
How can we auto-instrument a dataset owner at the Java agent level? Is there any Spark property available?
Is there a way to capture the business date when we run a job for a prior day? For example, running yesterday's missed job today, or processing Friday's file on Monday because we received the file late from the vendor, etc.
*Thread Reply:* I think that's what NominalTimeFacet covers
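For illustration, that facet carries the nominal (business) time separately from the actual run time, so a job run today for yesterday's date could look roughly like this (values invented; the real facet also carries _producer and _schemaURL fields):
```json
{
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {
      "nominalTime": {
        "nominalStartTime": "2022-11-06T00:00:00Z",
        "nominalEndTime": "2022-11-07T00:00:00Z"
      }
    }
  }
}
```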
Hello team, I want to capture data lineage from Airflow, but I'm having trouble understanding the docs. Please let me know if someone has clearer documentation.
*Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?
*Thread Reply:* i am using airflow 2.x
*Thread Reply:* can we connect if you have time ?
*Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20
*Thread Reply:* i already set configuration in airflow.cfg file
*Thread Reply:* where are you sending the events to?
*Thread Reply:* i have a docker machine on which marquez is working
*Thread Reply:* so, what is the issue you are seeing?
*Thread Reply:* ```
[lineage]
MARQUEZ_BACKEND=HTTP
MARQUEZ_URL=http://10.36.37.178:5000
MARQUEZ_NAMESPACE=airflow
```
*Thread Reply:* I have set the above config.
*Thread Reply:* Please let me know if anything else needs to be done.
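For Airflow 2.x with openlineage-airflow, the documented setup uses environment variables rather than the Marquez-era [lineage] keys; a sketch, reusing the Marquez URL and namespace from above (on Airflow 2.1-2.2 the lineage backend setting is also needed; on 2.3+ the integration's plugin registers itself):
```
OPENLINEAGE_URL=http://10.36.37.178:5000
OPENLINEAGE_NAMESPACE=airflow
# Airflow 2.1-2.2 only:
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
```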
Hey, I wonder if somebody can link me to the lineage (table lineage) event schema?
*Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/
Hello Team, I am from Genpact Data Analytics team, we are looking for demo of your product
Hello all, I'm calling for a vote on releasing OpenLineage 0.18.0, including: • improvements to the Spark integration, • extractors for Sagemaker operators and SFTPOperator in the Airflow integration, • a change to the Databricks integration to support Databricks Runtime 11.3, • new governance docs, • bug fixes, • and more. Three +1s from committers will authorize an immediate release.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.
@channel This month's OpenLineage TSC meeting is next Thursday, December 8th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:
Hello everyone! General question here, aside from âconsumerâ orgs/integrations (dbt/dagster/manta), is anyone aware of any enterprise organizations that are leveraging OpenLineage today? Example lighthouse brands?
*Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/
*Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.
That means we have thousands of companies having experimented with OpenLineage and Microsoft Purview.
We can't name any customers at this point unfortunately.
@channel This month's TSC meeting is tomorrow at 10 am PT. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1669925470878699
*Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?
*Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson
*Thread Reply:* This is starting now. CC @Will Johnson
*Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions
*Thread Reply:* feel free to follow up here as well
*Thread Reply:* ascii art to the rescue! (top "depends on" bottom)
```
                   app
      /       /         \        \
  spark2   spark3   spark32   spark33
      \       \         /        /
                  shared
```
*Thread Reply:* (btw, we should have written datakin to output ascii art; it's obviously the superior way to generate graphs)
*Thread Reply:* Hi, is there a recording for this meeting?
Hi! I have a basic question about the naming conventions for blob storage. The spec is not totally clear to me. Is the convention to use (1) namespace=bucket name=bucket+path or (2) namespace=bucket name=path?
*Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?
*Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.
*Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer
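For example, following the documented GCS convention with a made-up bucket and path, a dataset reference would look like (namespace = the bucket URI, name = the object path within it):
```json
{ "namespace": "gs://my-bucket", "name": "warehouse/orders/2022-12-08" }
```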
@channel
OpenLineage 0.18.0 is available now, featuring:
• Airflow: support SQLExecuteQueryOperator #1379 @JDarDagran
• Airflow: introduce a new extractor for SFTPOperator #1263 @sekikn
• Airflow: add Sagemaker extractors #1136 @fhoda
• Airflow: add S3 extractor for Airflow operators #1166 @fhoda
• Spec: add spec file for ExternalQueryRunFacet #1262 @howardyoo
• Docs: add a TSC doc #1303 @merobi-hub
• Plus bug fixes.
Thanks to all our contributors, including new contributor @Faisal Hoda!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.18.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.17.0...0.18.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
1) Is there a specification to capture dataset dependency, e.g. ds1 is dependent on ds2?
*Thread Reply:* Dataset dependencies are represented through common relationship with a Job - e.g., the task that performed the transformation.
*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specification? For example, to define that dataset 1 depends on table 1 and table 2, which can be represented as separate datasets.
*Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.
*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!
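To make the shape concrete: a dependency like "dataset_1 is built from table_1 and table_2" is expressed through whatever job materializes it, e.g. an event along these lines (all names invented):
```json
{
  "eventType": "COMPLETE",
  "eventTime": "2022-12-12T10:00:00Z",
  "producer": "https://example.com/my-producer",
  "run": { "runId": "0e56b6d2-5cfd-4f81-9d6c-0e4c4b5ebd21" },
  "job": { "namespace": "my-namespace", "name": "build_dataset_1" },
  "inputs": [
    { "namespace": "my-warehouse", "name": "db.table_1" },
    { "namespace": "my-warehouse", "name": "db.table_2" }
  ],
  "outputs": [
    { "namespace": "my-warehouse", "name": "db.dataset_1" }
  ]
}
```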
Hi everyone, I'd like to use openlineage to capture column level lineage for spark. I would also like to capture a few custom environment variables along with the column lineage. May I know how this can be done? Thanks!
*Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark
*Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.
*Thread Reply:* I am already able to capture column lineage
*Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage
*Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java
*Thread Reply:* but it is only used at the moment to capture some databricks environment attributes
*Thread Reply:* So you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.
You can also have a look at the extending section of the Spark integration docs (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending) and create a class that adds a run facet builder according to your needs.
*Thread Reply:* The third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems like a really cool feature.
*Thread Reply:* That is great! Thank you so much! This really helps!
*Thread Reply:* ```
List<String> dbPropertiesKeys =
    Arrays.asList(
        "orgId",
        "spark.databricks.clusterUsageTags.clusterOwnerOrgId",
        "spark.databricks.notebook.path",
        "spark.databricks.job.type",
        "spark.databricks.job.id",
        "spark.databricks.job.runId",
        "user",
        "userId",
        "spark.databricks.clusterUsageTags.clusterName",
        "spark.databricks.clusterUsageTags.azureSubscriptionId");
dbPropertiesKeys.stream()
    .forEach(
        (p) -> {
          dbProperties.put(p, jobStart.properties().getProperty(p));
        });
```
It seems like it is obtaining this env variable information from the jobStart object, but not capturing it from the environment directly?
*Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419
*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks! https://github.com/OpenLineage/OpenLineage/pull/1545
*Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.
*Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!
*Thread Reply:* @Maciej Obuchowski ^
*Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?
23/02/06 12:18:39 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.spark.agent.EventEmitter.<init>(EventEmitter.java:39)
at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:276)
at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:80)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1433)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
*Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...
*Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose
*Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?
*Thread Reply:* tests failing isn't expected behaviour
*Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.
*Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then
*Thread Reply:* I have pushed just now
*Thread Reply:* You can run the tests
*Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.
*Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...
Hi all, I am sending lineage from ADF for each activity which I am performing, and the individual activities are represented correctly. How can I represent task1 as a parent of task2? Can someone please share a sample JSON request for it?
*Thread Reply:* Hi! This would require a series of JSON calls:
*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so:
• you create a relationship between datasets by referring to them in the same job - i.e., this task ran that read from these datasets and wrote to those datasets
• you create a relationship between tasks by referring to the same datasets across both of them - i.e., this task wrote that dataset and this other task read from it
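As a sketch of that second pattern (all names are placeholders): task1 declares the dataset as an output, and task2 declares the same dataset (same namespace + name) as an input, which is what links the two jobs in the graph:
```json
[
  {
    "eventType": "COMPLETE",
    "eventTime": "2022-12-20T10:00:00Z",
    "producer": "https://example.com/adf-emitter",
    "run": { "runId": "5a1f3c1e-6f3c-4c0e-9a3e-111111111111" },
    "job": { "namespace": "adf", "name": "pipeline.task1" },
    "outputs": [ { "namespace": "abfss://container@account", "name": "lookup/data" } ]
  },
  {
    "eventType": "COMPLETE",
    "eventTime": "2022-12-20T10:05:00Z",
    "producer": "https://example.com/adf-emitter",
    "run": { "runId": "5a1f3c1e-6f3c-4c0e-9a3e-222222222222" },
    "job": { "namespace": "adf", "name": "pipeline.task2" },
    "inputs": [ { "namespace": "abfss://container@account", "name": "lookup/data" } ]
  }
]
```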
*Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.
(it's an airflow workshop, but those examples are for a part of the workshop that doesn't involve airflow)
*Thread Reply:* (these can also be found in the docs)
*Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it
*Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.
*Thread Reply:* @Ross Turk - Thanks for your response.
*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS. Second activity: after the completion of the first activity, I make an Azure Databricks call to use the lookup file and generate the output tables. How can I refer to the Databricks-generated tables' facets as an input to the subsequent activities in the pipeline? When I refer to them as an input, the Spark table metadata is not showing up. How can this be achieved? After the execution of each activity in the ADF pipeline I am sending start and complete/fail event lineage to Marquez.
Can someone please guide me on this.
I am not using Airflow in my process. Please advise.
Hi all, good morning. How is the column lineage of a data source represented in OpenLineage when it is produced by different teams and jobs?
Hey folks! I'm al from Koii.network, very happy to have heard about this project :)
*Thread Reply:* welcome! letâs us know if you have any questions
Hello! I found the OpenLineage project today after searching for âOpenTelemetryâ in the dbt Slack.
*Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions
Hi guys - I am really excited to test open lineage. I had a quick question, sorry if this is not the right place for it. We are testing dbt-ol with airflow and I was hoping this would by default push the number of rows updated/created in that dbt transformation to marquez. It runs fine on airflow, but when I check in marquez there doesn't seem to be a 'dataset' created, only 'jobs' with job level metadata. When i check here I see that the dataset facets should have it though https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md Does anyone know if creating a dataset & sending row counts to OL is out of the box on dbt-ol or if I need to build another script to get that number from my snowflake instance and push it to OL as another step in my process? Thanks a lot!
*Thread Reply:* @Ross Turk maybe you can help with this?
*Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567
*Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.
*Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?
*Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.
Hi All, Curious to see if there is an openlineage integration with luigi or any open source projects working on it.
*Thread Reply:* I was looking for something similar to the airflow integration
*Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?
*Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi
Hello all, I'm opening a vote to release OpenLineage 0.19.0, including:
• new extractors for Trino and S3FileTransformOperator in the Airflow integration
• a new, standardized run facet in the Airflow integration
• a new NominalTimeRunFacet and OwnershipJobFacet in the Airflow integration
• Postgres support in the dbt integration
• a new client-side proxy (skeletal version)
• a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration
• a new ExtractionErrorRunFacet to reflect internal processing errors for the SQL parser
• testing improvements, bug fixes and more.
As always, three +1s from committers will authorize an immediate release. Thanks in advance!
*Thread Reply:* Hi @Michael Robinson, regarding "a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration": would it be possible to have more details on what this entails, please? Thanks!
*Thread Reply:* @Tomasz Nazarewicz might explain this better
*Thread Reply:* @Anirudh Shrinivason Until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration, because the integration had to parse all properties, create the appropriate objects, etc. The new implementation makes client properties transparent to the integration: they are only passed through, and parsing happens inside the client.
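For reference, the kind of thing this enables is passing transport settings straight through the Spark conf to the client, e.g. (a sketch; the exact key names follow the client's transport config, so check the docs for your version):
```
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://localhost:5000
spark.openlineage.transport.endpoint=/api/v1/lineage
```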
*Thread Reply:* Thanks, all. The release is authorized and will commence shortly.
*Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!
@channel This month's OpenLineage TSC meeting is next Thursday, January 12th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:
*Thread Reply:* @Michael Robinson Will there be a recording?
*Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
OpenLineage 0.19.2 is available now, including:
• Airflow: add Trino extractor #1288 @sekikn
• Airflow: add S3FileTransformOperator extractor #1450 @sekikn
• Airflow: add standardized run facet #1413 @JDarDagran
• Airflow: add NominalTimeRunFacet and OwnershipJobFacet #1410 @JDarDagran
• dbt: add support for postgres datasources #1417 @julienledem
• Proxy: add client-side proxy (skeletal version) #1439 #1420 @fm100
• Proxy: add CI job to publish Docker image #1086 @wslulciuc
• SQL: add ExtractionErrorRunFacet #1442 @mobuchowski
• SQL: add column-level lineage to SQL parser #1432 #1461 @mobuchowski @StarostaGit
• Spark: pass config parameters to the OL client #1383 @tnazarew
• Plus bug fixes and testing and CI improvements.
Thanks to all the contributors, including new contributor Saurabh (@versaurabh)
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.19.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.18.0...0.19.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Question on Spark Integration and External Hive Metastores
@Hanna Moazam and I are working with a team using OpenLineage and wants to extract out the server name of the hive metastore they're using when writing to a Hive table through Spark.
For example, the hive metastore is an Azure SQL database and the table name is sales.transactions.
OpenLineage will give something like /usr/hive/warehouse/sales.db/transactions for the name.
However, this is not a complete picture since sales.db/transactions is defined like this for a given hive metastore. In Hive, you'd define the fully qualified name as sales.transactions@sqlservername.database.windows.net .
Has anyone else come across this before? If not, we plan on raising an issue and suggesting we extract out the spark.hadoop.javax.jdo.option.ConnectionURL in the DatabricksEnvironmentFacetBuilder but ideally there would be a better way of extracting this.
There was an issue by @Maciej Obuchowski or @Paweł Leszczyński that talked about providing a facet of the alias of a path but I can't find it at this point :(
*Thread Reply:* Hi @Hanna Moazam, we've written Jupyter notebook to demo dataset symlinks feature: https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb
For the scenario you describe, there should be a symlink facet sent similar to:
{
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.15.1/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
"identifiers": [
{
"namespace": "<hive://metastore>",
"name": "default.some_table",
"type": "TABLE"
}
]
}
Within Openlineage Spark integration code, symlinks are included here:
https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PathUtils.java#L75
and they are added only when the Spark catalog is hive and the metastore URI is present in the Spark conf.
*Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!
@channel Friendly reminder: this month's OpenLineage TSC meeting is tomorrow at 10am, and all are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1672933029317449
Hi, are there any plans to add an Azure EventHub transport similar to the Kinesis one?
*Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-stream-analytics
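For reference, the Kafka client properties Event Hubs expects are the standard ones from the Microsoft docs above (values are placeholders; how these get passed to the OpenLineage Kafka transport depends on the integration and version you're using):
```
bootstrap.servers=<namespace>.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<event-hubs-connection-string>";
```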
Following up on last month's discussion (
*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the communiyt call topics? For example, I wanted to ask more about the AirflowFacet and if we expected to introduce more tool specific facets into the spec. Where's the right place to ask that question? On the PR?
*Thread Reply:* I think asking in #general is the right place. If there's a specific GitHub issue/PR, this is a good place as well. You can tag the relevant folks to get their attention.
@here I am using the Spark listener, and whenever a query like INSERT OVERWRITE TABLE gets executed it looks like I can see some outputs, but there are no symlinks for the output table. The operation type being executed is InsertIntoHadoopFsRelationCommand. I am not sure why I can see symlinks for all the input tables but not the output tables. Anyone know the reason behind this?
*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method:
```
DatasetIdentifier di = PathUtils.fromURI(command.outputPath().toUri(), "file");
```
(https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/sr[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java)
If the dataset identifier is constructed from a path, then no symlinks are added. That's the current behaviour.
Calling io.openlineage.spark.agent.util.DatasetIdentifier#withSymlink(io.openlineage.spark.agent.util.DatasetIdentifier.Symlink) on the DatasetIdentifier in InsertIntoHadoopFsRelationVisitor could be a remedy to that.
Do you have some Spark code snippet to reproduce this issue?
*Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?
*Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!
Hi, I am trying to use the kafka transport in Spark for sending events to an Event Hub, but it requires me to set a property sasl.jaas.config, which needs to have semicolons (;) in its value. But this gives an error about being unable to convert Array to a String. I think this is due to this line which splits property values into an array if they have a semicolon: https://github.com/OpenLineage/OpenLineage/blob/92adbc877f0f4008928a420a1b8a93f394[…]pp/src/main/java/io/openlineage/spark/agent/ArgumentParser.java
Does this seem like a bug or is it intentional?
*Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński
*Thread Reply:* So we needed a generic way of passing parameters to the client and made an assumption that every field with ; will be treated as an array
*Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled
*Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible
*Thread Reply:* Maybe the condition could be having both ; and [] in the value
*Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!
*Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this
Hi everyone, I'm excited to share some good news about our progress in the LFAI & Data Foundation: we've achieved Incubation status! This required us to earn a Silver Badge from the OpenSSF, get 300+ stars on GitHub (which was NBD as we have over 1100 already), and win the approval of the LFAI & Data's TAC. Now that we've cleared this hurdle, we have access to additional services from the foundation, including assistance with creative work, marketing and communication support, and event-planning assistance. Graduation from the program, which will earn us a voting seat on the TAC, is on the horizon. Stay tuned for updates on our progress with the foundation.
LF AI & Data is an umbrella foundation of the Linux Foundation that supports open source innovation in artificial intelligence (AI) and data. LF AI & Data was created to support open source AI and data, and to create a sustainable open source AI and data ecosystem that makes it easy to create AI and data products and services using open source technologies. They foster collaboration under a neutral environment with an open governance in support of the harmonization and acceleration of open source technical projects.
For more info about the foundation and other LFAI & Data projects, visit their website.
if you want to share this news (and I hope you do!) there is a blog post here: https://openlineage.io/blog/incubation-stage-lfai/
and I'll add a quick shoutout of @Michael Robinson, who has done a whole lot of work to make this happen. Thanks, man, you're awesome!
*Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it's been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)
Congrats @Michael Robinson and @Ross Turk - > major step for Open Lineage!
Hi all, I am new to https://openlineage.io/integration/dbt/. I followed the steps on a Windows laptop, but dbt-ol does not get executed:
'dbt-ol' is not recognized as an internal or external command, operable program or batch file.
I see the following packages installed too: openlineage-dbt==0.19.2 openlineage-integration-common==0.19.2 openlineage-python==0.19.2
*Thread Reply:* What are the errors?
*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command, operable program or batch file.
*Thread Reply:* Hm, I think this is due to different windows conventions around scripts.
*Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.
*Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...
*Thread Reply:* That needs one fix though
*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :( \python.exe: No module named dbt-ol
*Thread Reply:* We're seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?
*Thread Reply:* @Michael Robinson GE issue is on Windows?
*Thread Reply:* No, not Windows
*Thread Reply:* (that I know of)
*Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations
*Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.
But if I am not in a virtual environment, it installs the packages in my PYTHONPATH. You might try this to see if the dbt-ol script can be found in one of the directories in sys.path.
*Thread Reply:* Again, I think this is windows issue
*Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?
*Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.
*Thread Reply:* Hm, then perhaps @Maciej Obuchowski is right and there is a bigger issue here
*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.
For some reason, I see only the following folder created
If it helps I am using pip 21.2.4
@Paul Villena @Stephen Said and Vishwanatha Nayak published an AWS blog Automate data lineage on Amazon MWAA with OpenLineage
*Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?
*Thread Reply:* This is an amazing write-up!
*Thread Reply:* Happy to have it promoted. Vish posted on LinkedIn: https://www.linkedin.com/posts/vishwanatha-nayak-b8462054automate-data-lineage-on-amazon-mwaa-with-activity-7021589819763945473-yMHF if you want something to repost there.
Hi guys, I am trying to build the OpenLineage jar locally for Spark. I ran ./gradlew shadowJar in the /integration/spark directory. However, I am getting this issue:
```
* What went wrong:
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage-java:0.20.0-SNAPSHOT.
     Required by:
         project :app > project :shared
      > Could not resolve io.openlineage:openlineage-java:0.20.0-SNAPSHOT.
         > Unable to load Maven meta-data from <https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/0.20.0-SNAPSHOT/maven-metadata.xml>.
            > Could not GET '<https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/0.20.0-SNAPSHOT/maven-metadata.xml>'. Received status code 401 from server: Unauthorized
```
It used to work a few weeks ago... May I ask if anyone would know what the reason might be? Thanks!
*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version.
*Thread Reply:* ./gradlew clean build publishToMavenLocal in /client/java should help.
*Thread Reply:* Ahh yeap this works, thanks!
Are there any resources to explain the differences between lineage with Apache Atlas vs. lineage using OpenLineage? We have discussions with customers and partners, and some of them are looking into which is more "ready for industry".
*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java/Apache-adjacent projects like Hive and HBase?
*Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this, but I would welcome the creation of one & pitch in where possible!
*Thread Reply:* I don't know enough about Atlas to make that doc.
Hi everyone, I am currently working on a project and we have some questions about using OpenLineage with Apache Airflow:
• How does it work: UX vs code/script? How can we implement it? A schema of its architecture, for example
• What are the visual outputs available?
• Is the lineage done from A to Z? If there are multiple intermediary transformations, for example?
• Is the lineage done horizontally across the organization or vertically on different system levels? Or both?
• Can we upgrade it to industry-level?
• Does it work with Python and/or R?
• Does it read metadata or scripts?
Thanks a lot if you can help!
*Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs
*Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.
*Thread Reply:* More explicitly:
• Airflow is an interesting platform to observe because it runs a large variety of workloads and lineage can only be automatically extracted for some of them
• In general, OpenLineage is essentially a standard and data model for lineage. There are integrations for various systems, including Airflow, that cause them to emit lineage events to an OpenLineage-compatible backend. It's a push model.
• Marquez is one such backend, and the one I recommend for testing & development
• There are a few approaches for lineage in Airflow:
    ◦ Extractors, which pair with Operators to extract and emit lineage
    ◦ Manual inlets/outlets on a task, defined by a developer - useful for PythonOperator and other cases where an extractor can't do it automatically (see the sketch after this list)
    ◦ Orchestration of an underlying OpenLineage integration, like openlineage-dbt
• IDK about "A to Z", that depends on your environment. The goal is to capture every transformation. Depending on your pipeline, there may be a set of integrations that give you the coverage you need. We often find that there are gaps.
• It works with Python. You can use the openlineage-python client to emit lineage events to a backend. This is useful if there isn't an integration for something your pipeline does.
• It describes the pipeline by observing running jobs and the way they affect datasets, not the organization. I don't know what you mean by "industry-level".
• I am not aware of an integration that parses source code to determine lineage at this time.
• The openlineage-dbt integration consumes the various metadata that dbt leaves behind to construct lineage. Dunno if that's what you mean by "read metadata".
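As promised above, a sketch of the manual inlets/outlets approach (task and table names are made up; exactly which entity fields the integration converts may vary by Airflow/openlineage-airflow version):
```python
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import Table
from airflow.operators.python import PythonOperator


def process():
    ...  # business logic the extractor cannot introspect


with DAG("manual_lineage_example", start_date=datetime(2022, 1, 1), schedule_interval=None):
    PythonOperator(
        task_id="process_orders",
        python_callable=process,
        # manually declared lineage, picked up by the OpenLineage Airflow integration
        inlets=[Table(database="analytics", cluster="warehouse", name="raw_orders")],
        outlets=[Table(database="analytics", cluster="warehouse", name="curated_orders")],
    )
```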
*Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.
*Thread Reply:* Thanks a lot!! Very helpful!
Hey folks, my team is working on a solution that would support the OL standard with column level lineage. I'm working through the architecture now and I'm wondering if everyone uses the standard rest api backed by a db or if other teams found success using other technologies such as webhooks, streams, etc in order to capture and process lineage events. I'd be very curious to connect on the topic
*Thread Reply:* Hello Brad, off the top of my head:
*Thread Reply:* • Marquez uses the HTTP POST API; so does Astro.
• Egeria and Purview prefer consuming through a Kafka topic. There is a ProxyBackend that takes HTTP POSTs and writes to Kafka. The client can also be configured to write to Kafka.
*Thread Reply:* @Will Johnson @Mandy Chessell might have opinions
*Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/
*Thread Reply:* There's a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/
*Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.
However, there is a good deal of interest in using the kafka transport instead and that's our future roadmap.
Hi everyone, I am trying to use OpenLineage with Databricks (using the 11.3 LTS runtime and OpenLineage 0.19.2).
Using this documentation I managed to install OpenLineage and send events to Marquez.
However, Marquez did not receive all COMPLETE events; it seems like the Databricks cluster is shut down immediately at the end of the job. This is not the first time that I have seen this with Databricks: last year I tried to use Spline and we noticed that Databricks does not seem to wait for the Spark session to be nicely closed before shutting down instances (see this issue).
My question is: has anyone faced the same issue? Does somebody know a workaround?
*Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don't know that there's a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!
*Thread Reply:* Ok, thanks for the idea! I'll tell you if I try this and if it works.
Hi, would anybody be able and willing to help us configure the S3 and Snowflake extractors within the Airflow integration for one of our clients? Our trouble is that the Airflow integration returns valid OpenLineage .json files but lacks any information about input and output datasets. Thanks in advance!
*Thread Reply:* Hey Petr. Please DM me or describe the issue here.
Hello, I am trying to play with the OpenLineage Spark integration with Kafka, currently just passing the config as part of the spark-submit command, but I run into errors. Details in the thread.
*Thread Reply:* Command
```
spark-submit --packages "io.openlineage:openlineage-spark:0.19.+" \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --conf "spark.openlineage.transport.type=kafka" \
  --conf "spark.openlineage.transport.topicName=topicname" \
  --conf "spark.openlineage.transport.localServerId=Kafka_server" \
  file.py
```
*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.client.transports.TransportFactory.build(TransportFactory.java:44)
at io.openlineage.spark.agent.EventEmitter.<init>(EventEmitter.java:40)
at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:278)
at io.openlineage.spark.agent.OpenLineageSparkListener.onApplicationStart(OpenLineageSparkListener.java:267)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:55)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
*Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.
*Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?
*Thread Reply:* Could I get some help on this, please?
*Thread Reply:* I think any NullPointerException is clearly our bug, can you open an issue on OL GitHub?
*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get
23/01/30 14:28:33 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I am trying to print to console at the moment. I haven't been able to get Kafka transport type working though.
*Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs
*Thread Reply:* I am trying to run a python file using pyspark. 23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I see this and don't see any events on the console.
*Thread Reply:* Any logs fitting the pattern
log.warn("Unable to access job conf from RDD", nfe);
or
log.info("Found job conf from RDD {}", jc);
before?
*Thread Reply:* ```23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents 23/01/30 14:40:49 WARN RddExecutionContext: Unable to access job conf from RDD java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117) at java.util.Optional.orElseThrow(Optional.java:290) at io.openlineage.spark.agent.lifecycle.RddExecutionContext.setActiveJob(RddExecutionContext.java:115) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$9(OpenLineageSparkListener.java:148) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:145) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
23/01/30 14:40:49 INFO RddExecutionContext: Found job conf from RDD Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-rbf-default.xml, hdfs-site.xml, hdfs-rbf-site.xml, resource-types.xml
23/01/30 14:40:49 INFO RddExecutionContext: Found output path null from RDD PythonRDD[5] at collect at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:48
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event```
I see both actually.
*Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521
*Thread Reply:* and I think I might have solution on a branch for it, just need to polish it up to release
*Thread Reply:* Aah got it. I will give it a try with SQL and a jar.
Do you have an ETA on when the Python issue would be fixed?
*Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.
*Thread Reply:* I think that has nothing to do with python
*Thread Reply:* BTW, which Spark version are you using?
*Thread Reply:* We are on 3.3.1
*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.
*Thread Reply:* I think we plan to release somewhere in the next week
*Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today
What would be the reason conn_id on something like SQLCheckOperator ends up being None when OpenLineage attempts to extract metadata, but is fine on task execution?
I'm using OpenLineage for Airflow 0.14.1 on 2.3.4 and I'm getting an error about conn_id not being found. It's a SQLCheckOperator where the check runs fine, but the task fails because when OpenLineage goes to extract task metadata it attempts to grab the conn_id, and at that moment it finds it to be None.
*Thread Reply:* hmmm, I am not sure. Perhaps @Benji Lampel can help, he's very familiar with those operators.
*Thread Reply:* @Benji Lampel any help would be appreciated!
*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?
Could you try this with a PostgresCheckOperator?
(Also, only the SqlColumnCheckOperator and SqlTableCheckOperator will provide data quality facets in their output; those functions are not implementable in the other operators at this time)
*Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.
3-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - Traceback (most recent call last):
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self.run()
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 870, in run
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self._target(*self._args, **self._kwargs)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/listener.py", line 99, in on_running
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - task_metadata = extractor_manager.extract_metadata(dagrun, task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 28, in extract_metadata
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - extractor = self._get_extractor(task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 96, in _get_extractor
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self.task_to_extractor.instantiate_abstract_extractors(task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/extractors.py", line 118, in instantiate_abstract_extractors
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - task_conn_type = BaseHook.get_connection(task.conn_id).conn_type
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/airflow/hooks/base.py", line 67, in get_connection
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - conn = Connection.get_connection_from_secrets(conn_id)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/airflow/models/connection.py", line 430, in get_connection_from_secrets
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined")
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - airflow.exceptions.AirflowNotFoundException: The conn_id `None` isn't defined
*Thread Reply:* and above that
[2023-01-31, 00:32:38 UTC] {connection.py:424} ERROR - Unable to retrieve connection from secrets backend (EnvironmentVariablesBackend). Checking subsequent secrets backend.
Traceback (most recent call last):
File "/code/venvs/venv/lib/python3.8/site-packages/airflow/models/connection.py", line 420, in get_connection_from_secrets
conn = secrets_backend.get_connection(conn_id=conn_id)
File "/code/venvs/venv/lib/python3.8/site-packages/airflow/secrets/base_secrets.py", line 91, in get_connection
value = self.get_conn_value(conn_id=conn_id)
File "/code/venvs/venv/lib/python3.8/site-packages/airflow/secrets/environment_variables.py", line 48, in get_conn_value
return os.environ.get(CONN_ENV_PREFIX + conn_id.upper())
*Thread Reply:* sorry, I should mention we're wrapping over the CheckOperator, as we're still migrating from 1.10.15 @Benji Lampel
*Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?
*Thread Reply:* like so:
```
class CustomSQLCheckOperator(CheckOperator):
    ....
```
*Thread Reply:* I think I found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID, which is why CONN_ID is always None. That path only gets exercised through OpenLineage, and it never gets called with our custom wrapper.
Hi everyone, I am using openlineage to capture column level lineage from spark databricks. I noticed that the environment variables captured are only present in the start event, but are not present in the complete event. Is there a reason why it is implemented like this? It seems more intuitive that whatever variables are present in the start event should also be present in the complete event...
Hi everyone. Does the dbt integration provide an option to emit events to a Kafka topic similar to the Spark integration? I could not find anything regarding this in the documentation and I wanted to make sure whether only the http transport type is supported. Thank you!
*Thread Reply:* The dbt integration uses the python client, so you should be able to do something similar to what you can do with the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
*Thread Reply:* Thank you for this!
I created an openlineage.yml file with the following data to test out the integration locally.
transport:
  type: "kafka"
  config: { 'bootstrap.servers': 'localhost:9092' }
  topic: "ol_dbt_events"
However, I run into a no module named 'confluent_kafka' error from this code.
Running OpenLineage dbt wrapper version 0.19.2
This wrapper will send OpenLineage events at the end of dbt execution.
Traceback (most recent call last):
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/bin/dbt-ol", line 168, in <module>
main()
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/bin/dbt-ol", line 94, in main
client = OpenLineageClient.from_environment()
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/client.py", line 73, in from_environment
return cls(transport=get_default_factory().create())
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/factory.py", line 37, in create
return self._create_transport(yml_config)
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/factory.py", line 69, in _create_transport
return transport_class(config_class.from_dict(config))
File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/kafka.py", line 43, in __init__
import confluent_kafka as kafka
ModuleNotFoundError: No module named 'confluent_kafka'
Manually installing confluent-kafka worked. But I am curious why it was not automatically installed and if I am missing any config.
*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install for every user something giant that the vast majority won't use, and it's 100x bigger than the rest of the client.
We need to indicate this much better, though, and not throw this error directly at the user, both in docs and code.
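For anyone hitting the same thing, a minimal sketch of wiring the Kafka transport up in code rather than via openlineage.yml; the class names are assumptions based on the openlineage-python client of that era, and confluent-kafka still has to be installed separately (e.g. pip install confluent-kafka):
```
# Hedged sketch, not official docs: build the Kafka transport programmatically.
from openlineage.client import OpenLineageClient
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

kafka_config = KafkaConfig(
    config={"bootstrap.servers": "localhost:9092"},  # handed through to confluent_kafka.Producer
    topic="ol_dbt_events",
)
client = OpenLineageClient(transport=KafkaTransport(kafka_config))
```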
~Hey, would love to see a release of OpenLineage~
Hello, I have been working on a proposal to bring an OpenLineage provider to Airflow. I am currently looking for feedback on a draft AIP. See the thread here: https://lists.apache.org/thread/2brvl4ynkxcff86zlokkb47wb5gx8hw7
@Willy Lulciuc, - Any updates on - https://github.com/OpenLineage/OpenLineage/discussions/1494
Hello, While trying to use OpenLineage with spark, I've noticed that sometimes the query execution is missing or already got closed (here is the relevant code). As a result, some of the events are skipped. Is this a known issue? Is there a way to overcome it?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556
Does this fit your experience?
*Thread Reply:* We sometimes experience this in context of very small, quick jobs
*Thread Reply:* Yes, my scenarios are dealing with quick jobs. Good to know that we will be able to solve it with future spark versions. Thanks!
@channel This monthâs OpenLineage TSC meeting is next Thursday, February 9th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:
Hi folks, I'm opening a vote to release OpenLineage 0.20.0, featuring:
• Airflow: add new extractor for GCSToGCSOperator
Adds a new extractor for this operator.
• Proxy: implement lineage event validator for client proxy
Implements logic in the proxy (which is still in development) for validating and handling lineage events.
• A fix of a breaking change in the common integration and other bug fixes in the DBT, Airflow, Spark, and SQL integrations and in the Java and Python clients.
As per the policy here, three +1s from committers will authorize. Thanks in advance.
*Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park
*Thread Reply:* This was without a doubt among the fastest release votes we've ever had. Thank you! You can expect the release to happen on Monday.
*Thread Reply:* Lol the proxy is still in development and not ready for use
*Thread Reply:* Good point! Let's make that clear in the release / docs?
*Thread Reply:* But it doesn't block anything anyway, so happy to see the release
*Thread Reply:* We can celebrate that the proposal for the proxy is merged. I'm happy with that
Hey! From what I gather, there's no solution to getting column level lineage from spark streaming jobs. Is there an issue I can follow to keep track?
*Thread Reply:* Hey @Daniel Joanes! thanks for the question.
I am not aware of an issue that captures this. Column-level lineage is a somewhat new facet in the spec, and implementations across the various integrations are in varying states of readiness.
I invite you to create the issue - that way it's attributed to you, which makes sense because you're the one who first raised it. But I'm happy to create it for you & give you the PR# if you'd rather, just let me know đ
*Thread Reply:* Go for it, once it's created i'll add a watch
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581
@channel
OpenLineage 0.20.4 is now available, including:
Additions:
• Airflow: add new extractor for GCSToGCSOperator #1495 @sekikn
• Flink: resolve topic names from regex, support 1.16.0 #1522 @pawel-big-lebowski
• Proxy: implement lineage event validator for client proxy #1469 @fm100
Changes:
• CI: use ruff instead of flake8, isort, etc., for linting and formatting #1526 @mobuchowski
Plus many bug fixes & doc changes.
Thank you to all our contributors!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.20.4
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.19.2...0.20.4
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel Friendly reminder: this month's OpenLineage TSC meeting is tomorrow at 10am, and all are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1675354153489629
Hey, can we please schedule a release of OpenLineage? I would like to have a release that includes the latest fixes for Async Operator on Airflow and some dbt bug fixes.
*Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.
*Thread Reply:* 0.20.5?
*Thread Reply:* the release is authorized
Hi all, I have been experimenting with OpenLineage for a few days and it's great! I successfully set up the openlineage-spark listener on my Databricks cluster and that pushes openlineage data to our Marquez backend. That was all pretty easy to do.
Now for my challenge: I would like to actually extend the metadata that my cluster pushes with custom values (you can think of spark config settings, commit hash of the executed code, or maybe even runtime defined values). I browsed through some documentation and found custom facets one can define. The link below describes how to use Python to push custom metadata to a backend, but I was actually hoping that there was a way to do this automatically in Spark. So ideally I would like to write my own OpenLineage.json (that has my custom facet) and tell Spark to use that Openlineage spec instead of the default one. In that way I hope my custom metadata will be forwarded automatically.
I just do not know how to do that (and whether that is even possible), since I could not find any tutorials on that topic. Any help on this would be greatly appreciated!
https://openlineage.io/docs/spec/facets/custom-facets
*Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.
*Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.
Although the OpenLineage spec provides some built-in facet definitions, a facet object can be anything you want (https://openlineage.io/apidocs/openapi/#tag/OpenLineage/operation/postRunEvent). The example metadata provided in this chat could be put into job or run facets, I believe.
There is also a way to extend the Spark integration to collect custom metadata, described here (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending). One needs to create their own JAR with DatasetFacetBuilders, RunFacetsBuilder (whatever is needed), and the openlineage-spark integration will make use of those builders.
*Thread Reply:* (I would love to see what your specs are! I'm not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)
*Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!
*Thread Reply:* I think @Will Johnson might have something to add about customization
*Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!
@Susmitha Anandarao - You might take a look at https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java which has a hard coded set of properties we are extracting.
It looks like Avinash's changes were accepted as well: https://github.com/OpenLineage/OpenLineage/pull/1545
@channel
OpenLineage 0.20.6 is now available, including:
Additions
• Airflow: add new extractor for FTPFileTransmitOperator #1603 @sekikn
Changes
• Airflow: make extractors for async operators work #1601 @JDarDagran
Thanks to all our contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.20.6
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.20.4...0.20.6
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi everyone, in case you missed the announcement at the most recent community meeting, our first-ever meetup will be held on March 9th in Providence, RI. Join us there to learn more about the present and future of OpenLineage, meet other members of the ecosystem, learn about the project's goals and fundamental design, and participate in a discussion about the future of the project. Food will be provided, and the meetup is open to all. Don't miss this opportunity to influence the direction of this important new standard! We hope to see you there. More information: https://openlineage.io/blog/data-lineage-meetup/
Hi, I opened a PR to fix the way that the Athena extractor gets the database, but the spark integration tests failed. However, I don't think that it is related to my PR, since I only updated the Airflow integration. Can anybody help me with that please?
*Thread Reply:* https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/6398/workflows/9d2d19c8-f2d9-4148-a4f3-5dad3ba99eb1/jobs/97759
ERROR: Missing environment variable {i}
*Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.
*Thread Reply:* Feel free to un-draft your PR if you think it's ready for review
*Thread Reply:* I think it's ready, however should I update the version somewhere?
*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened the PR as a Draft, so I'm not sure if you want to add something else to it.
*Thread Reply:* No I don't want to add anything, so I opened it
@here I have a question about extending the spark integration. Is there a way to use a custom visitor factory? I am trying to see if I can add a visitor for a command that is not currently covered in this integration (AlterTableAddPartitionCommand). It seems that because its not in the base visitor factory I am unable to use the visitor I created.
*Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending
*Thread Reply:* I have that set up already like this:
public class LyftOpenLineageEventHandlerFactory implements OpenLineageEventHandlerFactory {
  @Override
  public Collection<PartialFunction<LogicalPlan, List<OutputDataset>>>
      createOutputDatasetQueryPlanVisitors(OpenLineageContext context) {
    Collection<PartialFunction<LogicalPlan, List<OutputDataset>>> visitors =
        new ArrayList<PartialFunction<LogicalPlan, List<OutputDataset>>>();
    visitors.add(new LyftInsertIntoHadoopFsRelationVisitor(context));
    visitors.add(new AlterTableAddPartitionVisitor(context));
    visitors.add(new AlterTableDropPartitionVisitor(context));
    return visitors;
  }
}
*Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change
*Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn't be needed for your custom one
*Thread Reply:* Have you added the file to the META-INF folder of your jar?
*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors but for some reason I cant access the visitors for some commands (AlterTableAddPartitionCommand) is one
*Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor
*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe I'm wrong @Michael Collado
*Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.
*Thread Reply:* there might be a conflict with another visitor that's being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?
*Thread Reply:* This was helpful, it works now, thank you so much Michael!
*Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)
*Thread Reply:* yep I am running
curl -X POST http://localhost:5000/api/v1/namespaces/test ^ -H 'Content-Type: application/json' ^ -d '{ownerName:"me", description:"no description"^ }'
the weird thing is the log, where I don't have a 0.0.0.0 IP (the log corresponds to the equivalent postman command)
marquez-api | WARN [2023-02-17 00:14:32,695] marquez.logging.LoggingMdcFilter: status: 405
marquez-api | XXX.23.0.1 - - [17/Feb/2023:00:14:32 +0000] "POST /api/v1/namespaces/test HTTP/1.1" 405 52 "-" "PostmanRuntime/7.30.0" 2
*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:
marquez-api | DELETE /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | GET /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | PUT /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
To ADD a namespace, you'll want to use PUT (see API docs)
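For reference, a rough Python equivalent of that PUT call (a sketch using the requests library; the payload fields simply mirror the curl above and are illustrative):
```
# Hedged sketch of creating a namespace via the Marquez API with PUT
# (endpoint per the API docs linked above; payload values are illustrative).
import requests

resp = requests.put(
    "http://localhost:5000/api/v1/namespaces/test",
    headers={"Content-Type": "application/json"},
    json={"ownerName": "me", "description": "no description"},
)
print(resp.status_code, resp.json())
```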
*Thread Reply:* 3rd stupid question of the night... sorry, I kept on trying POST, who knows why
*Thread Reply:* no worries! keep the questions coming!
*Thread Reply:* well, maybe because it's so late on your end! get some rest!
*Thread Reply:* Yeah, but I want to see how it works. Right now I get a 200 response for the creation of the namespace ... but it seems that nothing occurred, neither on the Marquez front end (localhost:3000) nor in the database
*Thread Reply:* can you curl the list namespaces endpoint?
*Thread Reply:* yep: nothing changed, only default and food_delivery
*Thread Reply:* can you post your server logs? you should see the request
*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7
marquez-api | INFO [2023-02-17 00:32:07,072] marquez.logging.LoggingMdcFilter: status: 200
*Thread Reply:* the server is returning a 500?
*Thread Reply:* odd that LoggingMdcFilter is logging 200
*Thread Reply:* Bit confused because now I realize that postman is returning bad request
*Thread Reply:* You'll notice that I go to use 3000 in the url. If I use 5000 I get No host
*Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?
*Thread Reply:* Hello Willy. I am starting from scratch following the instructions from https://openlineage.io/docs/getting-started/, and I am on Windows. Instead of git clone git@github.com:MarquezProject/marquez.git && cd marquez I run
git clone <https://github.com/MarquezProject/marquez.git>
But before that I had to turn off the automatic carriage return in git:
git config --global core.autocrlf false
This avoids an error message on marquez-api when running wait-for-it.sh at line 1, where
#!/usr/bin/env bash
is otherwise read as
#!/usr/bin/env bash\r'
It turns out that switching off the auto CR also impacts some files containing the marquez password ... and I get a failure accessing the db. To overcome this I ran notepad++ and replaced ALL the \r\n with \n, and in this way I managed to run docker\up.sh and docker\down.sh correctly (with or without seed ... with access to the db via pgadmin)
Hi, I'd like to capture column lineage from spark, but also capture how the columns are transformed, and any column operations that are done too. May I ask if this feature is supported currently, or will be supported in future based on current timeline? Thanks!
*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in OpenLineage spec to contain that information:
"transformationDescription": {
"type": "string",
"description": "a string representation of the transformation applied"
},
"transformationType": {
"type": "string",
"description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)"
}
so the standard is ready to support it. We included two fields so that one can contain a human-readable description of what is happening. However, we don't have this implemented in the Spark integration.
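To make those two fields concrete, here is a hedged illustration of where they sit inside a columnLineage dataset facet; the dataset, column names and the hash expression are all invented, only the shape follows the spec:
```
# Hedged illustration only: shape per the column lineage facet spec; values are made up.
column_lineage_facet = {
    "fields": {
        "email_masked": {
            "inputFields": [
                {"namespace": "snowflake://my_account", "name": "db.schema.users", "field": "email"}
            ],
            "transformationDescription": "sha2(email, 256)",
            "transformationType": "MASKED",
        }
    }
}
```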
*Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?
*Thread Reply:* I think there will be a growing interest in that. In general a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.
To sum up, we don't yet have a proposal on that, but this seems to be a natural next step in enriching the column lineage features.
*Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute!
*Thread Reply:* Great to hear that. What you could perhaps start with is coming to our monthly OpenLineage meetings and asking @Michael Robinson to put this item on the discussion list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.
*Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!
*Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I'll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.
Hello how can I improve the verbosity of the marquez-api? Regards
*Thread Reply:* Hi @thebruuu, pls take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.
Hey, can we please schedule a release of OpenLineage? I would like to have the release that includes the feature to capture custom env variables from spark clusters... Thanks!
*Thread Reply:* We generally schedule a release every month; the next one will be next week - is that okay @Anirudh Shrinivason?
*Thread Reply:* Yes, there's one scheduled for next Wednesday, if that suits.
*Thread Reply:* Okay yeah sure that works. Thanks
*Thread Reply:* @Anirudh Shrinivason we're expecting the release to happen today or tomorrow, FYI
*Thread Reply:* Awesome thanks
Hello team, we use OpenLineage and Great Expectations integrated together. I want to use GE to verify a table in Snowflake. After adding OpenLineage to my GE configuration, running it produced this error. Could someone please give me some answers?
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/great_expectations/validation_operators/validation_operators.py", line 469, in _run_actions
action_result = self.actions[action["name"]].run(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/great_expectations/checkpoint/actions.py", line 106, in run
return self._run(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 156, in _run
datasets = self._fetch_datasets_from_sql_source(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 362, in _fetch_datasets_from_sql_source
self._get_sql_table(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 395, in _get_sql_table
if engine.connection_string:
AttributeError: 'Engine' object has no attribute 'connection_string'
'Engine' object has no attribute 'connection_string'
*Thread Reply:* This is my checkpoint configuration in GE.
```
name: 'openlineage_checkpoint'
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: '%Y%m%d-%H%M%S-my_checkpoint'
expectation_suite_name: EMAIL_VALIDATION
batch_request:
action_list:
  # dev or prod, etc.
  job_name: ge_validation
evaluation_parameters: {}
runtime_configuration: {}
validations:
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
```
*Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?
*Thread Reply:* I use the latest version of Great Expectations. This error occurs either directly through Great Expectations or airflow
*Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.
*Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment
*Thread Reply:* Thanks Benji, but I still have the same problem after I drop to great-expectations==0.15.44, this is my requirements file
great_expectations==0.15.44
sqlalchemy
psycopg2-binary
numpy
pandas
snowflake-connector-python
snowflake-sqlalchem
*Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack
Hello Team, Iâm trying to use Open Lineage with AWS Glue and Marquez. Has anyone successfully integrated AWS Workflows/ Glue ETL jobs with Open Lineage?
*Thread Reply:* I know I'm responding to an older post - I'm not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/ Are you using AWS Glue with Spark jobs?
*Thread Reply:* This was proposed by our AWS Solution architect but we are not seeing much improvement compared to open lineage. Have you deployed the above solution to prod?
*Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don't want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn't an improvement. Our research is on magically getting lineage from existing scripts
Hello everyone, I'm opening a vote to release OpenLineage 0.21.0, featuring:
• a new CustomEnvironmentFacetBuilder class and new output visitors AlterTableAddPartitionCommandVisitor and AlterTableSetLocationCommandVisitor in the Spark integration
• a Linux-ARM version of the SQL parser's native library
• DEBUG logging of events in transports
• bug fixes and more.
Three +1s from committers will authorize an immediate release.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.
I've got some security related questions/observations. The main site suggests opening an issue to report vulnerabilities etc. I wanted to check if there is a private mailing list/DM channel to just check a few things first? I'm happy to use github issues otherwise. Thanks!
*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.
Reminder: our first meetup is one week from today in Providence, RI! You can find the details in the meetup blog post. And if you're coming, it would be great if you could RSVP. Looking forward to seeing some of you there!
@channel
We released OpenLineage 0.21.1, including:
Additions
• Clients: add DEBUG logging of events to transports #1633 by @mobuchowski
• Spark: add CustomEnvironmentFacetBuilder class #1545 by New contributor @Anirudh181001
• Spark: introduce the new output visitors AlterTableAddPartitionCommandVisitor and AlterTableSetLocationCommandVisitor #1629 by New contributor @nataliezeller1
• Spark: add column lineage for JDBC relations #1636 by @tnazarew
• SQL: add linux-aarch64 native library to Java SQL parser #1664 by @mobuchowski
Changes
• Airflow: get table database in Athena extractor #1631 by New contributor @rinzool
Removals
• Airflow: remove JobIdMapping and update macros to better support Airflow version 2+ #1645 by @JDarDagran
Thanks to all our contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.21.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.20.6...0.21.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
how do you turn off the openlineage listener in airflow 2? for some reason we're seeing a Thread-2 and seeing it fire twice in tasks
*Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?
*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 that was fixed in 0.20.6
*Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?
*Thread Reply:* meaning, you don't want openlineage to collect any information from your Airflow deployment?
*Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars
*Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both
@channel This month's OpenLineage TSC meeting is next Thursday, March 9th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:
Hi everyone, I noticed that Openlineage is sending each of the events twice for spark. Is this expected? Is there some way to disable this behaviour?
*Thread Reply:* Are you seeing duplicate START events, or do you see two events, one that is a START and one that is a COMPLETE?
OpenLineage's events may send partial information. You should expect to collect all events for a given RunId and merge them together to get the complete events.
In addition, some data sources are really chatty like Delta tables. That may cause you to see many events that look very similar.
*Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command
*Thread Reply:* And 2 complete
*Thread Reply:* I am currently only testing on parquet tables...
*Thread Reply:* One of openlineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case of other unwanted events being generated and sent.
*Thread Reply:* Ahh I see...okay thanks!
*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios
*Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?
*Thread Reply:* CC: @Paweł Leszczyński ^
@channel Reminder: this month's OpenLineage TSC meeting is tomorrow at 10am PT. All are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1677806982084969
Hi if we have OpenLineage listener configured as a default spark conf, is there an easy way to disable ol for a specific notebook?
*Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true
Hey all,
I opened a PR (and corresponding issue) to change how naming works in OpenLineage. The idea generally is to move from Naming.md as the end-all-be-all of names for integrations, and towards JSON schemas per integration, with each schema defining very precisely what fields a name and namespace should contain, how they're connected, and how they're validated. Would really appreciate some feedback as this is a pretty big change!
What do I need to do to enable dag level metric capturing for airflow? I followed the instructions to install openlineage 0.21.1 on airflow 2.3.3. When I run a DAG I see metrics related to Task start, success/failure. But I don't see any metrics for Dag success/failure. Do I have to do something to enable DAG execution capturing?
*Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113
*Thread Reply:* you're right, except the change was included in 2.5.0
Fresh on the heels of our first-ever in-person event, we're meeting up again soon at Data Council Austin! Join us on March 30th (the same day as @Julien Le Dem's talk) at 12:15 pm to discuss the project's goals and design, meet other members of the data ecosystem, and help shape the future of the spec. For more info, check out the OpenLineage blog. If you haven't registered for the conference yet, click OpenLineage20 for a special rate. Hope to see you there!
If someone is using airflow and DAG-docs for lineage, can they export the lineage in, say, OL format?
*Thread Reply:* I don't see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?
*Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)
*Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?
(is it https://docs.astronomer.io/learn/airflow-openlineage ? )
Happy Friday! I am looking for some help setting the parent information for a dbt run. I have set the namespace variable in the openlineage.yml, but it doesn't seem to take effect and ends up using the default value of dbt. I'm also using openlineage.yml to set the transport properties for emitting to kafka. Is there a way to set parent namespace, name and run id in the yml file? Thank you!
*Thread Reply:* dbt-ol does not read from openlineage.yml, so you need to pass this information in the OPENLINEAGE_NAMESPACE environment variable
*Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I'll do some testing with Kafka backends.
*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env OPENLINEAGE_CONFIG to point to the yml file holding the transport info, and OPENLINEAGE_NAMESPACE
*Thread Reply:* Awesome! That's exactly what I was going to test.
*Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.
*Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read the OL namespace from openlineage.yml but from the OPENLINEAGE_NAMESPACE env var instead
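Putting the thread together, a small sketch of how the pieces are meant to combine when launching dbt-ol; paths and the namespace are illustrative, and the exact behavior should be checked against the openlineage-dbt version in use:
```
# Hedged sketch: transport settings come from the YAML pointed to by OPENLINEAGE_CONFIG,
# while the job namespace comes from OPENLINEAGE_NAMESPACE (per the thread above).
import os
import subprocess

os.environ["OPENLINEAGE_CONFIG"] = "/path/to/openlineage.yml"   # kafka transport config
os.environ["OPENLINEAGE_NAMESPACE"] = "my_dbt_namespace"        # used instead of the default "dbt"
subprocess.run(["dbt-ol", "run"], check=True)
```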
Data Council Austin, the host of our next meetup, is one week away: https://openlineage.slack.com/archives/C01CK9T7HKR/p1678822654288379
In addition to Data Council Austin next week, the hybrid Big Data Technology Warsaw Summit will be taking place on March 28th-30th, featuring three of our committers: @Maciej Obuchowski, @Paweł Leszczyński and @Ross Turk! There's more info here: https://bigdatatechwarsaw.eu/
hey folks, is anyone capturing dataset metadata for multi-table schemas? I'm looking at the schema dataset facet: https://openlineage.io/docs/spec/facets/dataset-facets/schema but it looks like this only represents a single table, so I'm wondering if I'll need to write a custom facet
*Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table
*Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that
*Thread Reply:* So from what I understand:
I would model that as multiple OL jobs that describe each dataset mapping. Additionally, I'd have one "wrapping" job that represents your definition of a job. The rest of those jobs would refer to it in ParentRunFacet.
This is a pattern we use for Airflow and dbt dags.
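A rough sketch of that pattern with the Python client, in case it helps; the job, dataset and namespace names are invented, and the classes are from openlineage-python as I understand them, so treat this as an assumption rather than a reference implementation:
```
# Hedged sketch: one per-table child job pointing back to a "wrapping" parent run
# via ParentRunFacet. Only the COMPLETE event of a single child job is shown.
from uuid import uuid4
from openlineage.client import OpenLineageClient
from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient.from_environment()
parent_run_id = str(uuid4())  # the run id of the wrapping "sync" job

child_run = Run(
    runId=str(uuid4()),
    facets={"parent": ParentRunFacet.create(parent_run_id, "fivetran", "sync_salesforce")},
)
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime="2023-03-23T00:00:00Z",
        run=child_run,
        job=Job(namespace="fivetran", name="sync_salesforce.accounts"),
        producer="https://example.com/producer",
        inputs=[Dataset(namespace="salesforce", name="accounts")],
        outputs=[Dataset(namespace="snowflake://my_account", name="raw.salesforce.accounts")],
    )
)
```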
*Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me
has anyone had success creating custom facets using java? I'm following this guide: https://openlineage.io/docs/spec/facets/custom-facets and I'm wondering if it makes sense to manually create POJOs or if others are creating the json schema for the object and then automatically generating the java code?
*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.
For now, the JSON Schema generator isn't flexible enough to generate custom facets from whatever schema we give it, so it would be unnecessary complexity
*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JSONNode or even Map.
One other question: I'm in the process of adding different types of facets to our base payloads and I'm wondering if we have any related guidelines / best practices / standards / conventions. For example if I add a full source schema as a schema dataset facet to every start event it seems like that could be inefficient compared to a 1-time full-source-schema followed by incremental diffs for each following sync. Curious how others are thinking about + solving these types of problems in practice
*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if it's not sent.
For others, like OutputStatisticsOutputDatasetFacet, you definitely can't assume that, as the data is unique to each run.
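For illustration, a hedged example of attaching a schema facet to a dataset with the Python client (facet classes from openlineage.client.facet as I understand them; table and column names are invented):
```
# Hedged sketch only: attach a SchemaDatasetFacet to an input dataset. Per the reply
# above, a consumer may treat a later event without this facet as "schema unchanged".
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
from openlineage.client.run import Dataset

source_table = Dataset(
    namespace="postgres://prod-db",
    name="public.customers",
    facets={
        "schema": SchemaDatasetFacet(
            fields=[
                SchemaField(name="id", type="integer"),
                SchemaField(name="email", type="varchar"),
            ]
        )
    },
)
```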
*Thread Reply:* ok great thanks, that makes sense to me
*Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/
Hi everyone, I recently encountered this error saying V2SessionCatalog is not supported by openlineage. May I ask if support for this will be added in near future? Thanks!
*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.
*Thread Reply:* Sure thanks!
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 I have opened an issue here. Thanks!
*Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?
Hi all! Can anyone provide me some advice on how to solve this error:
ValueError: `emit` only accepts RunEvent class
[2023-04-02, 23:22:00 UTC] {taskinstance.py:1326} INFO - Marking task as FAILED. dag_id=etl_openlineage, task_id=send_ol_events, execution_date=20230402T232112, start_date=20230402T232114, end_date=20230402T232200
[2023-04-02, 23:22:00 UTC] {standard_task_runner.py:105} ERROR - Failed to execute job 400 for task send_ol_events (`emit` only accepts RunEvent class; 28020)
[2023-04-02, 23:22:00 UTC] {local_task_job.py:212} INFO - Task exited with return code 1
[2023-04-02, 23:22:00 UTC] {taskinstance.py:2585} INFO - 0 downstream tasks scheduled from follow-on schedule check
I'm trying to follow this tutorial (https://openlineage.io/blog/openlineage-snowflake/) on connecting Snowflake to OpenLineage through Apache Airflow, however, the last step (sending the OpenLineage events) returns an error.
*Thread Reply:* The blog post is a bit old, and in the meantime there were changes introduced in the OpenLineage Python client. May I ask if you just want to test the flow, or are you looking for a viable Snowflake data lineage solution?
*Thread Reply:* I believe that this will work if you change the line to client.transport.emit()
*Thread Reply:* (this would be in the dags/lineage folder, if memory serves)
*Thread Reply:* Ross is right, that should work
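For anyone following along, a minimal sketch of the change being suggested; the event variable and its contents are illustrative, not taken from the tutorial:
```
# Hedged sketch: newer openlineage-python clients validate that emit() receives a
# RunEvent, so the tutorial's raw dict payloads can be pushed through the underlying
# transport instead (the workaround confirmed in this thread).
from openlineage.client import OpenLineageClient

client = OpenLineageClient.from_environment()
ol_event = {"eventType": "COMPLETE", "eventTime": "2023-04-02T23:22:00Z"}  # truncated raw event dict

# client.emit(ol_event)           # raises: `emit` only accepts RunEvent class
client.transport.emit(ol_event)   # send the raw payload through the transport instead
```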
*Thread Reply:* This works! Thank you so much!
*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside an Amazon DataZone Catalog
*Thread Reply:* I have been meaning to revisit that tutorial
Hello all,
I'd like to open a vote to release OpenLineage 0.22.0, including:
• a new properties facet in the Spark integration
• a new field in HttpConfig for passing custom headers in the Spark integration
• improved namespace generation for JDBC connections in the Spark integration
• removal of unnecessary warnings about column lineage in the Spark integration
• support for alter, truncate, and drop statements in the SQL parser
• typing hints in the SQL integration
• a new from_dict class method in the Python client to support creating it from a dictionary
• a case-insensitive env variable for disabling OpenLineage in the Python client and Airflow integration
• bug fixes, docs changes, and more.
Three +1s from committers will authorize an immediate release. For more details about the release process, see GOVERNANCE.md.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.
@channel
We released OpenLineage 0.22.0, including:
Additions:
• Spark: add properties facet #1717 by @tnazarew
• SQL: SQLParser supports alter, truncate and drop statements #1695 by @pawel-big-lebowski
• Common/SQL: provide public interface for openlineage_sql package #1727 by @JDarDagran
• Java client: add configurable headers to HTTP transport #1718 by @tnazarew
• Python client: create client from dictionary #1745 by @JDarDagran
Changes:
• Spark: remove URL parameters for JDBC namespaces #1708 by @tnazarew
• Make OPENLINEAGE_DISABLED case-insensitive #1705 by @jedcunningham
Removals:
• Spark: remove unnecessary warnings for column lineage #1700 by @pawel-big-lebowski
• Spark: remove deprecated configs #1711 by @tnazarew
Thanks to all the contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.22.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.21.1...0.22.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi everyone, if I set executors to 0, and bind address to localhost, and then if I want to use openlineage to capture metadata, I seem to run into an error where the executor tries to fetch the spark jar from the driver, even though there is no executor set. Then, it fails because a connection cannot be established. This is some of the error stack trace:
INFO Executor: Fetching spark://<DRIVER_IP>:44541/jars/io.openlineage_openlineage-spark-0.21.1.jar with timestamp 1680506544239
ERROR Utils: Aborting task
java.io.IOException: Failed to connect to /<DRIVER_IP>:44541
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:399)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:367)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:755)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:541)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$13(Executor.scala:953)
at org.apache.spark.executor.Executor.$anonfun$updateDependencies$13$adapted(Executor.scala:945)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:945)
at org.apache.spark.executor.Executor.<init>(Executor.scala:247)
at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<DRIVER_IP>:44541
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Just curious if anyone here has run into a similar problem before, and what the recommended way to resolve this would be...
*Thread Reply:* Do you have small configuration and job to replicate this?
*Thread Reply:* Yeah. For configs:
spark.driver.bindAddress: "localhost"
spark.master: "local[**]"
spark.sql.catalogImplementation: "hive"
spark.openlineage.transport.endpoint: "<endpoint>"
spark.openlineage.transport.type: "http"
spark.sql.catalog.spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
spark.openlineage.transport.url: "<url>"
spark.extraListeners: "io.openlineage.spark.agent.OpenLineageSparkListener"
and job is submitted via spark submit in client mode with number of executors set to 0.
The spark job by itself could be anything...I think the job fails before initializing the spark session itself.
*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...
*Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in the Spark jars would fall under the same problems, right?
*Thread Reply:* Yeah... Actually, this was because of binding the driver ip to localhost. In that case, the executor was not able to get the jar from the driver. But yeah, I don't think we could have done anything from the openlineage end anyway for this. Was just an interesting error to encounter lol
Hi, I am new to open lineage. I was able to follow https://openlineage.io/getting-started/ to create a lineage "my-input-->my-job-->my-output". I want to use "my-output" as an input dataset and connect it to the next job, something like this: "my-input-->my-job-->my-output-->my-job2-->my-final-output". How do I do that? I have trouble setting eventType and runId, etc. Once the new lineages get messed up, the Marquez UI becomes blank (which is a separate issue).
*Thread Reply:* In this case you would have four runevents:
1. START event on my-job where my-input is the input and my-output is the output, with a runId you generate on the client
2. COMPLETE event on my-job with the same runId from #1
3. START event on my-job2 where the input is my-output and the output is my-final-output, with a separate runId you generate
4. COMPLETE event on my-job2 with the same runId from #3
*Thread Reply:* thanks for the response. I tried it, but now the UI only shows for about a second and then turns blank. I had a similar issue before; it seems that every time I add a bad lineage, the UI stops working and I have to delete the docker image :-( Not sure whether it is a macOS M1 related issue.
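A rough sketch of those four events with the Python client, in case it helps; the Marquez URL, namespace and timestamps are illustrative:
```
# Hedged sketch: emit START/COMPLETE for my-job, then START/COMPLETE for my-job2,
# reusing my-output as the input of the second job so the graph chains together.
from uuid import uuid4
from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
producer = "https://example.com/poc"

my_input = Dataset(namespace="example", name="my-input")
my_output = Dataset(namespace="example", name="my-output")
my_final_output = Dataset(namespace="example", name="my-final-output")

run1, run2 = Run(runId=str(uuid4())), Run(runId=str(uuid4()))   # a distinct runId per job run
job1, job2 = Job(namespace="example", name="my-job"), Job(namespace="example", name="my-job2")

events = [
    RunEvent(RunState.START,    "2023-04-14T10:00:00Z", run1, job1, producer, [my_input],  [my_output]),
    RunEvent(RunState.COMPLETE, "2023-04-14T10:01:00Z", run1, job1, producer, [my_input],  [my_output]),
    RunEvent(RunState.START,    "2023-04-14T10:02:00Z", run2, job2, producer, [my_output], [my_final_output]),
    RunEvent(RunState.COMPLETE, "2023-04-14T10:03:00Z", run2, job2, producer, [my_output], [my_final_output]),
]
for event in events:
    client.emit(event)
```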
*Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.
*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!
*Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.
However, I don't think you can currently do any filtering over it
*Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430
*Thread Reply:* those examples really help. I can at least build the lineage with column level info using the apis. thanks a lot! Ideally I'd like to select one column from the UI and have it show me the column level graph. That seems not possible.
*Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph
Is airflow mandatory, while integrating snowflake with openlineage?
I am currently looking for a solution which can capture lineage details from snowflake execution
*Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?
*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?
@channel This monthâs OpenLineage TSC meeting is on Thursday, April 20th, at 10 am PT. Meeting info: https://openlineage.io/meetings/. All are welcome! On the tentative agenda:
Hi!
I have a specific question about how OpenLineage fits in between Amazon MWAA and Marquez on AWS EKS. I guess I need to change for example the etl_openlineage DAG in this Snowflake integration tutorial and the OPENLINEAGE_URL here. However, I'm wondering how to reproduce the Docker containers airflow, airflow_scheduler, and airflow_worker here.
I heard from @Ross Turk that @Willy Lulciuc and @Michael Collado are experts on the K8s integration for OpenLineage and Marquez. Could you provide me some recommendations on how to approach this integration? Or can anyone else help me?
Kind regards,
Tom
[RESOLVED] Hi there, I'm doing a POC of OpenLineage for our airflow deployment. We have a ton of custom operators and I'm trying to test out extracting lineage using the get_openlineage_facets_on_start method. Currently when I'm testing I can see that the OpenLineage plugin is running via airflow plugins but am not able to see that the method is ever getting called. Do I need to do anything else to tell the default extractor to use get_openlineage_facets_on_start? This is the documentation I'm referencing: https://openlineage.io/docs/integrations/airflow/extractors/default-extractors
*Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?
*Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.
*Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?
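For reference, a hedged sketch of what adding the method to a custom operator can look like; the TaskMetadata return type and its fields are assumptions based on the default-extractors doc linked above, so check them against the openlineage-airflow version in use, and the dataset names are invented:
```
# Hedged sketch only: the DefaultExtractor should pick this method up automatically
# when openlineage-airflow is installed; names below are illustrative.
from airflow.models import BaseOperator
from openlineage.airflow.extractors.base import TaskMetadata
from openlineage.client.run import Dataset


class MyCustomOperator(BaseOperator):
    def execute(self, context):
        ...  # the operator's real work

    def get_openlineage_facets_on_start(self) -> TaskMetadata:
        return TaskMetadata(
            name=f"{self.dag_id}.{self.task_id}",
            inputs=[Dataset(namespace="redshift://cluster", name="some_schema.some_input_table")],
            outputs=[Dataset(namespace="redshift://cluster", name="some_schema.some_output_table")],
        )
```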
*Thread Reply:* That's the strange part. I'm not seeing anything to suggest that the method is ever getting called. I'm also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I'm not seeing that either. I'm able to verify the plugin is registered using airflow plugins and have debug level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'. This is the output of airflow plugins:
name | macros | listeners | source
==================+================================================+==============================+=================================================
OpenLineagePlugin | openlineage.airflow.macros.lineage_run_id,open | openlineage.airflow.listener | openlineage-airflow==0.22.0:
| lineage.airflow.macros.lineage_parent_id | | EntryPoint(name='OpenLineagePlugin',
| | | value='openlineage.airflow.plugin:OpenLineagePlu
| | | gin', group='airflow.plugins')
Appreciate any ideas you might have!
*Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test…
I have a question that I believe will be very easy to answer, and I think I know the answer already, but I want to confirm my understanding of extracting OpenLineage with airflow python scripts.
Extractors extract lineage from operators, so they have to be using operators, right? If someone asks if I can get lineage from their Airflow-orchestrated python scripts, and they show me their scripts but they're not importing anything starting with airflow.operators, then I can't use extractors and therefore can't get lineage. Is that accurate?
*Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)
*Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted
*Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...
*Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)
*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs)
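A hedged sketch of what such a custom extractor might look like; the class and method names are assumed from openlineage-airflow's BaseExtractor interface, the dataset names are placeholders, and in practice you would derive them from whatever the scripts know about their sources and targets:
```
# Hedged sketch only: register a class like this via the OPENLINEAGE_EXTRACTORS env var
# or your extractor configuration; everything below is illustrative, not a reference.
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyScriptExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ["PythonOperator"]  # or your custom operator's class name

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task's operator; pull table names from wherever the
        # scripts record them (the values here are hypothetical).
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="postgres://db", name="schema.input_table")],
            outputs=[Dataset(namespace="postgres://db", name="schema.output_table")],
        )
```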
*Thread Reply:* Thanks! This is very helpful. Exactly what I needed.
Hi. I was exploring OpenLineage and I want to know does OpenLineage integrate with MS-SQL (Microsoft SQL Server) ? If yes, how to generate OpenLineage events for MS-SQL Views/Tables/Queries?
*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to update list of supported databases here: https://openlineage.io/docs/integrations/about/
@channel Save the date: the next OpenLineage meetup will be in New York on April 26th! More info is coming soonâŠ
@channel Due to many TSC members being on vacation this week, this month's TSC meeting will be moved to next Thursday, April 20th. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1680801164289949
Hi everyone!
I'm so sorry for all the messages, but I have been trying to get Snowflake, OpenLineage and Marquez working for days now. Hopefully, this is my last question.
The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow. Does anyone know how to rewrite this code (e.g., with SnowflakeOperator) and extract the openlineage access history? You'd be my absolute hero!!!
*Thread Reply:* > The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow.
What's the error?
> Does anyone know how to rewrite this code (e.g., with SnowflakeOperator)
Current extractor for SnowflakeOperator extracts lineage for SQL executed in the task, in contrast to the method above with the OPENLINEAGE_ACCESS_HISTORY view
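To illustrate that second route, a hedged sketch of a SnowflakeOperator task whose SQL the Snowflake extractor can parse; the connection id, tables and schedule are made up:
```
# Hedged sketch only: with openlineage-airflow installed and configured, the bundled
# Snowflake extractor parses the task's SQL and emits lineage for it.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("snowflake_lineage_example", start_date=datetime(2023, 4, 1), schedule_interval=None) as dag:
    load_orders = SnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",
        sql="INSERT INTO analytics.orders SELECT * FROM raw.orders",
    )
```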
*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is:
*** Log file does not exist: /opt/bitnami/airflow/logs/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log
*** Fetching from: <http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
*** See more at <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key>
*** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url '<http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>'
For more information check: <https://httpstatuses.com/403>
This one doesn't make sense to me. I found a workaround for the ETL examples in the OpenLineage git by manually creating a Snowflake connector in Airflow; however, the error is still present for the extract_openlineage.py file. I noticed this file is the only one that uses snowflake.connector import connect and not airflow.providers.snowflake.operators.snowflake import SnowflakeOperator like the other ETL DAGs.
*Thread Reply:* I think it's Airflow error related to getting logs from worker
*Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake
*Thread Reply:* Ah alright! Thanks for pointing that out! Do you know how to solve it? Or do you have any recommendations on how to look for the solution?
*Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388
I would try running it in Docker TBH
*Thread Reply:* Yeah I was running Airflow in Docker but this didn't work. I'll try to use my Macbook for now because I don't think there is a solution for this in the short time. Thank you so much for the support though!!
Hi All, My team and I have been building a status page based on open lineage and I did a talk about it… keen for feedback and thoughts: https://youtu.be/nGh5_j3hXrE
*Thread Reply:* Very interesting!
Hi Peter. Looks good. I like the way you introduced the premise of, and benefits of, using OpenLineage for your project. Have you also explored other integrations in addition to dbt?
*Thread Reply:* Thanks Ernie, I'm looking at Airflow as well as GE and would like to contribute back to the project as well… we're close to getting a public preview release of our product done and then we want to help build out open lineage
[Resolved] Has anyone seen this error before where the openlineage-airflow plugin / listener fails to deepcopy the task instance? I'm using the native airflow DAG / BashOperator objects to do a basic test of static lineage tagging. More details in 🧵
*Thread Reply:* The dag is basically just: ```dag = DAG( dag_id="asana_example_dag", default_args=default_args, schedule_interval=None, )

sample_lineage_task = BashOperator( task_id="sample_lineage_task", bash_command='echo $OPENLINEAGE_URL', dag=dag, inlets=[Table(database="redshift", cluster="some_schema", name="some_input_table")], outlets=[Table(database="redshift", cluster="some_other_schema", name="some_output_table")] )```
*Thread Reply:* This is the error I'm getting, seems to be coming from this line:
[2023-04-13, 17:45:33 UTC] {logging_mixin.py:115} WARNING - Exception in thread Thread-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.7/site-packages/openlineage/airflow/listener.py", line 89, in on_running
task_instance_copy = copy.deepcopy(task_instance)
File "/opt/conda/lib/python3.7/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/opt/conda/lib/python3.7/copy.py", line 281, in _reconstruct
state = deepcopy(state, memo)
File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/dag.py", line 1941, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/opt/conda/lib/python3.7/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/opt/conda/lib/python3.7/copy.py", line 281, in _reconstruct
state = deepcopy(state, memo)
File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
y = copier(memo)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
setattr(result, k, copy.deepcopy(v, memo))
File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1000, in __setattr__
self.set_xcomargs_dependencies()
File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1107, in set_xcomargs_dependencies
XComArg.apply_upstream_relationship(self, arg)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/xcom_arg.py", line 186, in apply_upstream_relationship
op.set_upstream(ref.operator)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 241, in set_upstream
self._set_relatives(task_or_task_list, upstream=True, edge_modifier=edge_modifier)
File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 185, in _set_relatives
dags: Set["DAG"] = {task.dag for task in [**self.roots, **task_list] if task.has_dag() and task.dag}
File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 185, in <setcomp>
dags: Set["DAG"] = {task.dag for task in [**self.roots, **task_list] if task.has_dag() and task.dag}
File "/opt/conda/lib/python3.7/site-packages/airflow/models/dag.py", line 508, in __hash__
val = tuple(self.task_dict.keys())
AttributeError: 'DAG' object has no attribute 'task_dict'
*Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0
*Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy
*Thread Reply:* Just by quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.
*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours: ```import datetime

from airflow.lineage.entities import Table
from airflow.models import DAG
from airflow.operators.bash import BashOperator

default_args = { "start_date": datetime.datetime.now() }

dag = DAG( dag_id="asana_example_dag", default_args=default_args, schedule_interval=None, )

sample_lineage_task = BashOperator( task_id="sample_lineage_task", bash_command='echo $OPENLINEAGE_URL', dag=dag, inlets=[Table(database="redshift", cluster="some_schema", name="some_input_table")], outlets=[Table(database="redshift", cluster="some_other_schema", name="some_output_table")] )```
*Thread Reply:* is there any extra configuration you made possibly?
*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing xcom as task.output
looks like this was reported here and solved by this PR (not sure if this was released in 2.3.3 or later)
*Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven't had a chance to tinker with it much since yesterday.
*Thread Reply:* I ran it against 2.4 and same dag works
*Thread Reply:* Looks like a fix for that issue was rolled out in 2.3.3. I'm gonna try that for now (my company has a notoriously difficult time with airflow major version updates)
*Thread Reply:* Got this working! We just monkey patched the __deepcopy__ method of the BaseOperator for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
Hi everyone, I am facing this null pointer error:
ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
java.base/java.util.concurrent.ConcurrentHashMap.putVal(Unknown Source)
java.base/java.util.concurrent.ConcurrentHashMap.put(Unknown Source)
io.openlineage.spark.agent.JobMetricsHolder.addMetrics(JobMetricsHolder.java:40)
io.openlineage.spark.agent.OpenLineageSparkListener.onTaskEnd(OpenLineageSparkListener.java:179)
org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Could I get some help on this pls?
*Thread Reply:* This is the spark submit command:
spark-submit --py-files /usr/local/lib/common_utils.zip,/usr/local/lib/team_utils.zip,/usr/local/lib/project_utils.zip
--conf spark.executor.cores=16
--conf spark.hadoop.fs.s3a.connection.maximum=100 --conf spark.sql.shuffle.partitions=1000
--conf spark.speculation=true --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=256MB
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false --conf spark.memory.fraction=0.7 --conf spark.kubernetes.executor.label.experiment=some_label --conf spark.kubernetes.executor.label.team=team_name --conf spark.driver.memory=26112m --conf spark.kubernetes.executor.label.app.kubernetes.io/managed-by=pipeline_name --conf spark.kubernetes.executor.label.instance-type=4xlarge --conf spark.executor.instances=10 --conf spark.kubernetes.executor.label.env=prd --conf spark.kubernetes.executor.label.job-name=job_name --conf spark.kubernetes.executor.label.owner=owner --conf spark.kubernetes.executor.label.pipeline=pipeline --conf spark.kubernetes.executor.label.platform-name=platform_name --conf spark.speculation.multiplier=10 --conf spark.memory.storageFraction=0.4 --conf spark.driver.maxResultSize=26112m --conf spark.kubernetes.executor.request.cores=15000m --conf spark.speculation.interval=1s --conf spark.executor.memory=104g --conf spark.sql.catalogImplementation=hive --conf spark.eventLog.dir=file:///logs/spark-events --conf spark.hadoop.fs.s3a.threads.max=100 --conf spark.speculation.quantile=0.75 job.py
*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, null pointer exception should be always avoided and this seems to be a bug.
*Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784 Opened an issue here
Hey! Question about spark column lineage. What is the intended way to write custom code for getting column lineage? I am trying to implement CustomColumnLineageVisitor but when I try to do so I get:
io.openlineage.spark3.agent.lifecycle.plan.column.CustomColumnLineageVisitor is not public in io.openlineage.spark3.agent.lifecycle.plan.column; cannot be accessed from outside package
*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl), but they're in the same package. Thanks for bringing this up.
*Thread Reply:* This PR should resolve problem: https://github.com/OpenLineage/OpenLineage/pull/1788
*Thread Reply:* Thank you so much @Paweł Leszczyński
*Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?
*Thread Reply:* 0.22.0 was released two weeks ago, so the next schedule should be in next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.
*Thread Reply:* Hi Allison, Anyone can request a release in the #general channel. I encourage you to go this route. You'll need three +1s (there's more info about the process here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md), but I don't know of any reasons why we can't do a mid-cycle release.
*Thread Reply:* seems like we got enough +1s
*Thread Reply:* We need three committers to give a +1. I'll reach out again to see if I can recruit a third
*Thread Reply:* Yeah, sorry I forgot to mention that!
*Thread Reply:* we have it now
@channel This month's TSC meeting is tomorrow, 4/20, at 10 am PT: https://openlineage.slack.com/archives/C01CK9T7HKR/p1681167638153879
I would like to request a 0.22.1 patch release to get the fix for the issue described in this thread out before the next scheduled release.
*Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).
Here are the details about next week's OpenLineage Meetup at Astronomer's NY offices: https://openlineage.io/blog/nyc-meetup. Hope to see you there if you can make it!
Hi Team, I tried integrating openLineage with spark databricks and followed the steps as per the documentation. Installation and all looks good as the listener is enabled, but no event is getting passed to Marquez. I can see below message in log4j logs. Am I missing any configuration to be set?
Running few spark commands in databricks notebook to create events.
23/04/20 11:10:34 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 11:10:34 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
*Thread Reply:* Hi Sai,
Perhaps you could try printing OpenLineage events into the logs. This can be achieved by setting the Spark config parameter spark.openlineage.transport.type to console.
This can help you determine whether the problem is in generating the OpenLineage events themselves or in emitting them to Marquez.
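If you are testing in a plain pyspark session, the relevant properties look roughly like this; on Databricks the same keys would go into the cluster's Spark config (a sketch only, the version and the way the jar is attached may differ in your setup):
```
from pyspark.sql import SparkSession

# Minimal sketch: emit OpenLineage events to the driver logs instead of an HTTP backend.
spark = (
    SparkSession.builder
    .appName("openlineage-console-debug")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")  # or attach the jar to the cluster
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```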
*Thread Reply:* Hi @PaweĆ LeszczyĆski I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes like below:
23/04/20 10:00:15 INFO ConsoleTransport: {"eventType":"START","eventTime":"2023-04-20T10:00:15.085Z","run":{"runId":"ef4f46d1-d13a-420a-87c3-19fbf6ffa231","facets":{"spark.logicalPlan":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.22.0/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"ignoreIfExists":false},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"workorderid","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-cl
*Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection
*Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
*Thread Reply:* please note that spark.openlineage.transport.url has to be used, which is different from what you have on the screenshot attached
*Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?
*Thread Reply:* yes, please give it a try
*Thread Reply:* sure will give a try and let you know the outcome
*Thread Reply:* and set spark.openlineage.transport.type to http
*Thread Reply:* do these configs suffice or do I need to add anything else?
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.consoleTransport true
spark.openlineage.version v1
spark.openlineage.transport.type http
spark.openlineage.transport.url http://<host>:5000/api/v1/namespaces/sparkintegrationpoc/
*Thread Reply:* spark.openlineage.consoleTransport true - this one can be removed
*Thread Reply:* otherwise shall be OK
*Thread Reply:* I added these configs and run, but still same issue. Now I am not able to see the events in log file as well.
*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
Does this need any changes in the config side?
If you are trying to get into the OpenLineage Technical Steering Committee meeting, you have to RSVP to the specific event at https://www.addevent.com/calendar/pP575215 to get the password (in the invitation to add to your calendar)
Here is a nice article I found online that briefly explains about the spark catalogs just for some context: https://www.waitingforcode.com/apache-spark-sql/pluggable-catalog-api/read In reference to the V2SessionCatalog use case brought up in the meeting just now
*Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation of Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark code which is failing on V2SessionCatalog and provide more details on how you are trying to read/write data?
*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before actually. I didn't note down which pipeline the issue was occurring in, unfortunately. I'll keep checking from my end to identify the spark job that ran into this error. In the meantime, I'll also try to see for which cases deltaCatalog makes use of the V2SessionCatalog to understand this better. Thanks!
*Thread Reply:* Hi @PaweĆ LeszczyĆski
'''
CREATE TABLE IF NOT EXISTS TABLE_NAME (
SOME COLUMNS
) USING delta
PARTITIONED BY (col)
location 's3 location'
'''
A spark sql like this actually triggers the V2SessionCatalog
*Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.
*Thread Reply:* which spark & delta versions are you using?
*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is same on your side. https://github.com/OpenLineage/OpenLineage/pull/1798
*Thread Reply:* Hmm actually I am noticing this error on my local
*Thread Reply:* But on the prod job, I am seeing no such error in the logs...
*Thread Reply:* Also, I was using spark 3.1.2
*Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2
*Thread Reply:* Not too sure which delta version the prod job was using...
*Thread Reply:* I was running on Spark 3.1.2 the following command:
spark.sql(
"CREATE TABLE t_partitioned (a int, b int) USING delta "
+ "PARTITIONED BY (a) LOCATION '/tmp/delta/tbl'"
);
and I got an OpenLineage event emitted with t_partitioned as the output dataset.
*Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too
*Thread Reply:* for spark 3.1, we're using delta 1.0.0
Hi team! I have two Spark jobs chained together to process incoming data files, and I'm using openlineage-spark-0.22.0 with Marquez to visualize. I'm struggling to figure out the best way to use spark.openlineage.parentRunId and spark.openlineage.parentJobName. Should these values be unique for each Spark job? Should they be unique for each execution of the chain of both spark jobs? Or should they be the same for all runs? I'm setting them to be unique to the execution of the chain and I'm getting strange results (jobs are not showing completed, and not showing at all)
*Thread Reply:* Hi Cory, I think the definition of ParentRunFacet (https://openlineage.io/docs/spec/facets/run-facets/parent_run) contains the answer to that:
Commonly, scheduler systems like Apache Airflow will trigger processes on remote systems, such as on Apache Spark or Apache Beam jobs. Those systems might have their own OpenLineage integration and report their own job runs and dataset inputs/outputs. The ParentRunFacet allows those downstream jobs to report which jobs spawned them to preserve job hierarchy. To do that, the scheduler system should have a way to pass its own job and run id to the child job.
For example, when Airflow is used to run a Spark job, we want the Spark events to contain some information on what triggered the Spark job, and the parameters you ask about are used to pass that information from the Airflow operator to the Spark job.
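Concretely, the scheduler-side identity usually ends up as Spark conf entries on the job it launches, along these lines (a sketch - the job name and run id are placeholders the orchestrator would fill in):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000")
    # Identity of the triggering job/run, passed down by the scheduler:
    .config("spark.openlineage.parentJobName", "my_pipeline.process_files")           # placeholder
    .config("spark.openlineage.parentRunId", "00000000-0000-0000-0000-000000000000")  # run id of the parent run
    .getOrCreate()
)
```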
*Thread Reply:* Thank you for pointing me at this documentation; I did not see it previously. In my setup, the calling system is AWS Step Functions, which have no integration with OpenLineage.
So I've been essentially passing non-existing parent job information to OpenLineage. It has been useful as a data point for searches and reporting though.
Is there any harm in doing what I am doing? Is it causing the jobs that I see never completing?
*Thread Reply:* I think parentRunId should be the same for the OpenLineage START and COMPLETE events. Is it like this in your case?
*Thread Reply:* that makes sense, and based on my configuration, i would think that it would be. however, given that i am seeing incomplete jobs in Marquez, i'm wondering if somehow the parentrunID is changing. I need to investigate
@channel
We released OpenLineage 0.23.0, including:
Additions:
• SQL: parser improvements to support copy into, create stage, pivot #1742 @pawel-big-lebowski
• dbt: add support for snapshots #1787 @JDarDagran
Changes:
• Spark: change custom column lineage visitors #1788 @pawel-big-lebowski
Plus bug fixes, doc changes and more.
Thanks to all the contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.23.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.22.0...0.23.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Just curious, how long before we can see 0.23.0 over here: https://mvnrepository.com/artifact/io.openlineage/openlineage-spark
*Thread Reply:* I think @Michael Robinson has to manually promote artifacts
*Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple releases ago, the delay was about 24 hours long
*Thread Reply:* Ahh I see... Thanks!
*Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.
*Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...
*Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions
*Thread Reply:* Yup. I can see it on all maven repos now haha. I think its just the delay.
*Thread Reply:* ~24 hours ig
Hello Everyone, I am facing an issue while trying to integrate openlineage with Jupyter notebook. I am following the Docs. My containers are running and I am getting the URL for Jupyter notebook but when I try with the token in the terminal, I get invalid credentials error. Can someone please help resolve this ? Am I doing something wrong..
*Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.
Hi team, I have a question regarding the customization of transport types in OpenLineage. At my company, we are using OpenLineage to report lineage from our Spark jobs to OpenMetadata. We have created a custom OpenMetadataTransport to send lineage to the OpenMetadata APIs, conforming to the OpenMetadata format. Currently, we are using a fork of OpenLineage, as we needed to make some changes in the core to identify the new TransportConfig. We believe it would be more optimal for OpenLineage to support custom transport types, which would allow us to use the OpenLineage JAR alongside our own JAR containing the custom transport. I noticed some comments in the code suggesting that customizations are possible. However, I couldn't make it work without modifying the TransportFactory and the TransportConfig interface, as the transport types are hardcoded. Am I missing something? If custom transport types are not currently supported, we would be more than happy to contribute a PR that enables custom transports. What are your thoughts on this?
*Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory. What other transport type did you have in mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow the ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.
*Thread Reply:* Hi @PaweĆ LeszczyĆski Yes, I was planning to change TransportFactory to support custom/generic transport types using ServiceLoader pattern. After this change is done, I will be able to use our custom OpenMetadataTransport without changing anything in OpenLineage core. For now I don't have other types in mind, but after we'll add the customization support anyone will be able to create their own transport type and report the lineage to different backends
*Thread Reply:* Perhaps it's not strictly related to this particular usecase, but you may also find interesting our recent PoC about Fluentd & Openlineage integration. This will bring some cool backend features like: copy event and send it to multiple backends, send it to backends supported by fluentd output plugins etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935
*Thread Reply:* Sounds interesting. Thanks, I will look into it
Are you planning to come to the first New York OpenLineage Meetup this Wednesday at Astronomer's offices in the Flatiron District? Don't forget to RSVP so we know how much food and drink to order!
*Thread Reply:* ```import json
import os
from pendulum import datetime

from airflow import DAG
from airflow.decorators import task
from openlineage.client import OpenLineageClient
from snowflake.connector import connect

SNOWFLAKE_USER = os.getenv('SNOWFLAKE_USER')
SNOWFLAKE_PASSWORD = os.getenv('SNOWFLAKE_PASSWORD')
SNOWFLAKE_ACCOUNT = os.getenv('SNOWFLAKE_ACCOUNT')
SNOWFLAKE_WAREHOUSE = os.getenv('SNOWFLAKE_WAREHOUSE')

@task
def send_ol_events():
    client = OpenLineageClient.from_environment()
with connect(
user=SNOWFLAKE_USER,
password=SNOWFLAKE_PASSWORD,
account=SNOWFLAKE_ACCOUNT,
database='OPENLINEAGE',
schema='PUBLIC',
) as conn:
with conn.cursor() as cursor:
ol_view = 'OPENLINEAGE_ACCESS_HISTORY'
ol_event_time_tag = 'OL_LATEST_EVENT_TIME'
var_query = f'''
use warehouse {SNOWFLAKE_WAREHOUSE};
'''
cursor.execute(var_query)
var_query = f'''
set current_organization='{SNOWFLAKE_ACCOUNT}';
'''
cursor.execute(var_query)
ol_query = f'''
SELECT * FROM {ol_view}
WHERE EVENT:eventTime > system$get_tag('{ol_event_time_tag}', '{ol_view}', 'table')
ORDER BY EVENT:eventTime ASC;
'''
cursor.execute(ol_query)
ol_events = [json.loads(ol_event[0]) for ol_event in cursor.fetchall()]
for ol_event in ol_events:
client.emit(ol_event)
if len(ol_events) > 0:
latest_event_time = ol_events[-1]['eventTime']
cursor.execute(f'''
ALTER VIEW {ol_view} SET TAG {ol_event_time_tag} = '{latest_event_time}';
''')
with DAG(
    'etl_openlineage',
    start_date=datetime(2022, 4, 12),
    schedule_interval='@hourly',
    catchup=False,
    default_args={
        'owner': 'openlineage',
        'depends_on_past': False,
        'email_on_failure': False,
        'email_on_retry': False,
        'email': ['demo@openlineage.io'],
        'snowflake_conn_id': 'openlineage_snowflake'
    },
    description='Send OL events every minutes.',
    tags=["extract"],
) as dag:
    send_ol_events()```
*Thread Reply:* OpenLineageClient expects RunEvent classes and you're sending it raw json. I think at this point your options are either sending them by constructing your own HTTP client, using something like requests, or using something like https://github.com/python-attrs/cattrs to structure the json into RunEvent
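For the raw-HTTP route, a minimal requests-based sketch (assuming the rows pulled from the access history view are already complete OpenLineage event JSON objects, and Marquez listens on its usual lineage endpoint):
```
import requests

MARQUEZ_URL = "http://marquez-api:5000"  # adjust to your environment

def emit_raw_event(ol_event: dict) -> None:
    # POST the already-serialized OpenLineage event straight to the lineage API.
    resp = requests.post(f"{MARQUEZ_URL}/api/v1/lineage", json=ol_event, timeout=10)
    resp.raise_for_status()

# e.g. inside the DAG task above:
# for ol_event in ol_events:
#     emit_raw_event(ol_event)
```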
*Thread Reply:* @Jakub Dardziński suggested that you can change client.emit(ol_event) to client.transport.emit(ol_event) and it should work
*Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py
*Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn't use airflow.
*Thread Reply:* I think separating the actual getting data from the view and Airflow DAG would make sense
*Thread Reply:* Yeah - I also think that Airflow confuses the issue. You don't need Airflow to get lineage from Snowflake Access History; the only reason Airflow is in the example is a) to simulate a pipeline that can be viewed in Marquez and b) to establish a mechanism that regularly pulls and emits lineage…
but most people will already have A, and the simplest example doesn't need to accomplish B.
*Thread Reply:* just a few weeks ago I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx
*Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue
*Thread Reply:* No, it never became functional before I stopped to take on another task
Hi,
Currently, in the .env file, we are using OPENLINEAGE_URL as <http://marquez-api:5000> and got the error
requests.exceptions.HTTPError: 422 Client Error: for url: <http://marquez-api:5000/api/v1/lineage>
We have also tried using OPENLINEAGE_URL as <http://localhost:5000> and got the error
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc71edb9590>: Failed to establish a new connection: [Errno 111] Connection refused'))
I'm not sure which value to use for OPENLINEAGE_URL, so please advise on the correct one.
*Thread Reply:* Looks like the first URL is proper, but there's something wrong with entity - Marquez logs would help here.
*Thread Reply:* Airflow log does not tell us why Marquez rejected the event. Marquez logs would be more helpful
*Thread Reply:* We investigated the marquez container logs and were unable to locate the error. Could you please specify which log file belongs to Marquez when connecting Airflow or Snowflake?
Is it correct that the marquez-web log points to <http://api:5000/>?
[HPM] Proxy created: /api/v1 -> <http://api:5000/>
App listening on port 3000!
*Thread Reply:* I've the same error at the moment but can provide some additional screenshots. The Event data in Snowflake seems fine and the data is being retrieved correctly by the Airflow DAG. However, there seems to be a warning in the Marquez API logs. Hopefully we can troubleshoot this together!
*Thread Reply:* Possibly the Python part between does some weird things, like double-jsonning the data? I can imagine it being wrapped in second, unnecessary JSON object
*Thread Reply:* I guess the only way to check is to print one of those events - in the form they are sent from the Python part, not Snowflake - and see what they look like. For example, using ConsoleTransport or setting DEBUG log level in Airflow
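For example, something along these lines near the top of the DAG file should surface what the client is doing in the task logs, assuming the events actually go through OpenLineageClient.emit (the logger name follows the openlineage-python package layout):
```
import logging

# Surface the OpenLineage client's DEBUG output (including emitted events) in Airflow task logs.
logging.getLogger("openlineage.client").setLevel(logging.DEBUG)
```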
*Thread Reply:* Here is a log snippet from running with DEBUG logging on the snowflake python connector:
[20230426T17:16:55.166+0000] {cursor.py:593} DEBUG - binding: [set currentorganization='[PRIVATE]';] with input=[None], processed=[{}] [2023-04-26T17:16:55.166+0000] {cursor.py:800} INFO - query: [set currentorganization='[PRIVATE]';] [2023-04-26T17:16:55.166+0000] {connection.py:1363} DEBUG - sequence counter: 2 [2023-04-26T17:16:55.167+0000] {cursor.py:467} DEBUG - Request id: f7bca188-dda0-4fe6-8d5c-a92dc5f9c7ac [2023-04-26T17:16:55.167+0000] {cursor.py:469} DEBUG - running query [set currentorganization='[PRIVATE]';] [2023-04-26T17:16:55.168+0000] {cursor.py:476} DEBUG - isfiletransfer: True [2023-04-26T17:16:55.168+0000] {connection.py:1035} DEBUG - _cmdquery [2023-04-26T17:16:55.168+0000] {connection.py:1062} DEBUG - sql=[set currentorganization='[PRIVATE]';], sequenceid=[2], isfiletransfer=[False] [2023-04-26T17:16:55.168+0000] {network.py:1162} DEBUG - Session status for SessionPool [PRIVATE]', SessionPool 1/1 active sessions [2023-04-26T17:16:55.169+0000] {network.py:850} DEBUG - remaining request timeout: None, retry cnt: 1 [2023-04-26T17:16:55.169+0000] {network.py:828} DEBUG - Request guid: 4acea1c3-6a68-4691-9af4-22f184e0f660 [2023-04-26T17:16:55.169+0000] {network.py:1021} DEBUG - socket timeout: 60 [2023-04-26T17:16:55.259+0000] {connectionpool.py:465} DEBUG - [PRIVATE]"POST /queries/v1/query-request?requestId=f7bca188-dda0-4fe6-8d5c-a92dc5f9c7ac&requestguid=4acea1c3-6a68-4691-9af4-22f184e0f660 HTTP/1.1" 200 1118 [2023-04-26T17:16:55.261+0000] {network.py:1047} DEBUG - SUCCESS [2023-04-26T17:16:55.261+0000] {network.py:1168} DEBUG - Session status for SessionPool [PRIVATE], SessionPool 0/1 active sessions [2023-04-26T17:16:55.261+0000] {network.py:729} DEBUG - ret[code] = None, after post request [2023-04-26T17:16:55.261+0000] {network.py:751} DEBUG - Query id: 01abe3ac-0603-4df4-0042-c78307975eb2 [2023-04-26T17:16:55.262+0000] {cursor.py:807} DEBUG - sfqid: 01abe3ac-0603-4df4-0042-c78307975eb2 [2023-04-26T17:16:55.262+0000] {cursor.py:813} INFO - query execution done [2023-04-26T17:16:55.262+0000] {cursor.py:827} DEBUG - SUCCESS [2023-04-26T17:16:55.262+0000] {cursor.py:846} DEBUG - PUT OR GET: False [2023-04-26T17:16:55.263+0000] {cursor.py:941} DEBUG - Query result format: json [2023-04-26T17:16:55.263+0000] {resultbatch.py:433} DEBUG - parsing for result batch id: 1 [2023-04-26T17:16:55.263+0000] {cursor.py:956} INFO - Number of results in first chunk: 1 [2023-04-26T17:16:55.263+0000] {cursor.py:735} DEBUG - executing SQL/command [2023-04-26T17:16:55.263+0000] {cursor.py:593} DEBUG - binding: [SELECT * FROM OPENLINEAGE_ACCESS_HISTORY WHERE EVENT:eventTime > system$get_tag(...] with input=[None], processed=[{}] [2023-04-26T17:16:55.264+0000] {cursor.py:800} INFO - query: [SELECT * FROM OPENLINEAGEACCESSHISTORY WHERE EVENT:eventTime > system$gettag(...] [2023-04-26T17:16:55.264+0000] {connection.py:1363} DEBUG - sequence counter: 3 [2023-04-26T17:16:55.264+0000] {cursor.py:467} DEBUG - Request id: 21e2ab85-4995-4010-865d-df06cf5ee5b5 [2023-04-26T17:16:55.265+0000] {cursor.py:469} DEBUG - running query [SELECT ** FROM OPENLINEAGEACCESSHISTORY WHERE EVENT:eventTime > system$gettag(...] 
[2023-04-26T17:16:55.265+0000] {cursor.py:476} DEBUG - isfiletransfer: True [2023-04-26T17:16:55.265+0000] {connection.py:1035} DEBUG - cmdquery [2023-04-26T17:16:55.265+0000] {connection.py:1062} DEBUG - sql=[SELECT ** FROM OPENLINEAGEACCESSHISTORY WHERE EVENT:eventTime > system$gettag(...], sequenceid=[3], isfiletransfer=[False] [2023-04-26T17:16:55.266+0000] {network.py:1162} DEBUG - Session status for SessionPool '[PRIVATE}', SessionPool 1/1 active sessions [2023-04-26T17:16:55.267+0000] {network.py:850} DEBUG - remaining request timeout: None, retry cnt: 1 [2023-04-26T17:16:55.268+0000] {network.py:828} DEBUG - Request guid: aba82952-a5c2-4c6b-9c70-a10545b8772c [2023-04-26T17:16:55.268+0000] {network.py:1021} DEBUG - socket timeout: 60 [2023-04-26T17:17:21.844+0000] {connectionpool.py:465} DEBUG - [PRIVATE] "POST /queries/v1/query-request?requestId=21e2ab85-4995-4010-865d-df06cf5ee5b5&requestguid=aba82952-a5c2-4c6b-9c70-a10545b8772c HTTP/1.1" 200 None [2023-04-26T17:17:21.879+0000] {network.py:1047} DEBUG - SUCCESS [2023-04-26T17:17:21.881+0000] {network.py:1168} DEBUG - Session status for SessionPool '[PRIVATE}', SessionPool 0/1 active sessions [2023-04-26T17:17:21.882+0000] {network.py:729} DEBUG - ret[code] = None, after post request [2023-04-26T17:17:21.882+0000] {network.py:751} DEBUG - Query id: 01abe3ac-0603-4df4-0042-c78307975eb6 [2023-04-26T17:17:21.882+0000] {cursor.py:807} DEBUG - sfqid: 01abe3ac-0603-4df4-0042-c78307975eb6 [2023-04-26T17:17:21.882+0000] {cursor.py:813} INFO - query execution done [2023-04-26T17:17:21.883+0000] {cursor.py:827} DEBUG - SUCCESS [2023-04-26T17:17:21.883+0000] {cursor.py:846} DEBUG - PUT OR GET: False [2023-04-26T17:17:21.883+0000] {cursor.py:941} DEBUG - Query result format: arrow [2023-04-26T17:17:21.903+0000] {resultbatch.py:102} DEBUG - chunk size=256 [2023-04-26T17:17:21.920+0000] {cursor.py:956} INFO - Number of results in first chunk: 112 [2023-04-26T17:17:21.949+0000] {arrowiterator.cpython-37m-x8664-linux-gnu.so:0} DEBUG - Batches read: 1 [2023-04-26T17:17:21.950+0000] {CArrowIterator.cpp:16} DEBUG - Arrow BatchSize: 1 [2023-04-26T17:17:21.950+0000] {CArrowChunkIterator.cpp:50} DEBUG - Arrow chunk info: batchCount 1, columnCount 1, usenumpy: 0 [2023-04-26T17:17:21.950+0000] {resultset.py:232} DEBUG - result batch 1 has id: data001 [2023-04-26T17:17:21.951+0000] {resultset.py:232} DEBUG - result batch 2 has id: data002 [2023-04-26T17:17:21.951+0000] {resultset.py:232} DEBUG - result batch 3 has id: data003 [2023-04-26T17:17:21.951+0000] {resultset.py:232} DEBUG - result batch 4 has id: data010 [2023-04-26T17:17:21.951+0000] {resultset.py:232} DEBUG - result batch 5 has id: data011 [2023-04-26T17:17:21.951+0000] {resultset.py:232} DEBUG - result batch 6 has id: data012 [2023-04-26T17:17:21.952+0000] {resultset.py:232} DEBUG - result batch 7 has id: data013 [2023-04-26T17:17:21.952+0000] {resultset.py:232} DEBUG - result batch 8 has id: data020 [2023-04-26T17:17:21.952+0000] {resultset.py:232} DEBUG - result batch 9 has id: data02_1
*Thread Reply:* I don't see any standard Airflow logs here, but anyway I looked at it and debugging it would not work if you're bypassing OpenLineageClient.emit and going directly to the transport - the logging is done on the Client level: https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73
*Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit
*Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue
*Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon
The first New York OpenLineage Meetup is happening today at 5:30 pm ET at Astronomerâs offices in the Flatiron District! https://openlineage.slack.com/archives/C01CK9T7HKR/p1681931978353159
*Thread Reply:* I'll be there! I'm looking forward to seeing you all.
*Thread Reply:* We'll talk about the evolution of the spec.
delta_table = DeltaTable.forPath(spark, path)
delta_table.alias("source").merge(df.alias("update"),lookup_statement).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
If I write based on df operations like this, I notice that OL does not emit any event. May I know whether these or similar cases can be supported too?
*Thread Reply:* I've created an integration test based on your example. The Openlineage event gets sent, however it does not contain output dataset. I will look deeper into that.
*Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?
*Thread Reply:* I am seeing that input dataset is empty
*Thread Reply:* ooh, I see input datasets
*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:
```@Test
void testDeltaMergeInto() {
Dataset<Row> dataset =
spark
.createDataFrame(
ImmutableList.of(
RowFactory.create(1L, "bat"),
RowFactory.create(2L, "mouse"),
RowFactory.create(3L, "horse")
),
new StructType(
new StructField[] {
new StructField("a", LongType$.MODULE$, false, Metadata.empty()),
new StructField("b", StringType$.MODULE$, false, Metadata.empty())
}))
.repartition(1);
dataset.createOrReplaceTempView("temp");
spark.sql("CREATE TABLE t1 USING delta LOCATION '/tmp/delta/t1' AS SELECT ** FROM temp");
spark.sql("CREATE TABLE t2 USING delta LOCATION '/tmp/delta/t2' AS SELECT ** FROM temp");
DeltaTable.forName("t1")
.merge(spark.read().table("t2"),"t1.a = t2.a")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute();
verifyEvents(mockServer, "pysparkDeltaMergeIntoCompleteEvent.json");
}```
*Thread Reply:* Oh yeah my bad. I am seeing output dataset is empty.
*Thread Reply:* Checks out with your observation
*Thread Reply:* Hi @PaweĆ LeszczyĆski just curious, has a fix for this been implemented alr?
*Thread Reply:* Hi @Anirudh Shrinivason, I had some days ooo. I will look into this soon.
*Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!
*Thread Reply:* yeah, this was an amazing extended weekend
*Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823
*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.
*Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week, thanks!
*Thread Reply:* Hi @PaweĆ LeszczyĆski I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?
*Thread Reply:* sounds cool. we can surely create a new issue later on.
*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in databricks. I was wondering which java file I should use for building the jar file? Could you please help me?
*Thread Reply:* Hi I found that these merge operations have no input datasets/col lineage: ```df.write.format(file_format).mode(mode).option("mergeSchema", merge_schema).option("overwriteSchema", overwriteSchema).save(path)

df.write.format(file_format).mode(mode).option("mergeSchema", merge_schema).option("overwriteSchema", overwriteSchema)\
    .partitionBy(*partitions).save(path)

df.write.format(file_format).mode(mode).option("mergeSchema", merge_schema).option("overwriteSchema", overwriteSchema)\
    .partitionBy(*partitions).option("replaceWhere", where_clause).save(path)```
I also noticed the same issue when using the MERGE INTO command from spark sql.
Would it be possible to extend the support to these df operations too, please? Thanks!
CC: @Paweł Leszczyński
*Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.
*Thread Reply:* Sure let me create an issue! Thanks!
*Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919 Thanks!
*Thread Reply:* Hi @PaweĆ LeszczyĆski I just realised, https://github.com/OpenLineage/OpenLineage/pull/1823/files This PR doesn't actually capture column lineage for the MergeIntoCommand? It looks like there is no column lineage field in the events json.
*Thread Reply:* Hi @PaweĆ LeszczyĆski Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!
Thanks to everyone who came out to Wednesday night's meetup in New York! In addition to great pizza from Grimaldi's (thanks for the tip, @Harel Shein), we enjoyed a spirited discussion of:
• the state of observability tooling in the data space today
• the history and high-level architecture of the project courtesy of @Julien Le Dem
• exciting news of an OpenLineage Scanner being planned at MANTA courtesy of @Ernie Ostic
• updates on the project roadmap and some exciting proposals from @Julien Le Dem, @Harel Shein and @Willy Lulciuc
• an introduction to and demo of Marquez from project lead @Willy Lulciuc
• and more.
Be on the lookout for an announcement about the next meetup!
As discussed during the April TSC meeting, comments are sought from the community on a proposal to support RunEvent-less (AKA static) lineage metadata emission. This is currently a WIP. For details and to comment, please see: âą https://docs.google.com/document/d/1366bAPkk0OqKkNA4mFFt-41X0cFUQ6sOvhSWmh4Iydo/edit?usp=sharing âą https://docs.google.com/document/d/1gKJw3ITJHArTlE-Iinb4PLkm88moORR0xW7I7hKZIQA/edit?usp=sharing
Hi all. Probably I just need to study the spec further, but what is the significance of _producer vs producer in the context of where they are used? (same question also for _schemaURL vs schemaURL)? Thx!
*Thread Reply:* "producer" is an element of the run event itself - e.g., what produced the JSON packet you're studying. There is only one of these per event. You can think of it as a top-level property.
"_producer" (and "_schemaURL") are elements of a facet. They are the 2 required elements for any customized facet (though I don't agree they should be required, or at least I believe they should be able to be compatible with a blank value and a null value).
A packet sent to an API should only have one "producer" element, but can have many _producer elements in sub-objects (though, only one _producer per facet).
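Roughly, the two end up in an event like this (a trimmed-down illustration written as a Python dict; the facet chosen and its schema URL are just examples):
```
event = {
    "eventType": "COMPLETE",
    "eventTime": "2023-05-01T00:00:00Z",
    # Top-level producer: what emitted this event.
    "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.23.0/integration/spark",
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "run": {
        "runId": "...",
        "facets": {
            "parent": {
                # Per-facet metadata, required for any (custom) facet:
                "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.23.0/integration/spark",
                "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ParentRunFacet.json",
                "job": {"namespace": "my-namespace", "name": "parent-job"},
                "run": {"runId": "..."},
            }
        },
    },
}
```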
*Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......
*Thread Reply:* The facet "BaseFacet" that's used for customization has 2 required elements - _producer and _schemaURL - so I don't believe it's related to qualification.
I'm opening a vote to release OpenLineage 0.24.0, including:
• a new OpenLineage extractor for dbt Cloud
• a new interface - TransportBuilder - for creating custom transport types without modifying core components of OpenLineage
• a fix to the LogicalPlanSerializer in the Spark integration to make it operational again
• a new configuration parameter in the Spark integration for making dataset paths less verbose
• a fix to the Flink integration CI
• and more.
Three +1s from committers will authorize an immediate release.
*Thread Reply:* Thanks for voting. The release will commence within 2 days.
Does the Spark integration for OpenLineage also support ETL that uses the Apache Spark Structured Streaming framework?
*Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py
The test reads and writes data to Kafka and verifies if input/output datasets are collected.
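So a pyspark job along these lines should yield events with the Kafka topics as input/output datasets - a sketch mirroring what that test does (topic names, bootstrap servers, and the checkpoint path are placeholders):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage-structured-streaming")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)

# Read from one Kafka topic and write to another; both should show up as datasets.
source = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events_in")
    .load()
)

query = (
    source.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "events_out")
    .option("checkpointLocation", "/tmp/ol-streaming-checkpoint")
    .start()
)
query.awaitTermination()
```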
Also, does it work for pyspark jobs? (Forgive me if Spark job = pyspark; I don't have a lot of depth on how Spark works.)
*Thread Reply:* From my experience, yeah it works for pyspark
(and as a less generic question, would it work on top of this Spline agent/lineage harvester, or is it a replacement for it?)
*Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?
*Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though
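i.e. something like this, with the second class name being whatever other listener you want alongside (a made-up example class, which must be on the classpath):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.extraListeners",
        "io.openlineage.spark.agent.OpenLineageSparkListener,"
        "com.example.SomeOtherSparkListener",  # hypothetical second listener
    )
    .getOrCreate()
)
```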
*Thread Reply:* (I never said thank you for this, so, thank you!)
Hi Team,
I have configured Open lineage with databricks and it is sending events to Marquez as expected. I have a notebook which joins 3 tables and write the result data frame to an azure adls location. Each time I run the notebook manually, it creates two start events and two complete events for one run as shown in the screenshot. Is this something expected or I am missing something?
*Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.
*Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?
*Thread Reply:* Yes @PaweĆ LeszczyĆski. Each join query I run on top of delta tables have two start and two complete events. We are using below jar for openlineage.
openlineage-spark-0.22.0.jar
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828
*Thread Reply:* Hi @PaweĆ LeszczyĆski any updates on this issue?
Also, OL is not giving column level lineage for group by operations on tables. Is this expected?
*Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue
*Thread Reply:* this would be part of next release?
*Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821
*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release
*Thread Reply:* sure.. thanks @PaweĆ LeszczyĆski
*Thread Reply:* @PaweĆ LeszczyĆski I have used the latest jar (0.25.0) and still this issue persists. I see two events for same input/output lineage.
Has anyone used Open Lineage for application lineage? I'm particularly interested in how if/how you handled service boundaries like APIs and Kafka topics and what Dataset Naming (URI) you used.
*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table, as something like <mysql://db.foo.com:6543/metrics.orders>
*Thread Reply:* For an API (especially one following REST conventions), I was thinking something like method + host + port + path, or GET <https://api.service.com:433/v1/users>
*Thread Reply:* Hi Thomas, thanks for asking about this - it sounds cool! I don't know of others working on this kind of thing, but I've been developing a SQLAlchemy integration and have been experimenting with job naming - which I realize isn't exactly what you're working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.
*Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.
*Thread Reply:* Looking forward to hearing more!
*Thread Reply:* On a tangential note, does OpenLineage's column level lineage have support for (I see it can be extended but want to know if someone had to map this before):
• Properties as a path in a structure (like a JSON structure, Avro schema, protobuf, etc), maybe using something like JSON Path or XPath notation
• Fragments (when a column is a JSON blob, there is an entire sub-structure that needs to be described)
• Transformation description (how an input affects an output - is it a direct copy of the value or is it part of a formula?)
*Thread Reply:* I don't know, but I'll ping some folks who might.
*Thread Reply:* Hi @Thomas. Column-lineage support currently does not include json fields. We have included in the specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type, like IDENTITY|MASKED. However, those fields aren't filled by the Spark integration at the moment.
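For reference, in the column lineage facet those fields sit alongside inputFields for each output column, roughly like this (an illustrative fragment only - the column and table names are made up, and as noted above the Spark integration doesn't populate these two fields yet):
```
column_lineage_facet = {
    "fields": {
        "email_masked": {  # output column (made-up)
            "inputFields": [
                {"namespace": "postgres://db:5432", "name": "public.users", "field": "email"}
            ],
            "transformationDescription": "sha256 hash of the raw email",
            "transformationType": "MASKED",
        }
    }
}
```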
@channel
We released OpenLineage 0.24.0, including:
Additions:
• Support custom transport types #1795 @nataliezeller1
• Airflow: dbt Cloud integration #1418 @howardyoo
• Spark: support dataset name modification using regex #1796 @pawel-big-lebowski
Plus bug fixes and more.
Thanks to all the contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.24.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.23.0...0.24.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel This month's TSC meeting is next Thursday, May 11th, at 10:00 am PT. The tentative agenda will be on the wiki. More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.
Hello all, noticed that openlineage is not able to give column level lineage if there is a groupby operation on a spark dataframe. Has anyone else faced this issue and have any fixes or workarounds? Apache Spark 3.0.1 and Openlineage version 1 are being used. Also tried on Spark version 3.3.0
Log4j error details follow:
23/05/05 18:09:11 ERROR ColumnLevelLineageUtils: Error when invoking static method 'buildColumnLineageDatasetFacet' for Spark3 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at io.openlineage.spark.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:35) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildOutputDatasets$21(OpenLineageRunEventBuilder.java:424) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:437) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:296) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:279) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:222) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:70) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:91) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:82) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:107) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:107) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:98) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1639) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:98) Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.resultId()Lorg/apache/spark/sql/catalyst/expressions/ExprId; at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.traverseExpression(ExpressionDependencyCollector.java:79) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$traverseExpression$4(ExpressionDependencyCollector.java:74) at java.util.Iterator.forEachRemaining(Iterator.java:116) at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.traverseExpression(ExpressionDependencyCollector.java:74) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$null$2(ExpressionDependencyCollector.java:60) at java.util.LinkedList$LLSpliterator.forEachRemaining(LinkedList.java:1235) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$collect$3(ExpressionDependencyCollector.java:60) at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.collect(ExpressionDependencyCollector.java:38) at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:70) at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:40) ... 36 more
*Thread Reply:* Hi @Harshini Devathi, I think this is the same as this issue: https://github.com/OpenLineage/OpenLineage/issues/1821
*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with Databricks? The issue thread says it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?
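For anyone else checking this, a minimal PySpark snippet along these lines (table and column names are made up) is usually enough to verify whether the columnLineage facet survives an aggregate; if lineage appears without the GROUP BY but disappears with it, it is the same class of problem as the issue above.
from pyspark.sql import SparkSession

# Assumes the OpenLineage listener is already configured on the session
# (spark.extraListeners and the spark.openlineage.* settings).
spark = SparkSession.builder.appName("groupby_lineage_repro").getOrCreate()

spark.createDataFrame(
    [(1, 10), (1, 20), (2, 5)], ["product_id", "qty"]
).createOrReplaceTempView("orders")

# Aggregate query: the OpenLineage event for this write should carry a
# columnLineage facet on the output dataset if the aggregate is supported.
agg_df = spark.sql(
    "SELECT product_id, sum(qty) AS total_qty FROM orders GROUP BY product_id"
)
agg_df.write.mode("overwrite").saveAsTable("orders_by_product")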
Hello all, is there a way to get lineage in Azure Synapse Analytics with OpenLineage?
*Thread Reply:* There are a few possible issues:
- SELECT * - in which case we can't do anything for now, since we don't know the input columns.
*Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.
*Thread Reply:* sure Pawel Will share the code I used in sometime
*Thread Reply:* I tried putting a SQL query with column names in it; still, the lineage didn't show up.
2023-05-09T13:37:48.526698281Z java.lang.ClassCastException: class org.apache.spark.scheduler.ShuffleMapStage cannot be cast to class java.lang.Boolean (org.apache.spark.scheduler.ShuffleMapStage is in unnamed module of loader 'app'; java.lang.Boolean is in module java.base of loader 'bootstrap')
2023-05-09T13:37:48.526703550Z at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:87)
2023-05-09T13:37:48.526707874Z at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
2023-05-09T13:37:48.526712381Z at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
2023-05-09T13:37:48.526716848Z at scala.collection.immutable.List.forall(List.scala:91)
2023-05-09T13:37:48.526723183Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.registerJob(OpenLineageRunEventBuilder.java:181)
2023-05-09T13:37:48.526727604Z at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.setActiveJob(SparkSQLExecutionContext.java:152)
2023-05-09T13:37:48.526732292Z at java.base/java.util.Optional.ifPresent(Unknown Source)
2023-05-09T13:37:48.526736352Z at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$10(OpenLineageSparkListener.java:150)
2023-05-09T13:37:48.526740471Z at java.base/java.util.Optional.ifPresent(Unknown Source)
2023-05-09T13:37:48.526744887Z at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:147)
2023-05-09T13:37:48.526750258Z at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
2023-05-09T13:37:48.526753454Z at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
2023-05-09T13:37:48.526756235Z at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
2023-05-09T13:37:48.526759315Z at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
2023-05-09T13:37:48.526762133Z at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
2023-05-09T13:37:48.526764941Z at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
2023-05-09T13:37:48.526767739Z at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
2023-05-09T13:37:48.526776059Z at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
2023-05-09T13:37:48.526778937Z at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
2023-05-09T13:37:48.526781728Z at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
2023-05-09T13:37:48.526786986Z at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
2023-05-09T13:37:48.526789893Z at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
2023-05-09T13:37:48.526792722Z at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
2023-05-09T13:37:48.526795463Z at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Hi, noticing this error message from OL... anyone know why it's happening?
*Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?
*Thread Reply:* Some example job would also help, or logs/LogicalPlan đ
*Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1
*Thread Reply:* Hmm actually, it seems like the error is intermittent actually. I ran the same job again, but did not notice any errors this time...
*Thread Reply:* This is interesting, and it happens within a line:
job.finalStage().parents().forall(toScalaFn(stage -> stageMap.put(stage.id(), stage)));
The result of stageMap.put is Stage, and for some reason which I don't understand it tries doing unboxToBoolean. We could rewrite that to:
job.finalStage().parents().forall(toScalaFn(stage -> {
    stageMap.put(stage.id(), stage);
    return true;
}));
but it is so weird that it is intermittent and I don't get why it is happening.
*Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.
*Thread Reply:* Hi @Paweł Leszczyński, sorry for the late reply. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.
*Thread Reply:* Opened an issue and PR. Do help check if it's okay, thanks!
*Thread Reply:* please run ./gradlew spotlessApply with Java 8
Hi all, I'm new to OpenLineage (and Marquez), so I'm trying to figure out if it could be the right option for a client use case in which:
• there is a legacy custom data catalog (Mongo backend + Java API backend for an Angular frontend)
• as-is, component lineage relations are retrieved in a custom way from each component's APIs
• the customer would like to bring in a basic data lineage feature based on already published metadata that represents custom workload types (batch, streaming, interactive ones) + data access patterns (no direct relation with the datasources right now, only an abstraction layer upon them)
I'd like to exploit Marquez directly as the metastore to publish metadata about the datasource and the workload (the workload is the declaration + business logic code deployed into the customer platform) once the component is deployed (e.g. the service that exposes the specific access pattern, or the workload custom declaration), but I saw the OpenLineage spec is based on strict coupling between run, job and datasource; I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also.
Am I in the right place? Thanks anyway :)
*Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also This is not something that we support yet - there are definitely a lot of plans and preliminary work for that.
*Thread Reply:* Thanks for the response. Btw, I already took a look at the current capabilities provided by OpenLineage, so my "hidden" question is: how do I achieve what the customer wants in order to be integrated in some way with OpenLineage + Marquez? Should I choose between make or buy (among the already supported platforms) and then try to align "static" (aka declarative) lineage metadata with the OpenLineage conceptual model?
@channel This month's TSC meeting is tomorrow at 10am PT. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1683213923529529
Does anyone here have experience with vendors in this space like Atlan or Manta? I'm advocating pretty heavily for OpenLineage at my company and have a strong suspicion that the level of effort of enabling an equivalent solution from a vendor is equal to or greater than that of OL/Marquez. Curious if anyone has first-hand experience with these tools they might be willing to share?
*Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta] I'll let others answer as to their experience with us or the many other vendors that provide lineage, but I want to mention that a variety of our customers are finding it beneficial to bring code-based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds: for analyzing the lineage of their existing systems, rich parsers already exist (for everything from legacy ETL tools, reporting tools, RDBMSs, etc.), while for newer or home-grown technologies, applying OpenLineage is a viable alternative.
*Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?
*Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.
*Thread Reply:* @Brad Paskewitz that's 100% our plan: to extend the OL spec to support "run-less" events. We want to collect that static metadata for datasets and jobs outside of the context of a run through OpenLineage. Happy to get your feedback here as well: https://github.com/OpenLineage/OpenLineage/pull/1839
*Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.
*Thread Reply:* Thanks all! These comments are really informative. It's exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use cases.
*Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!
Hello all, I would like to request a new release of OpenLineage, as the new code base seems to have some issues fixed. I need these fixes for my project.
*Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.
*Thread Reply:* A release on Monday is totally fine @Michael Robinson.
*Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi
*Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response
@channel
We released OpenLineage 0.25.0, including:
Additions:
• Spark: merge into query support #1823 @pawel-big-lebowski
Fixes:
• Spark: fix JDBC query handling #1808 @nataliezeller1
• Spark: filter Delta adaptive plan events #1830 @pawel-big-lebowski
• Spark: fix Java class cast exception #1844 @Anirudh181001
• Flink: include missing fields of OpenLineage events #1840 @pawel-big-lebowski
Plus doc changes and more.
Thanks to all the contributors!
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.25.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.24.0...0.25.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel If you're planning on being in San Francisco at the end of June (perhaps for this year's Data+AI Summit), please stop by Astronomer's offices on California Street on 6/27 for the first SF OpenLineage Meetup. We'll be discussing spec changes planned for OpenLineage v1.0.0, progress on Airflow AIP 53, and more. Plus, dinner will be provided! For more info and to sign up, check out the OL blog. Join us!
Hi, I've been noticing this error that is intermittently popping up in some of the spark jobs:
AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
Increasing the spark.scheduler.listenerbus.eventqueue.size spark config did not help either.
Any ideas on how to mitigate this issue? Seeing this in spark 3.1.2 btw
*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to the console? This would let us confirm whether the issue is related to event generation or to emitting the event and waiting for the backend to respond.
*Thread Reply:* Ahh okay sure. Let me see if I can do that
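For reference, a sketch of what sending events to the console can look like with the Spark integration; the exact config keys may differ between OL versions, so treat the settings below as assumptions to verify against the docs for your version.
from pyspark.sql import SparkSession

# Sketch: emit OpenLineage events to the driver logs instead of an HTTP backend.
# Config key names are assumptions and may vary by OpenLineage version.
spark = (
    SparkSession.builder.master("local")
    .appName("ol_console_debug")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.25.0")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)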
Hi Team
We are seeing an issue with an OL-configured cluster where a delta table merge is failing with the below error. It runs fine on other clusters where OL is not configured. I ran it multiple times assuming it was an intermittent issue with memory, but it keeps failing with the same error. Attached the code for reference. We are using the latest release (0.25.0).
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
@Paweł Leszczyński @Michael Robinson
*Thread Reply:* Hi @Paweł Leszczyński
Thanks for fixing the issue; with the new release, merge is working. But I could not see any input and output datasets for this. Let me know if you need any further details to look into this.
},
"job": {
"namespace": "openlineage_poc",
"name": "spark_ol_integration_execute_merge_into_command_edge",
"facets": {}
},
"inputs": [],
"outputs": [],
*Thread Reply:* Oh man, it's just that vanilla Spark differs from the one available on the Databricks platform. Our integration tests do verify behaviour on vanilla Spark, which still leaves a possibility for inconsistency. Will need to get back to it at some point.
*Thread Reply:* Hi @Paweł Leszczyński
Did you get a chance to look into this issue?
*Thread Reply:* Hi Sai, I am going back to Spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal Delta operations that unnecessarily trigger events.
*Thread Reply:* This may be related to the issue you created.
*Thread Reply:* I do have planned creating an integration test for Databricks, which will be helpful to tackle the issues you raised.
*Thread Reply:* So yes, I am looking at the Spark integration.
*Thread Reply:* Thanks much, Pawel. I am looking more into the merge part as a first priority, as we use it frequently.
*Thread Reply:* I know, this is important.
*Thread Reply:* It just needs some more time.
*Thread Reply:* thank you for your patience and being so proactive on those issues.
*Thread Reply:* No problem. Please do keep us posted with updates.
Our recent OpenLineage release (0.25.0) proved there are many users that use OpenLineage on Databricks, which is incredible. I am super happy to know that, although we realised it as a side effect of a bug. Sorry for that.
I would like to opt for a new release which contains PR #1858 and should unblock Databricks users.
*Thread Reply:* The release request has been approved and will be initiated shortly.
Actually, I noticed a few other stack overflow errors on 0.25.0. Let me raise an issue. Could we cut a release once these bugs are fixed too, please?
*Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on the top of main?
*Thread Reply:* Hi, I haven't managed to try with the main branch. But if it's the same error, then all's good! If the error resurfaces, we can look into it.
Hi All,
We are in the POC phase of an OpenLineage integration with our core dbt; can anyone point me to a document to start with?
*Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt
*Thread Reply:* Hi @Paweł Leszczyński, thanks for the reply. I tried the same but am facing the below issue:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url:
Looks like I need to start the service.
*Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez. More about it: https://marquezproject.ai/quickstart
*Thread Reply:* @Lovenish Goyal how are you running dbt core currently?
*Thread Reply:* Trying to, but facing an issue while running Marquez, @Jakub Dardziński
*Thread Reply:* @Harel Shein we have created a custom docker image of dbt + Airflow and are running it on an EC2 instance.
*Thread Reply:* For running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There's also built-in support for collecting lineage if you have the airflow-openlineage provider installed. https://astronomer.github.io/astronomer-cosmos/#quickstart
*Thread Reply:* RE issues running Marquez, can you share what those are? I'm guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?
*Thread Reply:* @Harel Shein I've already helped with running Marquez
@Paweł Leszczyński We are facing the following issue with Azure Databricks. When we use aggregate functions in Databricks notebooks, OpenLineage is not able to provide column-level lineage. I understand it's an existing issue. Can you please let me know in which release this issue will be fixed? It is one of the most needed features for us to implement OpenLineage in our current project. Kindly let me know.
*Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.
*Thread Reply:* Please also be aware that it is extremely helpful to investigate issues on your own before creating them.
Our integration traverses Spark's logical plans and extracts lineage events from plan nodes that it understands. Some plan nodes are not supported yet and, from my experience, when working on an issue, 80% of the time is spent on reproducing the scenario.
So, if you are able to provide a minimal amount of Spark code that reproduces an issue, this can be extremely helpful and significantly speed up resolution time.
*Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.
Provided sample code with and without using aggregate functions and their respective lineage events for reference.
Please find the code without using the aggregate function:
final_df = spark.sql("""
select productid
,OrderQty as TotalOrderQty
,ReceivedQty as TotalReceivedQty
,StockedQty as TotalStockedQty
,RejectedQty as TotalRejectedQty
from openlineage_poc.purchaseorder
--group by productid
order by productid""")
final_df.write.mode("overwrite").saveAsTable("openlineage_poc.productordertest1")
Please find the OpenLineage events for the input and output datasets. We could find the column lineage in this.
"inputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "PurchaseOrderID", "type": "integer" }, { "name": "PurchaseOrderDetailID", "type": "integer" }, { "name": "DueDate", "type": "timestamp" }, { "name": "OrderQty", "type": "short" }, { "name": "ProductID", "type": "integer" }, { "name": "UnitPrice", "type": "decimal(19,4)" }, { "name": "LineTotal", "type": "decimal(19,4)" }, { "name": "ReceivedQty", "type": "decimal(8,2)" }, { "name": "RejectedQty", "type": "decimal(8,2)" }, { "name": "StockedQty", "type": "decimal(9,2)" }, { "name": "RevisionNumber", "type": "integer" }, { "name": "Status", "type": "integer" }, { "name": "EmployeeID", "type": "integer" }, { "name": "NationalIDNumber", "type": "string" }, { "name": "JobTitle", "type": "string" }, { "name": "Gender", "type": "string" }, { "name": "MaritalStatus", "type": "string" }, { "name": "VendorID", "type": "integer" }, { "name": "ShipMethodID", "type": "integer" }, { "name": "ShipMethodName", "type": "string" }, { "name": "ShipMethodrowguid", "type": "string" }, { "name": "OrderDate", "type": "timestamp" }, { "name": "ShipDate", "type": "timestamp" }, { "name": "SubTotal", "type": "decimal(19,4)" }, { "name": "TaxAmt", "type": "decimal(19,4)" }, { "name": "Freight", "type": "decimal(19,4)" }, { "name": "TotalDue", "type": "decimal(19,4)" } ] }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc/gold", "name": "openlineagepoc.purchaseorder", "type": "TABLE" } ] } }, "inputFacets": {} } ], "outputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/productordertest1", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "productid", "type": "integer" }, { "name": "TotalOrderQty", "type": "short" }, { "name": "TotalReceivedQty", "type": "decimal(8,2)" }, { "name": "TotalStockedQty", "type": "decimal(9,2)" }, { "name": "TotalRejectedQty", "type": "decimal(8,2)" } ] }, "storage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet", "storageLayer": "unity", "fileFormat": "parquet" }, "columnLineage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": 
"https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet", "fields": { "productid": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "ProductID" } ] }, "TotalOrderQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "OrderQty" } ] }, "TotalReceivedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "ReceivedQty" } ] }, "TotalStockedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "StockedQty" } ] }, "TotalRejectedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "RejectedQty" } ] } } }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc", "name": "openlineagepoc.productordertest1", "type": "TABLE" } ] }, "lifecycleStateChange": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet", "lifecycleStateChange": "OVERWRITE" } }, "outputFacets": {} } ]
*Thread Reply:* 2. Please find the code using aggregate function:
final_df=spark.sql("""
select productid
,sum(OrderQty) as TotalOrderQty
,sum(ReceivedQty) as TotalReceivedQty
,sum(StockedQty) as TotalStockedQty
,sum(RejectedQty) as TotalRejectedQty
from openlineage_poc.purchaseorder
group by productid
order by productid""")
final_df.write.mode("overwrite").saveAsTable("openlineage_poc.productordertest2")
Please find the OpenLineage events for the input and output datasets. We couldn't find the column lineage in the output section. Please find the sample:
"inputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "PurchaseOrderID", "type": "integer" }, { "name": "PurchaseOrderDetailID", "type": "integer" }, { "name": "DueDate", "type": "timestamp" }, { "name": "OrderQty", "type": "short" }, { "name": "ProductID", "type": "integer" }, { "name": "UnitPrice", "type": "decimal(19,4)" }, { "name": "LineTotal", "type": "decimal(19,4)" }, { "name": "ReceivedQty", "type": "decimal(8,2)" }, { "name": "RejectedQty", "type": "decimal(8,2)" }, { "name": "StockedQty", "type": "decimal(9,2)" }, { "name": "RevisionNumber", "type": "integer" }, { "name": "Status", "type": "integer" }, { "name": "EmployeeID", "type": "integer" }, { "name": "NationalIDNumber", "type": "string" }, { "name": "JobTitle", "type": "string" }, { "name": "Gender", "type": "string" }, { "name": "MaritalStatus", "type": "string" }, { "name": "VendorID", "type": "integer" }, { "name": "ShipMethodID", "type": "integer" }, { "name": "ShipMethodName", "type": "string" }, { "name": "ShipMethodrowguid", "type": "string" }, { "name": "OrderDate", "type": "timestamp" }, { "name": "ShipDate", "type": "timestamp" }, { "name": "SubTotal", "type": "decimal(19,4)" }, { "name": "TaxAmt", "type": "decimal(19,4)" }, { "name": "Freight", "type": "decimal(19,4)" }, { "name": "TotalDue", "type": "decimal(19,4)" } ] }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc/gold", "name": "openlineagepoc.purchaseorder", "type": "TABLE" } ] } }, "inputFacets": {} } ], "outputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/productordertest2", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "productid", "type": "integer" }, { "name": "TotalOrderQty", "type": "long" }, { "name": "TotalReceivedQty", "type": "decimal(18,2)" }, { "name": "TotalStockedQty", "type": "decimal(19,2)" }, { "name": "TotalRejectedQty", "type": "decimal(18,2)" } ] }, "storage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet", "storageLayer": "unity", "fileFormat": "parquet" }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": 
"https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc", "name": "openlineagepoc.productordertest2", "type": "TABLE" } ] }, "lifecycleStateChange": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet", "lifecycleStateChange": "OVERWRITE" } }, "outputFacets": {} } ]
*Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861
*Thread Reply:* Thanks for considering the request and looking into it
@channel
We released OpenLineage 0.26.0, including:
Additions:
• Proxy: Fluentd proxy support (experimental) #1757 @pawel-big-lebowski
Changes:
• Python client: use Hatchling over setuptools to orchestrate Python env setup #1856 @gaborbernat
Fixes:
• Spark: fix logicalPlan serialization issue on Databricks #1858 @pawel-big-lebowski
Plus an additional fix, doc changes and more.
Thanks to all the contributors, including new contributor @gaborbernat!
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.26.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.25.0...0.26.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hi Team, can someone please address https://github.com/OpenLineage/OpenLineage/issues/1866?
*Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.
*Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.
*Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.
*Thread Reply:* Hi @Paweł Leszczyński - I replied with the necessary details in the ticket. Please let me know if you need any more details.
*Thread Reply:* Hi @Paweł Leszczyński - any further updates on the issue?
*Thread Reply:* Hi @Bramha Aelem, I was out of office for a few days. Will get back to this soon. Thanks for the update.
*Thread Reply:* Hi @Paweł Leszczyński - Thanks for your reply. Will wait for your response to proceed further on the issue.
*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.
*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.
*Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into the query I posted? Can you please share any thoughts on my observation/query?
Hello everyone, I was trying to integrate OpenLineage with Jupyter Notebooks. I followed the docs, but when I run the sample notebook I am getting an error:
23/05/19 07:39:08 ERROR EventEmitter: Could not emit lineage w/ exception
Can someone please help me understand why I am getting this error and how to resolve it?
*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting OpenLineage events.
*Thread Reply:* what do you want to do with the events?
*Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it builds lineage.
As for the transport config, I am using the code from the documentation to set up OL: https://openlineage.io/docs/integrations/spark/quickstart_local
Apart from that, I don't have anything else in my notebook.
*Thread Reply:* OK, I am wondering if what you experience isn't similar to issue #1860. Could you try OpenLineage 0.23.0 to see if you get the same error?
https://github.com/OpenLineage/OpenLineage/issues/1860
*Thread Reply:* I tried with 0.23.0 still getting the same error
*Thread Reply:* @Paweł Leszczyński is there any other way I can try to set it up? The issue still persists.
*Thread Reply:* Hmm, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.
Hello Team! Part of my master's thesis case study was about data lineage in data mesh and how open-source initiatives such as OpenLineage and Marquez can realize this. Can you recommend some material that can support the writing part of my thesis (more context: I tried to extract lineage events from Snowflake through Airflow and used Docker Compose on EC2 to connect Airflow and the Marquez webserver)? We will divide the thesis into a few academic papers to make the content more digestible and hopefully publish one of them soon!
*Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you've already started checking out the docs, but there are many other resources on the website as well (on the blog and resources pages in particular). And don't skip the YouTube channel, where we've recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!
*Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. I'll keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is.
*Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!
Hi Tom. Good luck. Sounds like a great case study. You might want to compare and contrast various kinds of lineage solutions, all of which complement each other while having their own pros and cons (code-based lineage via parsing, data-similarity lineage, run-time lineage reporting, etc.), and then focus on open source and OpenLineage with Marquez in particular.
*Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!
@channel For an easily digestible recap of recent events, communications and releases in the community, please sign up for our new monthly newsletter! Look for it in your inbox soon.
Looking here, https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json#L64 shows that the schemaURL must be set, but the examples in https://openlineage.io/getting-started#step-1-start-a-run do not contain it. Is this a bug or expected?
*Thread Reply:* yeah, it's a bug
*Thread Reply:* So it's optional then? Or is it a bug in the example?
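For comparison, a START event that matches the spec would carry schemaURL at the top level; a rough sketch as a Python dict follows (runId, names and the producer URL are placeholders, and the spec version in the schemaURL depends on the release you validate against).
import json
from datetime import datetime, timezone
from uuid import uuid4

# Minimal sketch of a RunEvent including the schemaURL field; values are placeholders.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/my-producer",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}
print(json.dumps(event, indent=2))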
I noticed that DataQualityAssertionsDatasetFacet inherits from InputDatasetFacet (https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json), though I think it should inherit from DatasetFacet like all the others.
@channel Two years ago last Saturday, we released the first version of OpenLineage, a test release of the Python client. So it seemed like an appropriate time to share our first annual ecosystem survey, which is both a milestone in the project's growth and an important effort to set our course. This survey has been designed to help us learn more about who is using OpenLineage, what your lineage needs are, and what new tools you hope the project will support. Thank you in advance for taking the time to share your opinions and vision for the project! (Please note: the survey might seem longer than it actually is due to the large number of optional questions. Not all questions apply to all use cases.)
OpenLineage Spark integration: our Spark workloads on Spark 2.4 are correctly setting .config("spark.sql.catalogImplementation", "hive"), however SQL queries for CREATE/INSERT INTO don't recognize the datasets as "Hive". As per https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/supported-commands.md, USING HIVE is needed for appropriate parsing. Why is that the case? Why can't the HQL format for CREATE/INSERT be supported?
*Thread Reply:* @Michael Collado wondering if you could shed some light here
*Thread Reply:* Can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable.
*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. Example:
== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand <s3a://uchmsdev03/default/sharanyaOutputTable>, false, [id#89], Parquet, [serialization.format=1, mergeSchema=false, partitionOverwriteMode=dynamic], Append, CatalogTable(
Database: default
Table: sharanyaoutputtable
Owner: 2700940971
Created Time: Thu Jun 09 11:13:35 PDT 2022
Last Access: UNKNOWN
Created By: Spark 3.2.0
Type: EXTERNAL
Provider: hive
Table Properties: [transient_lastDdlTime=1654798415]
Location: <s3a://uchmsdev03/default/sharanyaOutputTable>
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`id`]
Schema: root
|-- displayName: string (nullable = true)
|-- serialnum: string (nullable = true)
|-- osversion: string (nullable = true)
|-- productfamily: string (nullable = true)
|-- productmodel: string (nullable = true)
|-- id: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@5fe23214, [displayName, serialnum, osversion, productfamily, productmodel, id]
+- Union false, false
:- Relation default.tablea[displayName#84,serialnum#85,osversion#86,productfamily#87,productmodel#88,id#89] parquet
+- Relation default.tableb[displayName#90,serialnum#91,osversion#92,productfamily#93,productmodel#94,id#95] parquet
*Thread Reply:* Using Spark 3.2, and this is the query:
spark.sql(s"INSERT INTO default.sharanyaOutput select * from (SELECT * from default.tableA union all " +
  s"select * from default.tableB)")
Is there any example of how sourceCodeLocation / git info can be used from a spark job? What do we need to set to be able to see that as part of metadata?
*Thread Reply:* I think we can't really get it from Spark context, as Spark jobs are submitted in compiled, jar form, instead of plain text like for example Airflow dags.
*Thread Reply:* How about Jupyter Notebook based spark job?
*Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more
@channel Deprecation notice: support for Airflow 2.1 will end in about two weeks, when it will be removed from testing. The exact date will be announced as we get closer to it; this is just a heads up. After that date, use 2.1 at your own risk! (Note: the next release, 0.27.0, will still support 2.1.)
For the OpenLineageSparkListener, is there a way to configure it to send packets locally, e.g. save to a file? (instead of pushing to a URL destination)
*Thread Reply:* We developed a FileTransport class in order to save our metrics locally in a JSON file, if you are interested.
*Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?
*Thread Reply:* Yes, it saves all the JSON information, inputs/outputs included.
*Thread Reply:* Yes! then I am very interested. Is there guidance on how to use the FileTransport class?
*Thread Reply:* @alexandre bergere it would be a pretty useful contribution if you can submit it
*Thread Reply:* We are using it in a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)
*Thread Reply:* It would be great to have. I had it in mind to implement as an enabler for Databricks integration tests. Great to hear that!
*Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 @Maciej Obuchowski could you tell me how to update the documentation once it's approved, please?
*Thread Reply:* @alexandre bergere we have a separate repo for the website + docs: https://github.com/OpenLineage/docs
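Once that PR lands, the Spark-side configuration would presumably look something like the sketch below; the transport type name and the location key are guesses modeled on how the existing HTTP transport is configured, so check the merged PR and docs for the actual names.
from pyspark.sql import SparkSession

# Hypothetical sketch: the "file" transport config keys are guesses pending the merged PR.
spark = (
    SparkSession.builder.master("local")
    .appName("ol_file_output")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "file")
    .config("spark.openlineage.transport.location", "/tmp/openlineage_events.json")
    .getOrCreate()
)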
Hi Team - when we run a Databricks job, a lot of dbfs path namespaces are getting created. Can someone please let us know how to override the symlink namespaces and link them with the Spark app name or the OpenLineage namespace in the Marquez UI?
Hello,
I am looking to connect the common data model in the Marquez postgres database to the Azure Purview (which uses Apache Atlas APIs) lineage endpoint. Does anyone have a how-to on this or can point me to some useful links?
Thanks in advance.
*Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview
*Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/
*Thread Reply:* Thanks @Michael Robinson
Are there any constraints on facets? For example, is it reasonable to expect that a single job will have a single parent? The schema hints at this by making the parent a single entry, but then one can send different parents for the START and COMPLETE events?
*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.
*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.
This has advantages: for example, a job that modifies a dataset it's also reading from can get the particular version of the dataset it's reading from and attach it on START; it would not work if you tried to do it on COMPLETE, as the dataset would have changed by then.
Similarly, if the job is creating a dataset, we cannot get additional metadata on it up front, so we can attach that information only on COMPLETE.
There are also cases where we want facets to be cumulative. The reason for this is streaming jobs. For example, with Apache Flink, we could emit metadata on each checkpoint (or every N checkpoints) that shows how the job is progressing.
Generally consumers should be agnostic to that, but we don't want to overspecify what consumers should do, as people might want to use OL data in different ways, or even ignore some data we're sending.
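To make the cumulative-facet idea concrete, here is a rough sketch with the Python client (the backend URL, namespaces and dataset names are placeholders): the START event carries the input dataset known up front, the COMPLETE event for the same runId carries the output, and a consumer is expected to merge both.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend
producer = "https://example.com/my-producer"
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my-job")

# START: attach what is known up front, e.g. the input dataset being read.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="my-namespace", name="source_table")],
))

# COMPLETE: attach what is only known at the end, e.g. the output dataset.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    outputs=[Dataset(namespace="my-namespace", name="target_table")],
))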
Any reason why the lifecycle state change facet is not just on the output, but is also allowed on the inputs? https://openlineage.io/docs/spec/facets/dataset-facets/lifecycle_state_change I can't see how it would be interpreted for an input.
*Thread Reply:* I think it should be output-only, yes.
*Thread Reply:* @Paweł Leszczyński what do you think?
*Thread Reply:* yes, should be output only I think
*Thread Reply:* Should we move it over then?
*Thread Reply:* Under Output Dataset Facets, that is.
@channel The first issue of OpenLineage News is now available. To get it directly in your inbox when it's published, become a subscriber.
*Thread Reply:* Correction: Julien and Willy's talk at Data+AI Summit will take place on June 28
Hello all, I'm opening a vote to release 0.27.0, featuring:
• Spark: fixed column lineage from Databricks in the case of aggregate queries
• Python client: configurable job-name filtering
• Airflow: fixed urllib.parse.urlparse in the case of [] values
Three +1s from committers will authorize an immediate release.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.
@channel This month's TSC meeting is next Thursday, June 8th, at 10:00 am PT. On the tentative agenda: announcements, meetup updates, recent releases, static lineage progress, and open discussion. More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.
@channel
We released OpenLineage 0.27.1, including:
Additions:
• Python client: add emission filtering mechanism and exact, regex filters #1878 @mobuchowski
Fixes:
• Spark: fix column lineage for aggregate queries on Databricks #1867 @pawel-big-lebowski
• Airflow: fix unquoted [ and ] in Snowflake URIs #1883 @JDarDagran
Plus a CI fix and a proposal.
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.27.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.26.0...0.27.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Looking for a reviewer for: https://github.com/OpenLineage/OpenLineage/pull/1892
*Thread Reply:* @Bernat Gabor thanks for the PR!
Hey, I request release 0.27.2 to fix potential breaking change in Python client in 0.27.1: https://github.com/OpenLineage/OpenLineage/pull/1908
*Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.
@channel
We released OpenLineage 0.27.2, including:
Fixes:
• Python client: deprecate client.from_environment, do not skip loading config #1908 @Maciej Obuchowski
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.27.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.27.1...0.27.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Found a major bug in the python client - https://github.com/OpenLineage/OpenLineage/pull/1917, if someone can review
And also https://github.com/OpenLineage/OpenLineage/pull/1913, which fixes the type information not being packaged
@channel This month's TSC meeting is tomorrow, and all are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1685725998982879
Hi team,
I want lineage for my data at the table and column level. I am using a Jupyter notebook and Spark code:
spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.12.0')
    .config('spark.openlineage.host', 'http://marquez-api:5000')
    .config('spark.openlineage.namespace', 'spark_integration')
    .getOrCreate())
I used this and then opened localhost:3000 for Marquez.
I can see my job there, but when I click on the job, where it is supposed to show lineage, it's just an empty screen.
*Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it's related to this issue: https://github.com/MarquezProject/marquez/issues/2410
*Thread Reply:* Seems like more of a Marquez client-side issue than something with OL
*Thread Reply:* Sorry I mean in the dev console of your web browser
*Thread Reply:* Seems like it's coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL, so maybe the schema is incompatible with the version Marquez is expecting.
*Thread Reply:* from pyspark.sql import SparkSession
spark = (SparkSession.builder.master('local')
.appName('sample_spark')
.config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
.config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.12.0')
.config('spark.openlineage.host', 'http://marquez-api:5000')
.config('spark.openlineage.namespace', 'spark_integration')
.getOrCreate())
spark.sparkContext.setLogLevel("INFO")
spark.createDataFrame([
{'a': 1, 'b': 2},
{'a': 3, 'b': 4}
]).write.mode("overwrite").saveAsTable("temp_table8")
*Thread Reply:* This is my only code; I haven't done anything apart from this.
*Thread Reply:* I would try a more recent version of OL. Looks like you're using 0.12.0 and I think the project is currently on 0.27.x.
*Thread Reply:* So I should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?
*Thread Reply:* It executed well, but I'm unable to see it in Marquez.
*Thread Reply:* I am actually doing a POC on OpenLineage to find table and column level lineage for my team at Amazon. If this goes through, the team could use openlineage to track data lineage on a larger scale..
*Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?
*Thread Reply:* Yes, I did that as well.
*Thread Reply:* The error was present only once you clicked on any of the jobs in Marquez; since my job isn't showing up, I can't check for the error itself.
*Thread Reply:* docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1
I used this to rebuild marquez-web.
*Thread Reply:* That's odd. Sorry, that's probably the most I can help; I'm kinda new to OL/Marquez as well.
*Thread Reply:* No problem. Can you refer me to someone who would know, so that I can ask them?
*Thread Reply:* Actually, looking at it now, I think you're using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0; that's what I'm using.
*Thread Reply:* Other than that, I would ask in the Marquez slack channel or raise an issue on GitHub in that project. It seems like more of an issue with Marquez, since at least some data is rendering in the UI initially.
*Thread Reply:* Nope, that version also didn't help.
*Thread Reply:* can you share their slack link?
*Thread Reply:* that link is no longer active
*Thread Reply:* Hello @Rachana Gandhi, could you point to the doc where you found the example .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.12.0")? We should update it to have the latest version instead.
*Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/
*Thread Reply:* https://openlineage.io/docs/guides/spark
Also, the docker compose here has an earlier version of Marquez.
*Thread Reply:* Facing the same issue with my initial POC. Did we get any solution for this?
Approve a new release, please!
*Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md
*Thread Reply:* Because the python client is broken as is today without a new release
*Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.
@channel
We released OpenLineage 0.28.0, including:
Added
• dbt: add Databricks compatibility #1829 @Ines70
Fixed
• Fix type-checked marker and packaging #1913 @gaborbernat
• Python client: add schemaURL to run event #1917 @gaborbernat
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.28.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.27.2...0.28.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel Meetup announcement: there's another meetup happening soon! This one will be an evening event on 6/22 in New York at Collibra's HQ. For details and to sign up, please join the meetup group: https://www.meetup.com/data-lineage-meetup/events/294065396/. Thanks to @Sheeri Cabral (Collibra) for cohosting and providing a space.
Hi, just curious, does openlineage have a log4j integration?
*Thread Reply:* Do you mean to just log events to logging backend?
*Thread Reply:* Hmm, more like having a separate logging config for sending all the logs to a backend.
*Thread Reply:* Not the events itself
*Thread Reply:* @Anirudh Shrinivason with Spark integration?
*Thread Reply:* It uses slf4j so you should be able to set up your log4j logger
*Thread Reply:* Yeah, with the Spark integration. Ahh I see. Okay, sure, thanks!
*Thread Reply:* ~Hi @Maciej Obuchowski, may I know what class path I should use for setting up log4j if I want to set it up for OL-related logs? Is there some guide or runbook for setting up log4j with OL? Thanks!~ Nvm, found it!
Hello all, we are just starting to use Marquez as part of our POC. We are following the getting started guide at https://openlineage.io/getting-started/ to set up the environment on an AWS EC2 instance. When we run ./docker/up.sh, it does not bring up the marquez-web container. Also, we are not able to access the Admin UI at ports 5000 and 5001.
Docker version: 24.0.2 Docker compose version: 2.18.1 OS: Ubuntu_20.04
Can someone please let me know what I am missing? Note: I had to modify the docker compose command in up.sh per docker compose V2.
Also, we are seeing the following log when our load balancer is checking for health:
WARN [2023-06-13 15:35:31,040] marquez.logging.LoggingMdcFilter: status: 404 172.30.1.206 - - [13/Jun/2023:15:35:42 +0000] "GET / HTTP/1.1" 200 535 "-" "ELB-HealthChecker/2.0" 1 172.30.1.206 - - [13/Jun/2023:15:35:42 +0000] "GET / HTTP/1.1" 404 43 "-" "ELB-HealthChecker/2.0" 2 WARN [2023-06-13 15:35:42,866] marquez.logging.LoggingMdcFilter: status: 404
*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/openlineage-spark using the docker image available to help Vamshi and me, or provide any pointers? Thank you.
*Thread Reply:* If you're working on a Mac, you can have an issue related to port 5000. The instructions here https://github.com/MarquezProject/marquez#quickstart provide a workaround for that: ./docker/up.sh --api-port 9000
*Thread Reply:* @Paweł Leszczyński, thank you. We are using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or admin interface. We have run out of ideas of what else to try differently to get this setup up and running.
*Thread Reply:* @Michael Robinson Can you please help us here?
*Thread Reply:* @Vamshi krishna I'm sorry you're still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.
*Thread Reply:* @Michael Robinson, thank you. Vamshi and I will share the errors that we are running into shortly.
*Thread Reply:* We are following https://openlineage.io/getting-started/ guide and trying to set up Marquez on a ubuntu ec2 instance. Following are versions of docker, docker compose and ubuntu
*Thread Reply:* since I am getting timeouts, I thought it might be an issue with proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker and added my outbound proxy and tried
*Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please see the above steps and let us know what we are missing/doing wrong? I appreciate your help and time.
*Thread Reply:* The latest errors look to me like they're being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.
*Thread Reply:* Yes we are using default ports: API_PORT=5000 API_ADMIN_PORT=5001 WEB_PORT=3000 TAG=0.35.0
*Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web
*Thread Reply:* I would try running ./docker/up.sh --api-port 9000
(see Pawel's message above for more context.)
*Thread Reply:* Still no luck. Seeing same errors.
2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors
*Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! java.net.UnknownHostException: postgres
marquez-api | ! at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
marquez-api | ! at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
marquez-api | ! at java.base/java.net.Socket.connect(Socket.java:633)
marquez-api | ! at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
marquez-api | ! at org.postgresql.core.PGStream.<init>(PGStream.java:98)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
marquez-api | ! ... 26 common frames omitted
marquez-api | ! Causing: org.postgresql.util.PSQLException: The connection attempt failed.
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
marquez-api | ! at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:253)
marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:434)
marquez-api | ! at org.postgresql.Driver.connect(Driver.java:291)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:768)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:696)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:495)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:153)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.<init>(JdbcConnectionFactory.java:75)
marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147)
marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190)
marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:73)
marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:27)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:105)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:48)
marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67)
marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78)
marquez-api | ! at io.dropwizard.Application.run(Application.java:94)
marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:60)
marquez-api | INFO [2023-06-23 14:53:42,274] marquez.MarquezApp: Stopping app...
*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error, java.net.UnknownHostException: postgres, may be just a result of the container being down. Could you verify if all the containers are up and running and, if not, what's the error? Are you able to test this docker/up.sh on your laptop or in another environment?
*Thread Reply:* Docker commands require sudo and cannot be run by another user. The Postgres container is not coming up. It is failing with the following errors:
2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors
*Thread Reply:* and what does docker ps -a say about the postgres container? why did it fail?
*Thread Reply:* hmm, no changes have been made on our side to postgresql.conf since August 2022. Did you apply any changes, or do you have a clean clone of the repo?
*Thread Reply:* No we didn't make any changes
*Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.
*Thread Reply:* Yes, all I did was modify this line:
docker-compose --log-level ERROR $compose_files up $ARGS
to
docker compose $compose_files up $ARGS
since docker compose v2 doesn't support the --log-level flag
*Thread Reply:* Let me pull an older version and try
*Thread Reply:* Still no luck same exact errors. Tried on a different ubuntu instance. Still seeing same errors with postgres
Hi all, a general doubt. Would the column lineage associated with a job be present in both the start events and the complete events? Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?
*Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?
Yes. Generally, events regarding a single run are cumulative
*Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?
*Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message.
*Thread Reply:* Actually, in this case this definitely should not happen. @PaweĆ LeszczyĆski am I right?
*Thread Reply:* @Maciej Obuchowski yes, you're right
Hi All.. Is JDBC supported by OpenLineage and Marquez for column lineage? I did some POC using tables in a Postgres DB and I am able to see all events, but for columnLineage I am getting NULL. Not sure what I am missing.
*Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~
*Thread Reply:* @nivethika R, I am sorry for the misleading response; we've merged the PR for that: https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select **, but besides that, it should be operational.
Could you please try a query from our integration tests to verify if this is working for you or not: https://github.com/OpenLineage/OpenLineage/pull/1636/files#diff-137aa17091138b69681510e13e3b7d66aa9c9c7c81fe8fe13f09f0de76448dd5R46 ?
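*Thread Reply:* For context, a rough PySpark sketch of the kind of JDBC read/write the integration test above covers; the connection details and table names below are placeholders, not values taken from the test:
```
from pyspark.sql import SparkSession

# Hypothetical Postgres connection details -- replace with your own.
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"
props = {"user": "postgres", "password": "postgres", "driver": "org.postgresql.Driver"}

spark = SparkSession.builder.appName("jdbc-column-lineage-check").getOrCreate()

# Read with an explicit column list (per the message above, `select *` is not
# supported for column lineage), then write the result to another table.
src = spark.read.jdbc(jdbc_url, "(select id, name from source_table) src", properties=props)
src.write.jdbc(jdbc_url, "target_table", mode="overwrite", properties=props)
```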
We are trying to install the image on a private AKS cluster and we ended up with the below error:
kubectl : pod marquez/pgsql-postgresql-client terminated (StartError) At line:1 char:1
failed to create containerd task: failed to create shim task: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "PGPASSWORD=macondo": executable file not found in $PATH: unknown
We followed the below article to install Marquez in AKS (Azure). By the way, we pulled the images from Docker and pushed them to our ACR. We tried installing PostgreSQL via ACR and it failed with the error above.
https://github.com/MarquezProject/marquez/blob/main/docs/running-on-aws.md
*Thread Reply:* Hi Nagendra, sorry you're running into this error. We're looking into it!
Hi, found this error in a couple of the spark jobs: https://github.com/OpenLineage/OpenLineage/issues/1930 Would appreciate your help in patching this, thanks!
*Thread Reply:* Hey @Anirudh Shrinivason, Paweł and I are at Berlin Buzzwords right now. Will definitely look at it later
*Thread Reply:* Oh nice! Thanks!
Hi Team, we are not able to generate lineage for aggregate functions while joining two tables. Below is the query: df2 = spark.sql("select th.ProductID as Pid, pd.Name as N, sum(th.quantity) as TotalQuantity, sum(th.ActualCost) as TotalCost from silveradventureworks.transactionhistory as th join productdescription_dim as pd on th.ProductID = pd.ProductID group by th.ProductID, pd.Name ")
*Thread Reply:* This is the event generated for the above query.
this is the event for the view for which no lineage is being generated
Has anyone here successfully implemented the Amundsen OpenLineage extractor? I'm a little confused on the best way to output my lineage events to ndjson files in a scalable way as the docs seem to suggest. Currently I'm pushing all my lineage events to Marquez via REST API. I suppose I could change my transports to Kinesis and write the events to s3, but that comes with the cost of having to build some new way of getting the events to Marquez.
In any case, this seems like a problem someone must have solved before?
Edit: looking at the source code for this Amundsen extractor, it seems like it should be pretty straightforward to just implement our own extractor that can pull these records from the Marquez backend. Will give that a shot and see about getting that merged into Amundsen later.
*Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn
Our New York meetup with Collibra is happening in just two days! https://openlineage.slack.com/archives/C01CK9T7HKR/p1686594956030059
Hello all, do you know if we have the possibility of persisting column order while creating lineage, as it may be available in the table or dataset from which it originates? Or, is there some way in which we can get the column order (an id or something)?
For example, if a dataset has columns xyz, abc, fgh, dec, I would like to know which column shows first in the dataset in the common data model. Please let me know.
*Thread Reply:* Hi Harshini, I've alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.
*Thread Reply:* Thank you Michael, looking forward to it
*Thread Reply:* Hello @Harshini Devathi. An interesting topic which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if the Spark plan contains columns a, b, c, we trust that's the order of columns for a dataset and don't want to check it on our own.
*Thread Reply:* btw. please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?
*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number which helps to represent the lineage in the same order?
We are capturing the lineage using the Spark OpenLineage connector, by posting custom lineage to Marquez through API calls, and we are also in the process of leveraging the SQL connector feature using Airflow.
*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and field name. You can, on your side, generate a column id based on that and order columns based on the id, but still I think I am missing some arguments behind doing so.
Attention all Bay-area data friends and Data+AI Summit attendees: our first San Francisco meetup is next Tuesday! https://www.meetup.com/meetup-group-bnfqymxe/events/293448130/
Last night in New York we held a meetup with Collibra at their lovely HQ in the Financial District! Many thanks to @Sheeri Cabral (Collibra) for inviting us. Over a bunch of tasty snacks (thanks for the idea @Harel Shein), we discussed:
• the history and evolution of the spec, and trends in adoption
• progress on the OpenLineage Provider in Airflow (AIP 53)
• progress on "static" AKA design lineage support (expected soon in OpenLineage 1.0.0)
• progress in the LFAI program
• a proposal to add "jobless run" support for auditing use cases and similar edge cases
• an idea to throw a hackathon for creating validation tests and example payloads (would you be interested in participating? let us know!)
• and more.
Many thanks to:
• @Julien Le Dem for making the trip
• Sheeri & Collibra for hosting
• everyone for coming, including second-timer @Ernie Ostic and new member @Shirley Lu
It was great meeting/catching up with everyone. Hope to see you and more new faces at the next one!
Our first San Francisco meetup is tomorrow at 5:30 PM at Astronomer's offices in the Financial District. https://openlineage.slack.com/archives/C01CK9T7HKR/p1687383708927189
I can't seem to get OL logging working with Spark. Any guidance please?
*Thread Reply:* Is it because the logLevel is set to WARN or ERROR?
*Thread Reply:* No, I set it to INFO, may be I need to add some jars?
*Thread Reply:* Hmm have you set the relevant spark configs?
*Thread Reply:* yep, I have http working. But not the console
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=console
*Thread Reply:* Oh wait http works but not console...
*Thread Reply:* If you want to see the console events which are emitted, then need to set logLevel to DEBUG
*Thread Reply:* Is the openlineage jar installed and added to the config?
*Thread Reply:* the only thing I see in the logs is this:
23/06/27 07:39:11 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerJobEnd
*Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help
*Thread Reply:* sure, thanks for trying @Anirudh Shrinivason
*Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik
*Thread Reply:* Hi @Maciej Obuchowski Actually I also noticed a similar issue... For some spark pipelines, the log level is set to debug, but I'm not seeing any events being logged. I am however receiving these events in the backend. Has any of the logging been removed from some places?
*Thread Reply:* yep, exactly same thing here also @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.
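*Thread Reply:* For anyone comparing notes later, a consolidated sketch of the console-transport setup being discussed in this thread; the master, app name and the assumption that the openlineage-spark jar is already on the classpath are placeholders, not a verified fix:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    .appName("ol-console-debug")
    # assumes the openlineage-spark jar is already available to the driver
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)

# raise verbosity so emitted events are visible in the driver logs, per the suggestion above
spark.sparkContext.setLogLevel("DEBUG")
```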
@here A bunch of us are downstairs in the lobby at 8 California but no one is down here to let us up. Anyone here to help?
Hi guys, I noticed a few of the jobs getting OOMed while running with openlineage. Even increasing the number of executors and doubling the memory does not seem to fix it actually. This is observed especially when using the graphx libs. Is this a known issue? Just curious as to what the cause might be... The same jobs run fine once openlineage is disabled. Are there some rogue threads from the listener or any connections we aren't closing properly?
*Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]
*Thread Reply:* We do serialize logicalPlan because this is useful in many cases, but sometimes can lead to serializing things that shouldn't be serialized
*Thread Reply:* Ahh I see. Yeah okay let me try that
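*Thread Reply:* For reference, a minimal sketch of passing that setting from PySpark (only the OpenLineage-related options are shown; everything else is omitted):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-without-logicalplan")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # skip serializing the Spark logical plan (and the spark_unknown facet) into run facets
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)
```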
Hello all, I'm opening a vote to release OpenLineage 0.29.0, including:
• support for Spark 3.4
• support for Flink 1.17.1
• a fix in the Flink integration to enable dataset schema extraction for a KafkaSource when GenericRecord is used
• removal of the unused Golang proxy client (made redundant by the fluentd proxy)
• security vulnerability fixes, doc changes, test improvements, and more.
Three +1s from committers will authorize an immediate release.
*Thread Reply:* Thanks, all. The release is authorized.
@channel
We released OpenLineage 0.29.2, including:
Added
• Flink: support Flink version 1.17.1 #1947 @pawel-big-lebowski
• Spark: support Spark version 3.4 #1790 @pawel-big-lebowski
Removed
• Proxy: remove unused Golang client approach #1926 @mobuchowski
• Req: bump minimum supported Python version to 3.8 #1950 @mobuchowski
    ◦ Note: this removes support for Python 3.7, which is at EOL.
Plus test improvements, docs changes, bug fixes and more.
Thanks to all the contributors!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.29.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.28.0...0.29.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel The latest issue of OpenLineage News is now available, featuring a recap of recent events, releases, and more. To get it directly in your inbox each month, sign up here: https://openlineage.us14.list-manage.com/track/click?u=fe7ef7a8dbb32933f30a10466&id=e598962936&e=ef0563a7f8
@channel This month's TSC meeting is next Thursday, 7/13, at a special time: 8 am PT. All are welcome! On the tentative agenda:
• announcements
• updates
• recent releases
• a new DataGalaxy integration
• open discussion
Wow, I just got finished watching @Julien Le Dem and @Willy Lulciuc's presentation of OpenLineage at databricks and it's really fantastic! There isn't a better 30 minutes of content on theory + practice than this, IMO. https://www.databricks.com/dataaisummit/session/cross-platform-data-lineage-openlineage/ (you can watch for free by making an account. I'm not affiliated with databricks…)
*Thread Reply:* thanks for watching and sharing! the recording is also on youtube: https://www.youtube.com/watch?v=rO3BPqUtWrI
*Thread Reply:* Very much agree. I've even forwarded it to a few people here and there, those who I think should learn about it.
*Thread Reply:* You're both too kind :) Thank you for your support and being part of the community.
@channel If you registered for TSC meetings through AddEvent, first of all, thank you! Second of all, I've had to create a new event series there to enable the editing of individual events. When you have a moment, would you please register for next week's meeting? Apologies for the inconvenience.
Hi community, we are interested in capturing time-travel usage for Iceberg Spark SQL in column lineage. For instance, INSERT INTO schema.table select ** from schema.another_table version as of 'some_version'. Column lineage is currently missing the version, if used, which is actually quite relevant. I've gone through the open issues and didn't see anything similar. Does it look like a valid use case scenario? We started going through the OL, Iceberg and Spark code trying to capture/expose it, but so far we haven't been able to. If anyone can give a hint/idea/pointer, we are willing to give it a try and contribute back with the code
*Thread Reply:* I think yes this is a great use case. @PaweĆ LeszczyĆski is more familiar with the spark integration code than I. I think in this case, we would add the datasetVersion facet with the underlying Iceberg version: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json We extract this information in a few places: https://github.com/search?q=repo%3AOpenLineage%2FOpenLineage%20%20DatasetVersionDatasetFacet&type=code
*Thread Reply:* Yes, we do have datasetVersion, which captures for output and input datasets their Iceberg or Delta version. Input versions are collected on START while output versions are collected on COMPLETE, in case a job reads and writes to the same dataset. So, even though the column-lineage facet is missing the version, it should be available within events related to a particular run.
If it is not, then perhaps the case here is the lack of support for the as of syntax. As far as I remember, we always get the current version of a dataset and this may be the missing part here.
*Thread Reply:* link to a method that gets dataset version for iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java
*Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński
Based on what I've seen so far, indeed it seems that only the current snapshot is tracked in IcebergHandler.getDatasetVersion(). Initially I was expecting to be able to obtain the snapshotId from the SparkTable which comes within getDatasetVersion(), but now I realize that OL is using an older version of the Iceberg runtime (0.12.1), which does not support time travel (introduced in 0.14.1).
The evidence is:
• Iceberg documentation for release 0.14.1: https://iceberg.apache.org/docs/0.14.0/spark-queries/#sql
• Iceberg release notes: https://iceberg.apache.org/releases/#0140-release
• Comparing the source code, I see the SparkTable from 0.14.1 onward does have a snapshotId instance variable, while previous versions don't:
https://github.com/apache/iceberg/blob/0.14.x/spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L82
https://github.com/apache/iceberg/blob/0.12.x/spark3/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L78
I don't see anyone complaining about the old version of the Iceberg runtime being used, and there is no open issue to upgrade, so I'll open the issue. Please let me know if that seems reasonable as the immediate next step to take
*Thread Reply:* Thanks @Juan Manuel Cappi. The openlineage-spark jar contains modules like spark3, spark32, spark33 and spark34, which is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against the latest Iceberg version. Once this is done, #1969 can be closed. For #1970, one would need to implement a datasetBuilder within the spark34 module that visits the node within Spark's logical plan responsible for as of and creates the dataset for the OpenLineage event another way than getting the latest snapshot version.
*Thread Reply:* @Paweł Leszczyński I've seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but the other versions (spark33, spark32, etc.) have not been upgraded in that PR. Since the change is small and does not break any tests, I've created PR #1976 to fix #1969. That alone unlocks some time travel lineage (i.e. the dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense
*Thread Reply:* Hi @Juan Manuel Cappi, you're right. After discussion with you I realized we do support some version of Iceberg (for Spark 3.3 it's still 0.14.0), but this is not the latest Iceberg version matching the Spark version.
There's a tricky part here. Although we want our code to succeed with the latest Spark, we don't want it to fail in a nasty way (class not found exception) when a user is working with an old Iceberg version. There are places in our code where we check "are Iceberg classes on the classpath?" We need to extend this to "are Iceberg classes on the classpath, and is the Iceberg version above 0.14 or not?"
For sure this is the case for the merge into commands I am working on at the moment. Let's see if the other integration tests are affected in your PR
Hi Team, I've seen that Kafka lineage is not coming through properly for Spark streaming. Are we working on this?
*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies an event being sent when reading or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java
*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more spark plans that are stream related. But, if you had an example of streaming that is not working for you, this would have been really helpful.
*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline
*Thread Reply:* just one task is getting created
Hi guys, I notice that with the below spark configs:
```
from pyspark.sql import SparkSession
import os

os.environ["TEST_VAR"] = "1"

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.29.2,io.delta:delta-core_2.12:1.0.1')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.openlineage.transport.type', 'console')
         .config('spark.sql.catalog.spark_catalog', "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("hive.metastore.schema.verification", False)
         .config("spark.sql.warehouse.dir", "/tmp/")
         .config("hive.metastore.warehouse.dir", "/tmp/")
         .config("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=/tmp/metastore_db;create=true")
         .config("spark.openlineage.facets.custom_environment_variables", "[TEST_VAR;]")
         .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
         .config("spark.hadoop.fs.permissions.umask-mode", "000")
         .enableHiveSupport()
         .getOrCreate())
```
The custom environment variables facet is not kicking in. However, when all the delta-related spark configs are removed, it is working fine. Is this a known issue? Are there any workarounds for it? Thanks!
*Thread Reply:* Hi @Anirudh Shrinivason, I'm not familiar with Delta, but enabling debugging helped me a lot to understand what's going on when things fail silently. Just add at the end:
spark.sparkContext.setLogLevel("DEBUG")
*Thread Reply:* Yeah I checked on debug
*Thread Reply:* There are no errors
*Thread Reply:* Just that there is no environment-properties in the event that is being emitted
*Thread Reply:* Hi @Anirudh Shrinivason, what spark version is that? I see your delta version is pretty old. Anyway, the observation is weird and I don't know how come delta interferes with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?
*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java
*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... delta .. which is azure databricks. @Anirudh Shrinivason
*Thread Reply:* Hmm I wasn't using databricks
*Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw
*Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973
*Thread Reply:* PR description contains info on how come the observed behaviour was possible
*Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue :medal:
*Thread Reply:* Ohh that is really great! Thanks so much for the help!
@channel A friendly reminder: this month's TSC meeting – open to all – is tomorrow at 8 am PT. https://openlineage.slack.com/archives/C01CK9T7HKR/p1688665004736219
Hi Team How are you ? Is there any chance to use airflow to run queries against Access file? Sorry to bother with a question that is not directly related to openlineage ... but I am kind of stuck
*Thread Reply:* what do you mean by Access file?
*Thread Reply:* ... an accdb file, a Microsoft Access file: I am in a reverse engineering project facing spaghetti-style development and would have loved to use airflow and openlineage as a magic wand to help me in this damn work
*Thread Reply:* oof.. I'd look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ but I really have no clue..
*Thread Reply:* Thank you Harel I started from that too ... but it became foggy after the initial step
Hi folks, having an issue ingesting the seed metadata when starting the docker container. The output shows "seed-marquez-with-metadata exited with code 0" but no information is visible in Marquez. What can be the issue?
*Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?
*Thread Reply:* I think that should be added to the quickstart guide
*Thread Reply:* Great idea, thank you
As discussed in the Monthly meeting, I have opened a PR to propose adding deletion to facets for static lineage metadata: https://github.com/OpenLineage/OpenLineage/pull/1975
Hi, I'm using OL python client.
client.emit(
    DatasetEvent(
        eventTime=datetime.now().isoformat(),
        producer=producer,
        schemaURL="https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/DatasetEvent",
        dataset=Dataset(namespace=namespace, name=f"input-file"),
    )
)
I want to send a dataset event once files have been uploaded. But I received a 422 from api/v1/lineage, saying that run and job must not be null. I don't have a job or run yet. How can I solve this?
*Thread Reply:* Hi @Steven, I assume you send your OpenLineage events to Marquez. The 422 http code is a response from the backend, and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.
To sum up: what you do is correct. You are using a feature that is allowed on a client side but still not implemented on a backend.
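*Thread Reply:* Until that Marquez change lands, one possible workaround sketch is to wrap the dataset in a regular RunEvent with a synthetic job and run; the namespace, job name and Marquez URL below are made-up placeholders:
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="my-namespace", name="file-upload"),  # hypothetical job name
        producer="https://github.com/OpenLineage/OpenLineage/tree/main/client/python",
        inputs=[],
        outputs=[Dataset(namespace="my-namespace", name="input-file")],
    )
)
```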
@here Hi Team, I am trying to run a spark application with OpenLineage. Spark: 3.3.3, OpenLineage: 0.29.2. I am getting the below error; can you please help me figure out what I could be doing wrong.
```
spark = (SparkSession
         .builder
         .config('spark.port.maxRetries', 100)
         .appName(app_name)
         .config("spark.openlineage.url", "http://localhost/api/v1/namespaces/spark_integration/")
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .getOrCreate())

23/07/14 18:04:01 ERROR Utils: uncaught error in thread spark-listener-group-shared, stopping SparkContext
java.lang.UnsatisfiedLinkError: /private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib: dlopen(/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib, 0x0001): tried: '/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (no such file), '/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
```
*Thread Reply:* Hi @Harshit Soni, where are you deploying your spark? Locally or not? Is it on a mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue
*Thread Reply:* Currently, I was testing locally.
*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for the same using OpenLineage.
@channel If you missed this month's TSC meeting, the recording is now available on our YouTube channel: https://youtu.be/2vD6-Uwr7ZE. A clip of Alexandre Bergere's DataGalaxy integration demo is also available: https://youtu.be/l_HbEtpXphY.
Hey guys - trying to get a grip on the ecosystem regarding flink lineage. As far as my research has revealed, the openlineage project is the only one that supports flink lineage with an out-of-the-box library that can be integrated in jobs; at least as far as I've seen, for other toolings such as datahub we'd have to write our own custom hooks that implement their api. As for my question - is my current assumption correct that an integration into the openlineage project of, for example, datahub/openmetadata would also require support from datahub/openmetadata itself so that they can work with the openlineage spec? Or would it somewhat work to write a mapper in between to support their spec? (more of an architectural decision I assume, but I would be interested in knowing what the openlineage approach is regarding that)
*Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? I think yeah - maybe https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter would work out of the box if I understand correctly?
*Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate
*Thread Reply:* Hi, We've implemented a transporter that transmits lineage from OpenLineage to OpenMetadata, you can find the github project here. I've also published a blog post that explains this integration and how to use it. I'll be happy to help if you have any question
*Thread Reply:* very cool! thanks a lot for responding so quickly
We recently hit the 1000-member mark on here! Thank you for joining the movement to establish an open standard for data lineage across the data ecosystem! Tell your friends! 💯💯💯💯💯💯💯💯💯💯 https://bit.ly/lineageslack
Btw, just curious what exactly does the runId correspond to in the OL spark integration? Is it possible to obtain the spark application id from the event too?
*Thread Reply:* runId is a UUID assigned per spark action (compute trigger within a spark job). A single spark script can result in multiple runs then
*Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String
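*Thread Reply:* (Until such a facet exists, the id itself is easy to grab on the job side; a tiny PySpark sketch:)
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# e.g. "local-1689600000000" locally, or "application_<ts>_<id>" on YARN
print(spark.sparkContext.applicationId)
```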
*Thread Reply:* Got it thanks!
Hi, I have a use case to integrate queries run in a Jupyter notebook using pandas with OpenLineage, to get the lineage in Marquez. Did anyone implement this before? Please let me know. Thanks
*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, then there is perhaps little that OpenLineage (OL) can do. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108
Surely one can use the Python OL client to create events manually and send them to MQZ, which may be less convenient (https://github.com/OpenLineage/OpenLineage/tree/main/client/python)
Anyway, we would like to know what's your use case? This would be super helpful in understanding why an OL & pandas integration may be useful.
*Thread Reply:* Thanks Pawel for responding
Hi guys, when can we expect the next Openlineage release? Excited for MergeIntoCommand column lineage feature!
*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex because I want to refactor column level lineage so that it can work with multiple Spark versions and multiple delta jars, as the merge into implementation for delta differs between delta releases. I thought it was ready, but this needs some extra work to be done in the next days. I am excited about that too!
*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all!
*Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!
*Thread Reply:* we're pretty close I think with merge into delta, which is under review. waiting for it would be nice. anyway, we're 3 weeks after the last release.
*Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching #1958 and then making a request in #general once it's been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!
*Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński
*Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958
*Thread Reply:* Awesome thanks so much! @Paweł Leszczyński
Hi there, related to my question a few days ago about the usage of time travel in Iceberg: currently only the alias used (i.e. tag, branch) is captured as part of the dataset identifier for lineage. If the tag is removed, or even worse, if it's removed and re-created with the same name pointing to a different snapshot_id, the lineage will be capturing an inaccurate history. So, ideally, we'd like to capture the actual snapshot_id behind the named reference as part of the lineage. Anyone else thinking this is a reasonable scenario? => more in 🧵
*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it's explicit and it doesn't require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that it impacts readability. Not sure though if there are other non-obvious side-effects.
Another alternative would be to add a dedicated property. For instance, in the job > latestRun schema, the input/output dataset version objects could look like this:
"inputDatasetVersions": [
{
"datasetVersionId": {
"namespace": "<s3a://warehouse>",
"name": "schema.table.tag",
"snapshot_id": "7056736771450556218",
"version": "1c634e18-e357-347b-b758-4337ac352d6d"
},
"facets": {}
}
]
And column lineage could look like:
```"columnLineage": [
{
"name": "somefield",
"inputFields": [
{
"namespace": "s3a:warehouse",
"dataset": "schema.table.tag",
"snapshotid": "7056736771450556218",
"field": "some_field",
...
},
...
],
...```
*Thread Reply:* @Paweł Leszczyński what do you think?
*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version? 2. Isn't the snapshot id already available within the inputs of the OL event related to this run?
*Thread Reply:* Yes, I think I follow the idea. The problem with that is the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1, which stays the same for the source dataset, which is the one being used with time travel.
Suppose the following sequence:
step 1 =>
table_A.tag_v1 has snapshot id 123-abc
run job: table_A.tag_v1 -> job x -> table_B
the inputDatasetVersions > datasetVersionId > version for table_B points to an object which represents table_A.tag_v1 with snapshot id 123-abc correctly captured within facets > version > datasetVersion
step 2 =>
delete tag_v1, insert some data, create tag_v1 again
now table_A.tag_v1 has snapshot id 456-def
run job again: table_A.tag_v1 -> job x -> table_B
the inputDatasetVersions > datasetVersionId > version for table_B points to the same object which represents table_A.tag_v1, only now the snapshot id has been replaced by 456-def within facets > version > datasetVersion, which means I don't have a way to know which snapshot id was used in step 1
The "hack" I mentioned above though seems to solve the issue, since a new dataset is captured for each combination, so no information is overwritten/lost, i.e., the datasets referenced in inputDatasetVersions are now named:
table_A.tag_v1-123-abc
table_A.tag_v1-456-def
As a side effect, the column lineage also gets "fixed": without the "hack", the lineage for the step 1 and step 2 job runs both referenced table_A.tag_v1 as the source of the input field, though in each run the snapshot id was different. With the hack, one run references table_A.tag_v1-123-abc and the other one table_A.tag_v1-456-def.
Hope it makes sense. If it helps, I can put together a few json files with the examples I've been using to experiment
*Thread Reply:* So, my understanding of the problem is that the iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.
I would not like to mess with dataset names because on the backend sides like Marquez, dataset names being the same in different jobs and runs allow creating lineage graph. If dataset names are different, then there is no way to build lineage graph across multiple jobs.
Adding snapshot_id to datasetVersion is one option to go. My concern here is that this is so iceberg-specific, while we're aiming to have a general solution to dataset versioning.
Some other options are: send a concat of version+snapshotId as a version, or send only snapshot_id as a version. The second ain't that bad, as actually snapshotId is something we're aiming to get as a version, isn't it?
Hi guys, I'd like to open a vote to release the next OpenLineage version! We'd really like to use the fixed CustomEnvironmentFacetBuilder for delta catalogs, and column lineage for the Merge Into command in the spark integration! Thanks!
*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.
*Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn't working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we're working on a fix and hope to complete the release soon.
*Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience.
*Thread Reply:* Hi @Michael Robinson Thanks a lot!
*Thread Reply:* > I am running a job in Marquez with 180 rows of metadata
Do you mean that you have +100 rows of metadata in the jobs table for Marquez? Or that the job never finishes?
*Thread Reply:* If you post a sample of your events, it'd be helpful to troubleshoot your issue
*Thread Reply:* Sure Willy thanks for your response. The job is still running. This is the code I am running from jupyter notebook using Python client:
*Thread Reply:* as you can see my input and output datasets are just 1 row
*Thread Reply:* included column lineage but job keeps running so I don't know if it is working
Please ignore 'UPDATED AT' timestamp
@Paweł Leszczyński there is a lot of interest in our organisation to implement OpenLineage in several projects, and we might take the spark route, so on that note a small question: does OpenLineage work by extracting data from the Catalyst optimiser's physical/logical plans etc.?
*Thread Reply:* spark integration is based on extracting lineage from optimized plans
*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few minutes that explain how this is achieved (the link points to the 22:06 mark of the video)
*Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.
Hello everyone!
*Thread Reply:* Welcome, @Jens Pfau!
hello everyone! I am trying to follow your guide https://openlineage.io/docs/integrations/spark/quickstart_local and when I execute spark.createDataFrame([ {'a': 1, 'b': 2}, {'a': 3, 'b': 4} ]).write.mode("overwrite").saveAsTable("temp1")
I am not getting the expected result:
```
23/07/23 12:35:20 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan ==
'CreateTable `temp1`, Overwrite
+- LogicalRDD [a#6L, b#7L], false
== Analyzed Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- *(1) Scan ExistingRDD[a#6L,b#7L]
] with input dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>]
23/07/23 12:35:20 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan ==
'CreateTable `temp1`, Overwrite
+- LogicalRDD [a#6L, b#7L], false
== Analyzed Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- *(1) Scan ExistingRDD[a#6L,b#7L]
] with output dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>]
23/07/23 12:35:20 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
23/07/23 12:35:20 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
23/07/23 12:35:20 ERROR EventEmitter: Could not emit lineage w/ exception
io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException
at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105)
at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71)
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:77)
at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:99)
at java.base/java.util.Optional.ifPresent(Optional.java:183)
at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:99)
at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:90)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException
at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187)
at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:100)
... 21 more
Caused by: io.openlineage.spark.shaded.org.apache.http.ProtocolException: Target host is not specified
at io.openlineage.spark.shaded.org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(DefaultRoutePlanner.java:71)
at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.determineRoute(InternalHttpClient.java:125)
at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
... 24 more
23/07/23 12:35:20 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
23/07/23 12:35:20 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/07/23 12:35:20 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/07/23 12:35:20 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/07/23 12:35:20 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/07/23 12:35:20 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/07/23 12:35:20 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
23/07/23 12:35:20 INFO CodeGenerator: Code generated in 120.989125 ms
23/07/23 12:35:21 INFO SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:0
23/07/23 12:35:21 INFO DAGScheduler: Got job 0 (saveAsTable at NativeMethodAccessorImpl.java:0) with 1 output partitions
23/07/23 12:35:21 INFO DAGScheduler: Final stage: ResultStage 0 (saveAsTable at NativeMethodAccessorImpl.java:0)
23/07/23 12:35:21 INFO DAGScheduler: Parents of final stage: List()
23/07/23 12:35:21 INFO DAGScheduler: Missing parents: List()
23/07/23 12:35:21 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan ==
'CreateTable `temp1`, Overwrite
+- LogicalRDD [a#6L, b#7L], false
== Analyzed Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- *(1) Scan ExistingRDD[a#6L,b#7L]
] with input dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>]
23/07/23 12:35:21 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan ==
'CreateTable `temp1`, Overwrite
+- LogicalRDD [a#6L, b#7L], false
== Analyzed Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- *(1) Scan ExistingRDD[a#6L,b#7L]
] with output dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>]
23/07/23 12:35:21 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
23/07/23 12:35:21 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand `temp1`, Overwrite, [a, b]
+- LogicalRDD [a#6L, b#7L], false
23/07/23 12:35:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[10] at saveAsTable at NativeMethodAccessorImpl.java:0), which has no missing parents
23/07/23 12:35:21 ERROR EventEmitter: Could not emit lineage w/ exception
io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException
at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105)
at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71)
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:174)
at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$9(OpenLineageSparkListener.java:153)
at java.base/java.util.Optional.ifPresent(Optional.java:183)
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:149)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException
at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187)
at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:100)
... 20 more
Caused by: io.openlineage.spark.shaded.org.apache.http.ProtocolException: Target host is not specified
at io.openlineage.spark.shaded.org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(
```
23/07/23 12:35:20 ERROR EventEmitter: Could not emit lineage w/ exception
io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException
at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105)
at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71)
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:77)
at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:99)
*Thread Reply:* That looks like your URL provided to OpenLineage is missing http://
or https://
in the front
*Thread Reply:* sorry, how can i resolve this? do i need to add this? i just followed the guide step by step. You don't mention anywhere to add anything. You provide something that
*Thread Reply:* really does not work out of the box
*Thread Reply:* and this is supposed to be a demo
*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2
seems to be fixing the issue
not sure why it stopped working for 0.12.0 but we'll take a look and fix accordingly
*Thread Reply:* ...probably by bumping the version on this page đ
*Thread Reply:* thank you both for coming back to me, I bumped to 0.29 and i think that it now runs. Is this the expected output?
23/07/24 08:43:55 INFO ConsoleTransport: {"eventTime":"2023_07_24T08:43:55.941Z","producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunEvent>","eventType":"COMPLETE","run":{"runId":"186c06c0_e79c_43cf_8bb7_08e1ab4c86a5","facets":{"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand","num-children":1,"table":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTable","identifier":{"product-class":"org.apache.spark.sql.catalyst.TableIdentifier","table":"temp2","database":"default"},"tableType":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTableType","name":"MANAGED"},"storage":{"product_class":"org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat","compressed":false,"properties":null},"schema":{"type":"struct","fields":[]},"provider":"parquet","partitionColumnNames":[],"owner":"","createTime":1690188235517,"lastAccessTime":-1,"createVersion":"","properties":null,"unsupportedFeatures":[],"tracksPartitionsInCatalog":false,"schemaPreservesCase":true,"ignoredProperties":null},"mode":null,"query":0,"outputColumnNames":"[a, b]"},{"class":"org.apache.spark.sql.execution.LogicalRDD","num_children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":12,"jvmId":"173725f4_02c4_4174_9d18_3a61aa311d62"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":13,"jvmId":"173725f4-02c4-4174-9d18-3a61aa311d62"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product_class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,"session":null}]},"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.1.2","openlineage_spark_version":"0.29.2"}}},"job":{"namespace":"default","name":"sample_spark.execute_create_data_source_table_as_select_command","facets":{}},"inputs":[],"outputs":[{"namespace":"file","name":"/home/jovyan/spark-warehouse/temp2","facets":{"dataSource":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>","name":"file","uri":"file"},"schema":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"symlinks":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>","identifiers":[{"namespace":"/home/jovyan/spark-warehouse","name":"def
ault.temp2","type":"TABLE"}]},"lifecycleStateChange":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>","lifecycleStateChange":"CREATE"}},"outputFacets":{}}]}
? Also i then proceeded to run
docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1
but the page is empty
*Thread Reply:* You'd need to set up spark.openlineage.transport.url
to send OpenLineage events to Marquez
*Thread Reply:* where and how can i do this?
*Thread Reply:* do i need to edit the conf ?
*Thread Reply:* yes, in the spark conf
*Thread Reply:* what this url should be ?
*Thread Reply:* http://localhost:3000/ ?
*Thread Reply:* That depends how you ran Marquez, but looking at your screenshot UI is at 3000, I guess API would be at 5000
*Thread Reply:* as that's default in Marquez docker-compose
*Thread Reply:* i cannot see spark conf
*Thread Reply:* is it in there or do i need to create it ?
*Thread Reply:* Is something like
```from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.29.2')
    .config('spark.openlineage.transport.url', 'http://marquez:5000')
    .config('spark.openlineage.transport.type', 'http')
    .getOrCreate())```
not working?
*Thread Reply:* so i cannot see any details of the job
*Thread Reply:* yes i will tell you
*Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the Marquez_host which I am not sure if I have to or not. The UI is running but not showing anything docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0
*Thread Reply:* is because you are running this command right
*Thread Reply:* yes thats it
*Thread Reply:* you need 0.40
*Thread Reply:* and there is a lot of stuff
*Thread Reply:* you need to change
*Thread Reply:* in the Docker
*Thread Reply:* so the spark
*Thread Reply:* version
*Thread Reply:* the python
*Thread Reply:*
```
version: "3.10"
services:
  notebook:
    image: jupyter/pyspark-notebook:spark-3.4.1
    ports:
      - "8888:8888"
    volumes:
      - ./docker/notebooks:/home/jovyan/notebooks
      - ./build:/home/jovyan/openlineage
    links:
      - "api:marquez"
    depends_on:
      - api

  api:
    image: marquezproject/marquez
    container_name: marquez-api
    ports:
      - "5000:5000"
      - "5001:5001"
    volumes:
      - ./docker/wait-for-it.sh:/usr/src/app/wait-for-it.sh
    links:
      - "db:postgres"
    depends_on:
      - db
    entrypoint: [ "./wait-for-it.sh", "db:5432", "--", "./entrypoint.sh" ]

  db:
    image: postgres:12.1
    container_name: marquez-db
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
      - MARQUEZ_DB=marquez
      - MARQUEZ_USER=marquez
      - MARQUEZ_PASSWORD=marquez
    volumes:
      - ./docker/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh
    # Enables SQL statement logging (see: https://www.postgresql.org/docs/12/runtime-config-logging.html#GUC-LOG-STATEMENT)
    # command: ["postgres", "-c", "log_statement=all"]
```
*Thread Reply:* this is how mine looks
*Thread Reply:* it is all tested and the latest versions
*Thread Reply:* postgres does not work beyond 12
*Thread Reply:* if you run this docker-compose up
*Thread Reply:* the notebooks
*Thread Reply:* are 10 faster
*Thread Reply:* and give no errors
*Thread Reply:* also you need to update other stuff
*Thread Reply:* such as
*Thread Reply:* don't run what is in the docs
*Thread Reply:* but run what is in github
*Thread Reply:* run in your notebooks what is in here
*Thread Reply:*
```from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.1.0')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.transport.url', 'http://{openlineage.client.host}/api/v1/namespaces/spark_integration/')
    .getOrCreate())```
*Thread Reply:* they don't update the documentation
*Thread Reply:* it took me 4 weeks to get here
is this a known error ? does anyone know how to debug this ?
Hi,
Using Marquez. I tried to get the dataset version through two apis.
First:
http://host/api/v1/namespaces/{namespace}/datasets/{dataset}
It will include a currentVersion in the response.
Then:
http://host/api/v1/namespaces/{namespace}/datasets/{dataset}/versions/{currentVersion}
But the version used here refers to the "version" column in table dataset_versions
. Not the primary key "uuid". Which leads to 404 not found.
I checked other apis but seemed that there are no other way to get the version through "currentVersion".
Any help?
*Thread Reply:* Like I want to change the facets of a specific dataset.
*Thread Reply:* @Willy Lulciuc do you have any idea? đ
*Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.
*Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we'd like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version
column wasn't meant to be used externally, but internally within Marquez. The issue is 'minor' as it's more of a pointer thing. We'll be addressing it soon. For some background (and a workaround sketch below), you can look at:
• https://github.com/MarquezProject/marquez/issues/2071
• https://github.com/MarquezProject/marquez/pull/2153
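For anyone hitting the same 404 in the meantime, a small workaround sketch: instead of dereferencing currentVersion directly, list the dataset's versions and use the version ids returned there. This is not an official recipe; the endpoint path follows the Marquez API docs, and the base URL is an assumption for a local install.
```python
import requests

MARQUEZ_URL = "http://localhost:5000"  # assumption: local Marquez API

def list_dataset_versions(namespace: str, dataset: str) -> list:
    # Lists all versions of a dataset; each entry carries its own `version` id
    # that can then be used with the .../versions/{version} endpoint.
    resp = requests.get(
        f"{MARQUEZ_URL}/api/v1/namespaces/{namespace}/datasets/{dataset}/versions"
    )
    resp.raise_for_status()
    return resp.json().get("versions", [])

for v in list_dataset_versions("my_namespace", "my_dataset"):
    print(v.get("version"), v.get("createdAt"))
```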
Hi, Are there any keys to set in marquez.yaml to skip db initialization and use existing db? I am deploying the marquez client on k8s client, which uses a cloud postgres. Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR
*Thread Reply:* @Steven ahh very good point, it's technically not 'error' in the true sense, but annoying nonetheless. I think you're referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37
*Thread Reply:* hmm, actually what raises the error you're referencing? the Marquez http server?
*Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR
This shouldn't be an error. I'm trying to understand the scenario in which this error is thrown (any info is helpful). We use flyway to manage our db schema, but you may have gotten in an odd state somehow
For Databricks notebooks, does the Spark listener work without any notebook changes? (I see that Azure Databricks -> Purview needs no changes, but I'm not sure if that applies anywhere... e.g. if I have an existing databricks notebook, and I add a spark listener, can I get column-level lineage? or do I need to change my notebook to use openlineage libraries, like I do with an arbitrary Python script?)
*Thread Reply:* Nope, one should modify the cluster as per doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks>
but no changes in notebook are required.
*Thread Reply:* Right, great, thatâs exactly what I was hoping đ
@channel
We released OpenLineage 0.30.1, including:
Added
• Flink: support Iceberg sinks #1960 @pawel-big-lebowski
• Spark: column-level lineage for merge into on delta tables #1958 @pawel-big-lebowski
• Spark: column-level lineage for merge into on Iceberg tables #1971 @pawel-big-lebowski
• Spark: add support for Iceberg REST catalog #1963 @juancappi
• Airflow: add possibility to force direct-execution based on environment variable #1934 @mobuchowski
• SQL: add support for Apple Silicon to openlineage-sql-java #1981 @davidjgoss
• Spec: add facet deletion #1975 @julienledem
• Client: add a file transport #1891 @alexandre bergere
Changed
• Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
• Python: rename config to config_class #1998 @mobuchowski
Plus test improvements, docs changes, bug fixes and more.
Thanks to all the contributors, including new contributors @davidjgoss, @alexandre bergere and @Juan Manuel Cappi!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.30.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.29.2...0.30.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hello everyone! I'm part of a team trying to integrate OpenLineage
and Marquez
with multiple tools in our ecosystem. Integration with Spark and Iceberg was fairly easy with the listener you guys developed. We are now trying to integrate with Ray
and we are having some trouble there. I was wondering if anybody has tried any work in that direction, so we can chat and exchange ideas. Thank you!
*Thread Reply:* This is the first I've heard of someone trying to do this, but others have tried getting lineage from pandas. There isn't support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.
*Thread Reply:* Thank you for your response. We have implemented the 'manual way' of emitting events with the python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none.
*Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor. I don't know how it works or is deployed, but I would recommend checking if there's a robust way of being notified in the runtime about processing occurring there.
*Thread Reply:* Thank you for the tip. That's the kind of detail I'm looking for, but couldn't find yet.
Hi, does anyone have experience integrating OpenLineage and Marquez with Keboola? I am new to OpenLineage and struggling with the KBC component configuration.
*Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?
*Thread Reply:* Hi, apologies - the vacation period has hit me. However, here are the resources:
API endpoint: https://app.swaggerhub.com/apis-docs/keboola/job-queue-api/1.3.4#/Jobs/getJobOpenApiLineage
Dedicated component to push data into openlineage (Marquez instance): https://components.keboola.com/components/keboola.wr-openlineage
Hi folks. I'm looking to find the complete spec in openapi
format. For example, if I want to find the complete spec of 1.0.5
, where would I find that? I've looked here: https://openlineage.io/apidocs/openapi/ however when I download the spec, things are missing, specifically the facets. This makes it difficult to generate clients / backend interfaces from the (limited) openapi spec.
*Thread Reply:* +1, I could also really use this!
*Thread Reply:* Found a way: you download it as json in the above link ("Download OpenAPI specification"), then if you copy paste it to editor.swagger.io it asks if you want to convert to yaml :)
*Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec, beyond 1.0.1 - which means its hard to know which revisions of the facets belong to which version of the spec.
*Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case
*Thread Reply:* Granted. If the spec follows backwards compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you can not remove existing fields, you can not modify existing fields, etc.
*Thread Reply:* We don't have facets with newer version than 1.1.0
*Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs
*Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case
Is using https://github.com/OpenLineage/OpenLineage/tree/main/spec not enough? We have separate files with facet definitions to be able to evolve them separately from the main spec
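If it helps, a small sketch of pulling the core spec plus an individual facet schema straight from that repo directory. The raw file paths here are assumptions based on the current spec/ layout and may change:
```python
import json
import urllib.request

BASE = "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec"

def fetch_json(path: str) -> dict:
    # Fetch a single schema file from the OpenLineage spec directory.
    with urllib.request.urlopen(f"{BASE}/{path}") as resp:
        return json.load(resp)

core = fetch_json("OpenLineage.json")
# Example facet schema; the facets/ directory holds one file per facet.
schema_facet = fetch_json("facets/SchemaDatasetFacet.json")
print(core.get("$id"), schema_facet.get("$id"))
```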
*Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to want to evolve the facets independently from the main spec, yet I keep running into a mental wall.
If I say, 'My application is compatible with OpenLineage 1.0.5' - what does that mean exactly? Does it mean that I am at least compatible with the base definition of RunEvent
and its nested components, but not facets?
That's what I'm finding difficult to wrap my head around. Right now, I can not define (for my own sake and the sake of my org) what 'OpenLineage 1.0.5' means.
When I read the Marquez source code, I see that they state they implement 1.0.5, but again, it isn't clear what that completely entails.
I hope I am making sense.
*Thread Reply:* If I approach this from a conventional software engineering standpoint, where I provide a library to my consumers. The library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or evolving the API of my objects it means something has changed and the new version is there to indicate that.
*Thread Reply:* Yes - it means you can read and understand base spec. Facets are completely optional - reading them might provide you additional information, but you as a event consumer need to define what you do with them. Basically, the needs can be very different between consumers, spec should not define behavior of a consumer.
*Thread Reply:* OK. Thanks for the clarification. That clears things up for me.
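To make that point concrete, here is a minimal consumer-side sketch (not an official API): read the base RunEvent fields you rely on and treat facets as an optional, open-ended map, ignoring any you don't recognize. Field names follow the base spec; the facet handling is illustrative only.
```python
import json

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    # Base spec fields: this is what "compatible with the core spec" buys you.
    event_type = event.get("eventType")
    job = event["job"]["namespace"] + "." + event["job"]["name"]
    run_id = event["run"]["runId"]
    print(f"{event_type} {job} run={run_id}")

    # Facets are optional; consume the ones you understand, skip the rest.
    for name, facet in (event["run"].get("facets") or {}).items():
        if name == "nominalTime":
            print("nominal start:", facet.get("nominalStartTime"))
        # unknown facets are simply ignored
```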
This month's issue of OpenLineage News was just sent out. Please
Hello, I request an OpenLineage release, especially for two things:
• Snowflake/HTTP/Airflow bugfix: https://github.com/OpenLineage/OpenLineage/pull/2025
• Spec: removing refs from core: https://github.com/OpenLineage/OpenLineage/pull/1997
Three approvals from committers will authorize the release. @Michael Robinson
*Thread Reply:* Thanks, @Maciej Obuchowski
*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.
@channel
We released OpenLineage 1.0.0, featuring static lineage capability!
Added:
• Airflow: convert lineage from legacy File definition #2006 @Maciej Obuchowski
Removed:
• Spec: remove facet ref from core #1997 @JDarDagran
Changed
• Airflow: change log level to DEBUG when extractor isn't found #2012 @kaxil
• Airflow: make sure we cannot fail in thread despite direct execution #2010 @Maciej Obuchowski
Plus test improvements, docs changes, bug fixes and more.
See prior releases for additional changes related to static lineage.
Thanks to all the contributors, including new contributors @kaxil and @Mars Lan!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.0.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.30.1...1.0.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
hi folks! so happy to see that static lineage is making its way through OL. one question: is the OpenAPI spec up to date? https://openlineage.io/apidocs/openapi/ IIUC, proposal 1837 says that JobEvent
and DatasetEvent
can be emitted independently from RunEvent
s now, but it's not clear how this affected the spec.
I see the Python client https://pypi.org/project/openlineage-python/1.0.0/ includes these changes already, so I assume I can go ahead and use it already? (I'm also keeping tabs on https://github.com/MarquezProject/marquez/issues/2544)
*Thread Reply:* I think the apidocs are not up to date đ
*Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec
*Thread Reply:* thanks for the pointer @Maciej Obuchowski
*Thread Reply:* Also working on updating the apidocs
*Thread Reply:* The API docs are now up to date @Juan Luis Cano RodrĂguez! Thank you for raising this issue.
@channel If you can, please join us in San Francisco for a meetup at Astronomer on August 30th at 5:30 PM PT. On the agenda: a presentation by special guest @John Lukenoff plus updates on the Airflow Provider, static lineage, and more. Food will be provided, and all are welcome. Please RSVP at https://www.meetup.com/meetup-group-bnfqymxe/events/295195280/ to let us know you're coming.
Hey, I hope this is the right channel for this kind of question - I'm running tests to integrate Airflow (2.4.3) with Marquez (OpenLineage 0.30.1). Currently, I'm testing the postgres operator, and for some reason queries like 'Copy' and 'Unload' are being sent as events but don't appear in the graph. Any idea how to solve it?
You can see attached
*Thread Reply:* I think our underlying SQL parser does not handle the Postgres versions of those queries
*Thread Reply:* Can you post the (anonymized?) queries?
*Thread Reply:* for example
copy bi.marquez_test_2 from '******' iam_role '**********' delimiter as '^' gzi
*Thread Reply:* @Zahi Fail iam_role
suggests you want redshift version of this supported, not Postgres one right?
*Thread Reply:* @Maciej Obuchowski hey, actually I tried both the Postgres and Redshift to S3 operators. Both of them sent a new event through OL to Marquez, and still weren't part of the entire flow.
Hey team! đ
We were exploring open-lineage and had a couple of questions:
*Thread Reply:* Hey @Athitya Kumar,
*Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration; and how customising/improving them would look like
*Thread Reply:* sqlparser-rs is a rust libary and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets, column lineage information from SQL
*Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[âŠ]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java
*Thread Reply:* I think 3 question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?
*Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection
*Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan and process its nodes. From the root node we can take the output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor.
You can extend that, for example for additional types of nodes that we don't support, by implementing your own QueryPlanVisitor, then implementing OpenLineageEventHandlerFactory and packaging this into a .jar deployed alongside the OpenLineage jar - this would be loaded by us using Java's ServiceLoader.
*Thread Reply:* @Maciej Obuchowski @PaweĆ LeszczyĆski - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs
that's used internally by open-lineage: we see that these are the SQL dialects supported by sqlparser-rs here doesn't include spark-sql / presto-sql dialects which means they'd fallback to generic dialect:
"--ansi" => Box::new(AnsiDialect {}),
"--bigquery" => Box::new(BigQueryDialect {}),
"--postgres" => Box::new(PostgreSqlDialect {}),
"--ms" => Box::new(MsSqlDialect {}),
"--mysql" => Box::new(MySqlDialect {}),
"--snowflake" => Box::new(SnowflakeDialect {}),
"--hive" => Box::new(HiveDialect {}),
"--redshift" => Box::new(RedshiftSqlDialect {}),
"--clickhouse" => Box::new(ClickHouseDialect {}),
"--duckdb" => Box::new(DuckDbDialect {}),
"--generic" | "" => Box::new(GenericDialect {}),
Any idea on how much coverage generic dialect provides for spark-sql / how different they are etc?
*Thread Reply:* spark-sql integration is based on spark LogicalPlan's tree. Extracting input/output datasets from tree nodes which is more detailed than sql parsing
*Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries
*Thread Reply:* @PaweĆ LeszczyĆski - Got it, and would you be able to point me to where within the openlineage-spark integration do we:
*Thread Reply:* For example, we'd like to understand which dialectname
of sqlparser-rs would be used in which scenario by open-lineage and what're the interactions b/w open-lineage & sqlparser-rs
*Thread Reply:* @PaweĆ LeszczyĆski - Incase you missed the above messages ^
*Thread Reply:* Sqlparser-rs is used within Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...)
, instead of SQL parsing, we rely on the logical plan of a job and extract information from it. For jdbc queries, which use sqlparser-rs, the dialect is extracted from the url:
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java#L69
Hi.. Is column lineage available for spark version 2.4.0?
*Thread Reply:* No, it's not.
*Thread Reply:* Is it only available for spark version 3+?
Hi, I will really appreciate it if I can learn how the community has been able to harness the spark integration. In our testing, where a spark application writes to S3 multiple times (different locations), OL generates the same job name for all writes (namespacename.execute_insert_into_hadoop_fs_relation_command), rendering the OL graph's final output less helpful. Say, for example, I have a series of transformations/writes 5 times; in the lineage graph we are just seeing the last one. There is an open bug and hopefully it will be resolved soon.
Curious how much adoption the OL spark integration has in the presence of that bug, as generating the same name for a job makes it less usable for anything other than a trivial one-output application.
Example from a 2-write application. EXPECTED: the first produces the weather dataset and the subsequent produces weather40 (generated/mocked using 2 spark apps) (1st image). ACTUAL OL: weather40 - see only the last one (2nd image).
Will really appreciate community guidance on how successful they have been in utilizing the spark integration (vanilla, not Databricks). Thank you
Expected. vs Actual.
@channel This month's TSC meeting is this Thursday, August 10th at 10:00 a.m. PT. On the tentative agenda:
• announcements
• recent releases
• Airflow provider progress update
• OpenLineage 1.0 overview
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.
I can't see output when using saveAsTable with 100+ columns in spark. Any help or ideas for this issue? Really thanks.
*Thread Reply:* Does this work with similar jobs, but with small amount of columns?
*Thread Reply:* thanks for the reply @Maciej Obuchowski - yes, it works for a small amount of columns but does not work with a big amount of columns
*Thread Reply:* one more question: how much data the jobs approximately process and how long does the execution take?
*Thread Reply:* ah... it's like 20 min ~ 30 min, and the various data sizes are like 2000,0000 rows with 100 ~ 1000 columns
*Thread Reply:* that's interesting. we could prepare integration test for that. 100 cols shouldn't make a difference
*Thread Reply:* honestly sorry for the typo, it's 1000 columns
*Thread Reply:* i checked - it works well for small numbers of columns
*Thread Reply:* if it's 1000, then maybe we're over event size - event is too large and backend can't accept that
*Thread Reply:* maybe debug logs could tell us something
*Thread Reply:* i'll be doing spark.sparkContext.setLogLevel("DEBUG")
*Thread Reply:* are there any errors in the logs? perhaps pivoting uses contains nodes in SparkPlan that we don't support yet
*Thread Reply:* did you check pivoting that results in less columns?
*Thread Reply:* @ì¶ížêŽ would also be good to disable logicalPlan
facet:
spark.openlineage.facets.disabled: [spark_unknown;spark.logicalPlan]
in spark conf
*Thread Reply:* got it - can't we do it in the python config?
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.initialExecutors", "5") \
.config("spark.openlineage.facets.disabled", [spark_unknown;spark.logicalPlan]
*Thread Reply:* .config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.initialExecutors", "5") \
.config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]"
*Thread Reply:* ah... there are no errors nor debug level issues; it successfully Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
*Thread Reply:* maybe df.groupBy(some_column).pivot(some_column).agg(**agg_cols) is not supported
*Thread Reply:* oh.. interesting - the spark.openlineage.facets.disabled option gives me output when eventType is START: "eventType": "START", "outputs": [ ... columns ... ]
*Thread Reply:* Yes
"spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]"
<- this option gives output when eventType is START, but it does not give the output (bunches of columns) when that config is not set
*Thread Reply:* this option prevents logicalPlan
being serialized and sent as a part of Openlineage event which included in one of the facets
*Thread Reply:* possibly, serializing logicalPlans, in case of pivots, leads to size of the events that are not acceptable
*Thread Reply:* Ah... so you mean pivot makes serializing the logical plan unavailable for generating the event because of its size, and disabling the logical plan facet (not serializing it) makes it possible to generate the event, because the logical plan produced by the pivot is not serialized.
Can we overcome this?
*Thread Reply:* we've seen such issues for some plans some time ago
*Thread Reply:* by excluding some properties from plan to be serialized
*Thread Reply:* here https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[âŠ]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java we exclude certain classes
*Thread Reply:* AH... so excluded properties cause the logical plans of pivoting to be ignored
*Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[âŠ]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java
then you can try to debug logical plan trying to find out what should be excluded from it when it's being serialized. Even, if you find this difficult, a failing integration test is super helpful to let others help you in that.
*Thread Reply:* okay, i will look into it and maybe open a PR - thanks
*Thread Reply:* Can I ask if there are any suspicious properties?
*Thread Reply:* sure
*Thread Reply:* Thanks I would also try to find the property too
Hi guys, I've a generic sql-parsing doubt... what would be the recommended way (if any) to check for sql similarity? I understand that most sql parsers parse the query into an AST, but are there any well known ways to measure semantic similarities between 2 or more ASTs? Just curious lol... Any ideas appreciated! Thanks!
*Thread Reply:* Hi @Anirudh Shrinivason, I think I would take a look on this https://sqlglot.com/sqlglot/diff.html
*Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one...
eg: SELECT * FROM TABLE_1
and SELECT col1,col2,col3 FROM TABLE_1
could be the same semantic query, but sqlglot would give diffs in the ast because it's textual...
*Thread Reply:* I totally get you. In such cases, without the metadata of TABLE_1, it's impossible. What I would do is replace all *
before you use the diff function.
*Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... But yeah that's probably the approach I'd be taking haha... Happy to discuss and learn if there are better ways of doing this
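For reference, a small sketch of the sqlglot approach discussed above: normalize both queries against a known schema (so that SELECT * gets expanded) and then diff the ASTs. The table, columns, and schema below are made up for illustration:
```python
import sqlglot
from sqlglot.optimizer import optimize
from sqlglot.diff import diff

# hypothetical schema used to expand star projections
schema = {"table_1": {"col1": "int", "col2": "int", "col3": "int"}}

q1 = "SELECT * FROM table_1"
q2 = "SELECT col1, col2, col3 FROM table_1"

# optimize() qualifies columns and expands stars given the schema,
# so semantically equal queries converge to similar trees before diffing.
t1 = optimize(sqlglot.parse_one(q1), schema=schema)
t2 = optimize(sqlglot.parse_one(q2), schema=schema)

edit_script = diff(t1, t2)
print(len(edit_script), "edit operations")
```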
dear all, I have some novice questions. I put them in separate messages for clarity. 1st Question: I understand from the examples in the documentation that the main lineage events are RunEvent's, which can contain link to Run ID, Job ID, Dataset ID (I see they are RunEvent because they have EventType, correct?). However, the main openlineage json object contains also JobEvent and DatasetEvent. When are JobEvent and DatasetEvent supposed to be used in the workflow? Do you have relevant examples? thanks!
*Thread Reply:* Hey @Luigi Scorzato! You can read about these 2 event types in this blog post: https://openlineage.io/blog/static-lineage
*Thread Reply:* we'll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.
*Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?
*Thread Reply:* you're not wrong, these 2 events were not designed for runtime lineage, but rather 'static' lineage that gets emitted after the fact
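As a rough illustration of that "after the fact" use case, a standalone DatasetEvent has no run or job section - just the dataset plus the usual envelope fields. A hedged sketch of posting one over HTTP follows; the producer URI and endpoint are placeholders, and whether a given backend (e.g. Marquez) accepts these events yet is a separate question:
```python
import json
from datetime import datetime, timezone

import requests

# Shape of a standalone DatasetEvent per the 2-0-2 spec: eventTime, producer,
# schemaURL and a dataset - no run/job section.
event = {
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-producer",  # placeholder producer URI
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/DatasetEvent",
    "dataset": {
        "namespace": "my_namespace",
        "name": "my_dataset",
        "facets": {},
    },
}

# assumption: an OpenLineage-compatible endpoint such as Marquez at localhost:5000
requests.post(
    "http://localhost:5000/api/v1/lineage",
    data=json.dumps(event),
    headers={"Content-Type": "application/json"},
)
```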
2nd Question. I see that the input dataset appears in the RunEvent with EventType=START, the output dataset appears in the RunEvent with EventType=COMPLETE only, the RunEvent with EventType=RUNNING has no dataset attached. This makes sense for ETL jobs, but for streaming (e.g. Flink), the run could run very long and never terminate with a COMPLETE. On the other hand, emitting all the info about the output dataset in every RUNNING event would be far too verbose. What is the recommended set up in this case? TLDR: what is the recommended configuration of the frequency and data model of the lineage events for streaming systems like Flink?
*Thread Reply:* great question! did you get a chance to look at the current Flink integration?
*Thread Reply:* to be honest, I only quickly went through this and I did not identify what I needed. Can you please point me to the relevant section?
*Thread Reply:* here's an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json
*Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json
*Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[âŠ]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java
*Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START admits output datasets. Correct? What determines how often the eventType=RUNNING events are emitted?
*Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint
3rd Question: I am looking for information about the time when the next run should start, in case of scheduled jobs. I see that the Run Facet has a Nominal Time Facet, but -- if I understand correctly -- it refers to the current run, so it is always emitted after the fact. Is the Nominal Start Time of the next run available somewhere? If not, where do you recommend to add it as a custom field? In principle, it belongs to the Job object, but would that maybe cause an undesirable fast change in the Job object?
*Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100
For other orchestrators / schedulers, that would depend..
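If your scheduler does not expose this through an existing facet, one option is a custom run facet carrying the next nominal start time. A minimal sketch - the facet class, its field, and the facet key are made up for illustration, not part of the spec:
```python
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class NextRunFacet(BaseFacet):
    # Hypothetical facet: ISO-8601 timestamp of the next scheduled run.
    nominalNextStartTime: str = attr.ib()

# Attach it to the run facets when building the RunEvent, e.g.:
# run = Run(runId=..., facets={"nextRun": NextRunFacet(nominalNextStartTime="2023-08-08T00:00:00Z")})
```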
Hi Team, Question regarding Databricks OpenLineage init script, is the path /mnt/driver-daemon/jars
common to all the clusters? or its unique to each cluster? https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380d[âŠ]da49d0/integration/spark/databricks/open-lineage-init-script.sh
*Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs.
*Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.
Worth mentioning: init-scripts located in dbfs are becoming deprecated next month and we plan moving them into workspaces.
*Thread Reply:* yes, the init scripts are moved at workspace level.
Hi @PaweĆ LeszczyĆski Will really appreciate it if you please let me know once this PR is good to go. Would love to test it in our environment: https://github.com/OpenLineage/OpenLineage/pull/2036. Thank you for all your help.
*Thread Reply:* great to hear. I still need some time as there are a few corner cases. For example: what should be the behaviour when alter table rename is called? But sure, you can test it if you like. ci is failing on integration tests but ./gradlew clean build with unit tests is fine.
*Thread Reply:* @GitHubOpenLineageIssues Feel invited to join today's community meeting and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising the backlog the right way.
Hi Team, I'm doing a POC with open lineage to extract column lineage from Spark. I'm using it on a databricks notebook. I'm facing an issue where I'm trying to get the column lineage in a join involving external tables on s3. The lineage that is being extracted is returning the base path of the table, i.e. the s3 file path, and not the corresponding tables. Is there a way to extract/map columns of the output to the columns of the base tables instead of the storage location?
*Thread Reply:* Query:
INSERT INTO test.merchant_md
(Select
m.`id`,
m.name,
m.activated,
m.parent_id,
md.contact_name,
md.contact_email
FROM
test.merchants_0 m
LEFT JOIN merchant_details md ON m.id = md.merchant_id
WHERE
m.created_date > '2023-08-01')
*Thread Reply:* "columnLineage":{
"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>",
"_schemaURL":"<https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet>",
"fields":{
"merchant_id":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchants",
"field":"id"
}
]
},
"merchant_name":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchants",
"field":"name"
}
]
},
"activated":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchants",
"field":"activated"
}
]
},
"parent_id":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchants",
"field":"parent_id"
}
]
},
"contact_name":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchant_details",
"field":"contact_name"
}
]
},
"contact_email":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchant_details",
"field":"contact_email"
}
]
}
}
},
"symlinks":{
"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>",
"_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
"identifiers":[
{
"namespace":"/warehouse/test.db",
"name":"test.merchant_md",
"type":"TABLE"
}
*Thread Reply:* "contact_name":{
"inputFields":[
{
"namespace":"<s3a://datalake>",
"name":"/test/merchant_details",
"field":"contact_name"
}
]
}
This is returning mapping from the s3 location on which the table is created.
Hey, I'm running a Spark application (spark version 3.4) with OL integration. I changed spark to use 'debug' level, and I see the OL events with the below message: 'Emitting lineage completed successfully:'
With all the above, I can't see the event in Marquez.
Attaching the OL configurations. When changing the OL-spark version to 0.6.+, I do see an event created in Marquez with only 'Start' status (attached below).
Should the OL-spark version match the Spark version? Are there known issues with the Spark / OL versions?
*Thread Reply:* > OL-spark version to 0.6.+ This OL version is ancient. You can try with 1.0.0
I think you're hitting this issue which duplicates jobs: https://github.com/OpenLineage/OpenLineage/issues/1943
*Thread Reply:* I haven't mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ ... None of them worked for me. @Maciej Obuchowski
*Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?
*Thread Reply:* If you can, it might be better to create issue at github and communicate there.
*Thread Reply:* Before creating an issue in GIT, I wanted to check if my issue only related to versions compatibility..
This is the sample of my test: ```from pyspark.sql import SparkSession from pyspark.sql.functions import col
spark = SparkSession.builder\ .config('spark.jars.packages', 'io.openlineage:openlineage_spark:1.0.0') \ .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') \ .config('spark.openlineage.host', 'http://localhost:9000') \ .config('spark.openlineage.namespace', 'default') \ .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")
df = spark.read.format("csv").option("header","true").option("sep","^").load(csv_file)
df = df.select("campaignid","revenue").groupby("campaignid").sum("revenue").show()``` Part of the logs with the OL configurations and the processed event
*Thread Reply:* try spark.openlineage.transport.url
instead of spark.openlineage.host
*Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host
đ
*Thread Reply:* https://openlineage.io/blog/openlineage-spark/
*Thread Reply:* changing to 'spark.openlineage.transport.url' didn't make any change
*Thread Reply:* do you see the ConsoleTransport
log? it suggests Spark integration did not register that you want to send events to Marquez
*Thread Reply:* let's try adding spark.openlineage.transport.type
to http
*Thread Reply:* Cool đ however it should not require it if you provide spark.openlineage.transport.url
- I'll create issue for debugging that.
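For completeness, the combination that this thread converges on looks roughly like the following; adjust the host/port to wherever the Marquez API actually listens (the URL here is a placeholder, and the UI port 3000 is not the right target):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.0.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # Marquez API, not the UI
    .config("spark.openlineage.namespace", "default")
    .getOrCreate()
)
```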
@channel This month's TSC meeting is tomorrow! All are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1691422200847979
While using the spark integration, we're unable to see the query in the job facet for any spark-submit - is this a known issue/limitation, and can someone point to the code where this is currently extracted / can be enhanced?
*Thread Reply:* Let me first rephrase my understanding of the question: assume a user runs spark.sql('INSERT INTO ...'). Are we able to include the sql query
INSERT INTO ... within the SQL facet?
We once had a look at it and found it difficult. Given an SQL, spark immediately translates it to a logical plan (which our integration is based on) and we didn't find any place where we could inject our code and get access to sql being run.
*Thread Reply:* Got it. So for spark.sql()
- there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?
val df = spark.read.format("jdbc")
.option("url", url)
.option("user", user)
.option("password", password)
.option("fetchsize", fetchsize)
.option("driver", driver)
*Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining results of two queries - possibly from separate systems - and then one SqlJobFacet
would be misleading. This needs more thorough spec discussion
Hi Team, has anyone experience with integrating OpenLineage with the SAP ecosystem? And with Salesforce/MuleSoft?
Hi,
Are there any ways to save list of string directly in the dataset facets? Such as the myfacets field in this dict
"facets": {
"metadata_facet": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/client/python>",
"_schemaURL": "<https://sth/schemas/facets.json#/definitions/SomeFacet>",
"myfacets": ["a", "b", "c"]
}
}
*Thread Reply:* I'm using python OpenLineage package and extend the BaseFacet class
*Thread Reply:* for custom facets, as long as it's valid json - go for it
*Thread Reply:* However I tried to insert a list of string. And I tried to get the dataset, the returned valued of that list field is empty.
*Thread Reply:*
```@attr.s
class MyFacet(BaseFacet):
    columns: list[str] = attr.ib()```
Here's my python code.
*Thread Reply:* How did you emit, serialized the event, and where did you look when you said you tried to get the dataset?
*Thread Reply:* I use the python openlineage client to emit the RunEvent.
```openlineage_client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now().isoformat(),
        run=run,
        job=job,
        producer=PRODUCER,
        outputs=outputs,
    )
)```
And use marquez to visualize the get data result
*Thread Reply:* Yah, a list of objects is working, but a list of strings is not.
*Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()
*Thread Reply:* Yah, the values in the list will end up False in this check and be filtered out:
isinstance(x, dict)
*Thread Reply:* wow, that's right đŹ
*Thread Reply:* want to create PR fixing that?
*Thread Reply:* Sure! May do this later tomorrow.
*Thread Reply:* I created the pr at https://github.com/OpenLineage/OpenLineage/pull/2044 But the ci on integration-test-integration-spark FAILED
*Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on the forked versions of CI. It will work once I push it to origin. Anyway Spark tests failing aren't blocker for this Python PR
*Thread Reply:* I would only ask to add some tests for that case with facets containing list of string
*Thread Reply:* ah, we had another CI problem, the go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway
*Thread Reply:* LOL, I've added some tests and made a force push
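For anyone reading this thread later, a minimal end-to-end sketch of the case being fixed here - a custom facet holding a list of strings attached to an output dataset. Namespace, dataset and facet key are placeholders, and it assumes a client version that includes the serde fix from the PR above:
```python
import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Dataset

@attr.s
class MyFacet(BaseFacet):
    # list of plain strings; previously dropped by Serde.to_json before the fix
    columns: list = attr.ib(factory=list)

output = Dataset(
    namespace="my_namespace",
    name="my_dataset",
    facets={"metadata_facet": MyFacet(columns=["a", "b", "c"])},
)
# `output` can then be passed in the `outputs` list of the RunEvent shown earlier.
```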
*Thread Reply:* @GitHubOpenLineageIssues I am trying to contribute to Integration tests which is listed here as a good first issue. the CONTRIBUTING.md mentions that i can trigger CI for integration tests from a forked branch using this tool, but i am unable to do so. is there a way to trigger CI from a forked branch or do i have to get permission from someone to run the CI?
i am getting this error when i run this command sudo git-push-fork-to-upstream-branch upstream savannavalgi:hacktober
> Username for '<https://github.com>': savannavalgi
> Password for '<https://savannavalgi@github.com>':
> remote: Permission to OpenLineage/OpenLineage.git denied to savannavalgi.
> fatal: unable to access '<https://github.com/OpenLineage/OpenLineage.git/>': The requested URL returned error: 403
i have tried to configure ssh key
also tried to trigger CI from another branch,
and tried all of this after fetching the latest upstream
cc: @Athitya Kumar @Maciej Obuchowski @Steven
*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch
for you
*Thread Reply:* @PaweĆ LeszczyĆski thanks for approving my PR - ( link )
I will make the changes needed for the new integration test case for drop table (good first issue) , in another PR, I would need your help to run the integration tests again, thank you
*Thread Reply:* @PaweĆ LeszczyĆski opened a PR ( link ) for integration test for drop table can you please help run the integration test
*Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so will not work automatically from the fork, and require action on our side. working on that
*Thread Reply:* thanks @PaweĆ LeszczyĆski let me know if i can help in any way
*Thread Reply:* @PaweĆ LeszczyĆski any action item on my side?
*Thread Reply:* @PaweĆ LeszczyĆski can you please take a look at this ? đ
*Thread Reply:* Hi @savan, were you able to run integration tests locally on your side? It seems the generated OL event is missing schema facet
"outputs" : [ {
"namespace" : "file",
"name" : "/tmp/drop_test/drop_table_test",
"facets" : {
"dataSource" : {
"_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
"_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>",
"name" : "file",
"uri" : "file"
},
"symlinks" : {
"_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
"_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
"identifiers" : [ {
"namespace" : "/tmp/drop_test",
"name" : "default.drop_table_test",
"type" : "TABLE"
} ]
},
"lifecycleStateChange" : {
"_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
"_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>",
"lifecycleStateChange" : "DROP"
}
},
"outputFacets" : { }
} ]
which shouldn't be such a big problem I believe. This event intends to notify that the table is dropped, which I believe is still ok without a schema.
*Thread Reply:* @PaweĆ LeszczyĆski i am unable to run integration tests locally, as you mentioned it requires S3/BigQuery secret keys and wont work from a forked branch
*Thread Reply:* you can run this particular test you modify, don't need to run all of them
*Thread Reply:* can you please share any doc which will help me do that. i did go through the readme doc, i was stuck at > you dont have permission to perform this action
*Thread Reply:* ./gradlew :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable
*Thread Reply:* this should run the thing you modify
*Thread Reply:* i am getting this error while building the project. tried a lot of things, any pointers or leads will be helpful? i am using an apple m1 max chip computer. thanks
> ------ Running smoke test ------
> Exception in thread "main" java.lang.UnsatisfiedLinkError: /private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib: dlopen(/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib, 0x0001): tried: '/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (no such file), '/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))
> at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)
*Thread Reply:* the build passes without the smoke tests, but the command you gave is throwing the below error
(base) snavalgi@macos-PD7LVVY6MQ spark % ./gradlew -q :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable
FAILURE: Build failed with an exception.
* Where: Build file '/Users/snavalgi/Documents/GitHub/OpenLineage/integration/spark/app/build.gradle' line: 256

* What went wrong: A problem occurred evaluating project ':app'.

Could not resolve all files for configuration ':app:spark2'.
Could not resolve io.openlineage:openlineage-java:1.9.0-SNAPSHOT.
Required by: project :app > project :shared
Could not resolve io.openlineage:openlineage-java:1.9.0-SNAPSHOT.
Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/1.9.0-SNAPSHOT/maven-metadata.xml.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
Could not resolve io.openlineage:openlineage-sql-java:1.9.0-SNAPSHOT.
Required by: project :app > project :shared
Could not resolve io.openlineage:openlineage-sql-java:1.9.0-SNAPSHOT.
Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-sql-java/1.9.0-SNAPSHOT/maven-metadata.xml.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights. Get more help at https://help.gradle.org.
BUILD FAILED in 10s
*Thread Reply:* updated the correct error message
*Thread Reply:* @savan you need to build openlineage-java
and openlineage-sql-java
libraries as described here: https://github.com/OpenLineage/OpenLineage/blob/73b4a3bcd84239e7baedd22b5294624623d6f3ad/integration/spark/README.md#preparation
*Thread Reply:* @Maciej Obuchowski thanks for the response. the issue was with java-8 architecture i had installed.
i am able to compile, build and run the integration test now , with java 11 ( of appropriate arch)
*Thread Reply:* was able to run some (create table) integration tests successfully, but now the marquez-api container is repeatedly crashing. any pointers?
marquez-api | [Too many errors, abort]
marquez-api | qemu: uncaught target signal 6 (Aborted) - core dumped
marquez-api | /usr/src/app/entrypoint.sh: line 19: 44 Aborted java ${JAVA_OPTS} -jar marquez-*.jar server ${MARQUEZ_CONFIG}
marquez-api exited with code 134
*Thread Reply:* the marquez-api docker image has this warning
AMD64, image may have poor performance or fail, if run via emulation
*Thread Reply:* @Willy Lulciuc I think publishing arm64
image of Marquez would be a good idea
*Thread Reply:* Yeah, supporting multi-architectural docker builds makes sense. Here's an article outlining an approach: https://www.padok.fr/en/blog/multi-architectures-docker-iot#architectures. @Maciej Obuchowski is that what you're suggesting here?
*Thread Reply:* @Maciej Obuchowski @PaweĆ LeszczyĆski i have verified the integration test for dropTestTable on my local. it is working fine. can you please trigger the CI for this PR and expedite the review and merge process? https://github.com/OpenLineage/OpenLineage/pull/2214
*Thread Reply:* the test is still failing in CI -> https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9488/workflows/f669d751-aa18-4735-a51f-7d647415fee8/jobs/181187
io.openlineage.spark.agent.SparkContainerIntegrationTest testDropTable() FAILED (31.2s)
*Thread Reply:* i have made a minor change . can you please trigger the CI again @PaweĆ LeszczyĆski
*Thread Reply:* the test is again passing on my local with latest code. but i notice the below error in the previous CI failure.
the previous CI build was failing because
the actual START event for droptable in the CI had empty input and output.
> "eventType" : "START",
> "inputs" : [ ],
> "outputs" : [ ]
but on my local ,
the START event for droptable has output populated as below.
> {
> "eventType": "START",
> "job": {
> "namespace": "testDropTable"
> },
> "inputs": [],
> "outputs": [
> {
> "namespace": "file",
> "name": "/tmp/drop_test/drop_table_test",
> "facets": {
> "dataSource": {
> "name": "file",
> "uri": "file"
> },
> "symlinks": {
> "identifiers": [
> {
> "namespace": "file:/tmp/drop_test",
> "name": "default.drop_table_test",
> "type": "TABLE"
> }
> ]
> },
> "lifecycleStateChange": {
> "lifecycleStateChange": "DROP"
> }
> }
> }
> ]
> }
>
*Thread Reply:* Please note that CI runs tests against several Spark versions. This can be configured with
-Pspark.version=3.4.2
It's possible that your test is passing for some versions while still failing for others.
*Thread Reply:* if CI is verifying against many Spark versions, does that mean some Spark versions have an empty output: [] and some have a populated output: [] for the same START event of a drop table?
If so, how do we specify different expected START events for those Spark versions? Is that possible?
*Thread Reply:* For the complete event, the assertion with empty inputs and outputs verifies only that a complete event was emitted. It would make sense for start to verify that it contains information about the deleted dataset. If it is missing for a single Spark version, we should first try to understand why this is happening and whether there is any workaround for it.
*Thread Reply:* yes makes sense. can you please approve to run CI for integration test again?
I really wanted to check if this build passes.
*Thread Reply:* and for the Spark versions for which we are getting an empty output [] in the START event for drop table, should I open a new ticket on OpenLineage and report the issue?
*Thread Reply:* @PaweĆ LeszczyĆski @Maciej Obuchowski can you please approve this CI to run integration tests? https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9497/workflows/4a20dc95-d5d1-4ad7-967c-edb6e2538820
*Thread Reply:* @PaweĆ LeszczyĆski only 2 Spark versions are sending empty inputs and outputs for both START and COMPLETE events:
• 3.4.2
• 3.5.0
I can look into the above if you guide me a bit on how to. Should I open a new ticket for it? Please suggest how to proceed.
*Thread Reply:* this integration test case led to finding the above bug for Spark 3.4.2 and 3.5.0. Will that be a blocker to merging this test case? @PaweĆ LeszczyĆski @Maciej Obuchowski
*Thread Reply:* @PaweĆ LeszczyĆski @Maciej Obuchowski any direction on the above blocker will be helpful.
*Thread Reply:* @PaweĆ LeszczyĆski @Maciej Obuchowski we were able to debug the issue, found problems in the logical plan received from Spark core, and have opened an issue on the Spark Jira to track it: https://issues.apache.org/jira/browse/SPARK-48390
I have opened an issue on the OpenLineage GitHub as well: https://github.com/OpenLineage/OpenLineage/issues/2716
cc: @Mayur Madnani
*Thread Reply:* Yeah - looks like they moved to using a different LogicalPlan - DropTable instead of DropTableCommand - but the identifier field should not be empty
*Thread Reply:* the code that handles DropTable does not look buggy: https://github.com/OpenLineage/OpenLineage/blob/a391c53e3374479ed5bf2c3e3ad519b53f[…]o/openlineage/spark3/agent/lifecycle/plan/DropTableVisitor.java
*Thread Reply:* Hi @Maciej Obuchowski @PaweĆ LeszczyĆski,
I hope this message finds you well. I recently noticed that my contributions to PR [#2745] were not attributed to me. Here is the PR I had opened for the integration test cases after a lot of work - PR [#2214] - and as a result of the integration tests I wrote, I was able to figure out the exact issue that was present - issue. Over the past six months, I have invested significant time and effort into this work, and I believe it would be fair to recognize my contributions.
Would it be possible to amend the commit to include me as a co-author? Here's the line that can be added to the commit message:
Co-authored-by: savan navalgi <savan.navalgi@gmail.com>
Thank you for your assistance.
Best regards, savan navalgi
*Thread Reply:* Hi @savan, your investigation into determining the affected Spark versions and providing clear logs to nail the problem was really helpful. I am not sure that amending a commit on the main branch can be done. What if I created a separate PR with a changelog entry mentioning the fix applied and you as co-author? Would this work for you?
*Thread Reply:* @PaweĆ LeszczyĆski yes that will also work. thank you very much.
*Thread Reply:* @PaweĆ LeszczyĆski
I have an internal demo tomorrow where I plan to present my open source contributions. Would it be possible to create the separate PR with the changelog entry by then? This would greatly help me in showcasing my work.
Thank you very much for your assistance.
*Thread Reply:* sure, https://github.com/OpenLineage/OpenLineage/pull/2759
*Thread Reply:* thank you :gratitudethankyou:
Hey folks! 👋
Had a query/observation regarding columnLineage inferred in spark integration - opened this issue for the same. Basically, when we do something like this in our spark-sql:
SELECT t1.c1, t1.c2, t1.c3, t2.c4 FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
The expected column lineage for output table t3 is:
t3.c1 -> Comes from both t1.c1 & t2.c1 (SELECT + JOIN clause)
t3.c2 -> Comes from both t1.c2 & t2.c2 (SELECT + JOIN clause)
t3.c3 -> Comes from t1.c3
t3.c4 -> Comes from t2.c4
However, actual column lineage for output table t3 is:
t3.c1 -> Comes from t1.c1 (Only based on SELECT clause)
t3.c2 -> Comes from t1.c1 (Only based on SELECT clause)
t3.c3 -> Comes from t1.c3
t3.c4 -> Comes from t2.c4
Is this a known issue/behaviour?
*Thread Reply:* Hmm... this is kinda "logical" difference - is column level lineage taken from actual "physical" operations - like in this case, we always take from t1 - or from "logical" where t2 is used only for predicate, yet we still want to indicate it as a source?
*Thread Reply:* I think your interpretation is more useful
*Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo
But is there any technical limitation if we wanna implement this / make an OSS contribution for this (like logical predicate columns not being part of the spark logical plan object that we get in the PlanVisitor or something like that)?
*Thread Reply:* It's probably a bit of work, but can't think it's impossible on parser side - @PaweĆ LeszczyĆski will know better about spark collection
*Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.
*Thread Reply:* Something like an additional flag in inputFields, right?
*Thread Reply:* Yes, this would require some extension to the spec. What do you mean by spark-sql: spark.sql() with some Spark query, or SQL in Spark JDBC?
*Thread Reply:* Sorry, missed your question @PaweĆ LeszczyĆski. By spark-sql, I'm referring to the former: spark.sql() with some spark query
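For illustration, a self-contained sketch of that kind of spark.sql() job; the table names and the local session setup are just placeholders for the scenario discussed above:
```python
# Minimal sketch of the discussed case: t3.c1/t3.c2 also depend on t2.c1/t2.c2
# through the JOIN predicate, which is the "indirect" lineage in question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("join-column-lineage").getOrCreate()

spark.range(10).selectExpr("id AS c1", "id AS c2", "id AS c3").write.saveAsTable("t1")
spark.range(10).selectExpr("id AS c1", "id AS c2", "id AS c4").write.saveAsTable("t2")

spark.sql("""
    CREATE TABLE t3 USING parquet AS
    SELECT t1.c1, t1.c2, t1.c3, t2.c4
    FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
""")
```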
*Thread Reply:* cc @Jens Pfau - you may be also interested in extending column level lineage facet.
*Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!
*Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?
Hey team 👋
Is there a way we can feed the logical plan directly to check the OpenLineage events being built, without actually running a Spark job with OpenLineage configs? Basically interested to see if we can mock a dry-run of a Spark job with OpenLineage by mimicking the logical plan.
cc @Shubh
*Thread Reply:* Not really I think - the integration does not rely purely on the logical plan
*Thread Reply:* At least, not in all cases. For some maybe
*Thread Reply:* We're using a pretty similar approach in our column level lineage tests, where we run some Spark commands and register a custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Further, we run our tests on the captured logical plan.
The difference here, compared to what you're asking about, is that we still have access to the same Spark session.
In many cases, our integration uses the active Spark session to fetch some dataset details. This happens pretty often (like fetching a dataset location) and cannot be taken just from a logical plan.
*Thread Reply:* @PaweĆ LeszczyĆski - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?
For example, I know that we can statically create logical plans
*Thread Reply:* The more we talk, the more I wonder what the purpose of doing so is. Do you want to test OpenLineage coverage, or is there a production scenario where you would like to apply this?
*Thread Reply:* @PaweĆ LeszczyĆski - This is for testing openlineage coverage so that we can be more confident on what're the happy path scenarios and what're the scenarios where it may not work / work partially etc
*Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when the OpenLineage integration tries to access them. If you want to reuse logical plans from your prod environment, you will encounter logical plan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be achieved more easily the way the integration tests are run with mockserver.
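For what it's worth, a lightweight way to eyeball the inputs/outputs the integration would produce for a given plan, without any backend, is to run a small local job with the console transport and read the events from the driver log. This is only a sketch (the package coordinates below are illustrative) and, as noted above, it does not replace the mockserver-based integration tests:
```python
# Sketch: run a small local job with the console transport and read the OpenLineage
# events from the driver log. The package version is illustrative; use the release
# whose coverage you want to check.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("ol-coverage-check")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.1.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)

# Any job whose inputs/outputs you want to inspect; events show up in the driver log.
spark.range(5).write.mode("overwrite").parquet("/tmp/ol_coverage_check")
```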
Hi Team,
Spark & Databricks related question: starting 1st September, Databricks is going to block running init_scripts located in DBFS, which is the way our integration works (https://www.databricks.com/blog/securing-databricks-cluster-init-scripts).
We have two ways of mitigating this in our docs and quickstart: (1) move init_scripts to the workspace, (2) move init_scripts to S3.
Neither of them is perfect. (1) requires creating the init_script file manually through the Databricks UI and copy/pasting its content; I couldn't find a way to load it programmatically. (2) requires the quickstart user to have S3 bucket access.
Would love to hear your opinion on this. Perhaps there's some better way to do it. Thanks.
*Thread Reply:* We're uploading the init scripts to s3 via tf. But yeah ig there are some access permissions that the user needs to have
*Thread Reply:* Hello I am new here and I am asking why do you need an init script ? If it's a spark integration we can just specify --package=io.openlineage...
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having the openlineage jar installed immediately on the classpath because it's required when OpenLineageSparkListener is instantiated. It didn't work without it.
*Thread Reply:* Yes, it happens if you use the --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue in Databricks support.)
But if you use --package io.openlineage... (the package will be downloaded from Maven) it works fine.
*Thread Reply:* I think they don't use the right class loader.
*Thread Reply:* To make sure: are you able to run Openlineage & Spark on Databricks Runtime without init_scripts?
I was doing this a second ago and this ended up with Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1609ed55
Hello, I just downloaded Marquez and I'm trying to send a sample request but I'm getting a 403 (forbidden). Any idea how to find the authentication details?
*Thread Reply:* Ok, nevermind. I figured it out. The port 5000 is reserved in MACOS so I had to start on port 9000 instead.
Hi, I noticed that while capturing lineage for merge into commands, some of the tables/columns are unaccounted for the lineage. Example: ```fdummyfunnelstg = spark.sql("""WITH dummyfunnel AS ( SELECT ** FROM fdummyfunnelone WHERE dateid BETWEEN {startdateid} AND {enddateid}
UNION ALL
SELECT **
FROM f_dummy_funnel_two
WHERE date_id BETWEEN {start_date_id} AND {end_date_id}
UNION ALL
SELECT **
FROM f_dummy_funnel_three
WHERE date_id BETWEEN {start_date_id} AND {end_date_id}
UNION ALL
SELECT **
FROM f_dummy_funnel_four
WHERE date_id BETWEEN {start_date_id} AND {end_date_id}
UNION ALL
SELECT **
FROM f_dummy_funnel_five
WHERE date_id BETWEEN {start_date_id} AND {end_date_id}
)
SELECT DISTINCT
dummy_funnel.customer_id,
dummy_funnel.product,
dummy_funnel.date_id,
dummy_funnel.country_id,
dummy_funnel.city_id,
dummy_funnel.dummy_type_id,
dummy_funnel.num_attempts,
dummy_funnel.num_transactions,
dummy_funnel.gross_merchandise_value,
dummy_funnel.sub_category_id,
dummy_funnel.is_dummy_flag
FROM dummy_funnel
INNER JOIN d_dummy_identity as dummy_identity
ON dummy_identity.id = dummy_funnel.customer_id
WHERE
date_id BETWEEN {start_date_id} AND {end_date_id}""")
spark.sql(f"""
MERGE INTO {table_name}
USING f_dummy_funnel_stg
ON
f_dummy_funnel_stg.customer_id = {table_name}.customer_id
AND f_dummy_funnel_stg.product = {table_name}.product
AND f_dummy_funnel_stg.date_id = {table_name}.date_id
AND f_dummy_funnel_stg.country_id = {table_name}.country_id
AND f_dummy_funnel_stg.city_id = {table_name}.city_id
AND f_dummy_funnel_stg.dummy_type_id = {table_name}.dummy_type_id
AND f_dummy_funnel_stg.sub_category_id = {table_name}.sub_category_id
AND f_dummy_funnel_stg.is_dummy_flag = {table_name}.is_dummy_flag
WHEN MATCHED THEN
UPDATE SET
{table_name}.num_attempts = f_dummy_funnel_stg.num_attempts
, {table_name}.num_transactions = f_dummy_funnel_stg.num_transactions
, {table_name}.gross_merchandise_value = f_dummy_funnel_stg.gross_merchandise_value
WHEN NOT MATCHED
THEN INSERT (
customer_id,
product,
date_id,
country_id,
city_id,
dummy_type_id,
num_attempts,
num_transactions,
gross_merchandise_value,
sub_category_id,
is_dummy_flag
)
VALUES (
f_dummy_funnel_stg.customer_id,
f_dummy_funnel_stg.product,
f_dummy_funnel_stg.date_id,
f_dummy_funnel_stg.country_id,
f_dummy_funnel_stg.city_id,
f_dummy_funnel_stg.dummy_type_id,
f_dummy_funnel_stg.num_attempts,
f_dummy_funnel_stg.num_transactions,
f_dummy_funnel_stg.gross_merchandise_value,
f_dummy_funnel_stg.sub_category_id,
f_dummy_funnel_stg.is_dummy_flag
)
""")
```
In cases like this, I notice that the full lineage is not actually captured... I'd expect to see this having 5 upstreams:
f_dummy_funnel_one, f_dummy_funnel_two, f_dummy_funnel_three, f_dummy_funnel_four, f_dummy_funnel_five, but I notice only 1-2 upstreams for this case...
Would like to learn more about why this might happen, and whether this is expected behaviour or not. Thanks!
*Thread Reply:* Would be useful to see generated event or any logs
*Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?
The smaller SQL to reproduce the problem, the easier it is to find the root cause. Most of the issues are reproducible with just few lines of code.
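As an illustration of that kind of reduction (a sketch with placeholder table names, plain Spark and no MERGE), checking whether a single UNION already loses an upstream narrows the search considerably:
```python
# Sketch of a reduced reproduction: two inputs, one UNION ALL, one output table.
# If the emitted event lists only one of src_a/src_b as an input, the UNION handling
# alone reproduces the missing-upstreams behaviour (no MERGE INTO or Delta needed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("union-lineage-repro").getOrCreate()

spark.range(5).selectExpr("id", "id AS val").write.saveAsTable("src_a")
spark.range(5).selectExpr("id", "id AS val").write.saveAsTable("src_b")

spark.sql("""
    CREATE TABLE union_out USING parquet AS
    SELECT * FROM src_a
    UNION ALL
    SELECT * FROM src_b
""")
```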
*Thread Reply:* Yup, let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence
Hello,
The OpenLineage Databricks integration is not working properly on our side due to the filtering of adaptive_spark_plan events.
Please find the issue link:
https://github.com/OpenLineage/OpenLineage/issues/2058
*Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted! I was wondering if you'd consider opening up a PR? Would love to help you as a contributor if that's something you are interested in.
*Thread Reply:* I deleted the line that has that filter.
*Thread Reply:* But running
./gradlew --no-daemon databricksIntegrationTest -x test -Pspark.version=3.4.0 -PdatabricksHost=$DATABRICKS_HOST -PdatabricksToken=$DATABRICKS_TOKEN
*Thread Reply:* gives me
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
> Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.
Required by:
project :app > project :shared
> Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.
> Unable to load Maven meta-data from <https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/1.1.0-SNAPSHOT/maven-metadata.xml>.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.
> Could not resolve io.openlineage:openlineage-sql-java:1.1.0-SNAPSHOT.
Required by:
project :app > project :shared
> Could not resolve io.openlineage:openlineage-sql-java:1.1.0-SNAPSHOT.
> Unable to load Maven meta-data from <https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-sql-java/1.1.0-SNAPSHOT/maven-metadata.xml>.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.
*Thread Reply:* And I am trying to understand what should I do.
*Thread Reply:* Please do ./gradlew publishToMavenLocal in the client/java directory
*Thread Reply:* And I had some issues where -PdatabricksHost doesn't work with System.getProperty("databricksHost"), so I changed to -DdatabricksHost with System.getenv("databricksHost")
*Thread Reply:* Then I had an issue where the path dbfs:/databricks/openlineage/ doesn't exist, so I then created the folder /dbfs/databricks/openlineage/
*Thread Reply:* And now I am investigating this issue :
java.lang.NullPointerException
at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)
at io.openlineage.spark.agent.DatabricksUtils.init(DatabricksUtils.java:66)
at io.openlineage.spark.agent.DatabricksIntegrationTest.setup(DatabricksIntegrationTest.java:54)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at ...
worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id
at app//com.databricks.sdk.core.error.ApiErrors.readErrorFromResponse(ApiErrors.java:48)
at app//com.databricks.sdk.core.error.ApiErrors.checkForRetry(ApiErrors.java:22)
at app//com.databricks.sdk.core.ApiClient.executeInner(ApiClient.java:236)
at app//com.databricks.sdk.core.ApiClient.getResponse(ApiClient.java:197)
at app//com.databricks.sdk.core.ApiClient.execute(ApiClient.java:187)
at app//com.databricks.sdk.core.ApiClient.POST(ApiClient.java:149)
at app//com.databricks.sdk.service.compute.ClustersImpl.delete(ClustersImpl.java:31)
at app//com.databricks.sdk.service.compute.ClustersAPI.delete(ClustersAPI.java:191)
at app//com.databricks.sdk.service.compute.ClustersAPI.delete(ClustersAPI.java:180)
at app//io.openlineage.spark.agent.DatabricksUtils.shutdown(DatabricksUtils.java:96)
at app//io.openlineage.spark.agent.DatabricksIntegrationTest.shutdown(DatabricksIntegrationTest.java:65)
at
...
*Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id
*Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)
*Thread Reply:* I did this !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar
*Thread Reply:* To create some fake file that can be deleted in the uploadOpenlineageJar function.
*Thread Reply:* Because if there is no file, this part fails
StreamSupport.stream(
workspace.dbfs().list("dbfs:/databricks/openlineage/").spliterator(), false)
.filter(f -> f.getPath().contains("openlineage-spark"))
.filter(f -> f.getPath().endsWith(".jar"))
.forEach(f -> workspace.dbfs().delete(f.getPath()));
*Thread Reply:* does this work after !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar ?
*Thread Reply:* I am now having another error in the driver
23/08/22 22:56:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:3121)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:835)
at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$initializeSharedDriverContext$1(DatabricksILoop.scala:362)
...
at com.databricks.DatabricksMain.main(DatabricksMain.scala:146)
at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala)
Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@298cfe89
at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:263)
*Thread Reply:* due to the findAny:
private static void uploadOpenlineageJar(WorkspaceClient workspace) {
Path jarFile =
Files.list(Paths.get("../build/libs/"))
.filter(p -> p.getFileName().toString().startsWith("openlineage-spark-"))
.filter(p -> p.getFileName().toString().endsWith("jar"))
.findAny()
.orElseThrow(() -> new RuntimeException("openlineage-spark jar not found"));
*Thread Reply:* The PR: https://github.com/OpenLineage/OpenLineage/pull/2061
*Thread Reply:* thanks for the pr đ
*Thread Reply:* code formatting checks complain now
*Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?
*Thread Reply:* @Abdallah you're using newer version of Java than 8, right?
*Thread Reply:* AFAIK googleJavaFormat
behaves differently between Java versions
*Thread Reply:* Okay I will switch back to another java version
*Thread Reply:* terra@MacBook-Pro-M3 spark % java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
*Thread Reply:* Can you tell me which java version should I use ?
*Thread Reply:* > Hello, I have ERROR: Missing environment variable {i} @mobuchowski Can you please check what it comes from? (edited)
Yup, for now I have to manually make our CI account pick your changes up if you make a PR from a fork. Just did that
*Thread Reply:* running here now: https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/7881/workflows/90793f46-796a-4f59-9de3-5d58cbcbf162
*Thread Reply:* @Abdallah merged đ
@channel
Meetup notice: on Monday, 9/18, at 5:00 pm ET OpenLineage will be gathering in Toronto at Airflow Summit. Coming to the summit? Based in or near Toronto? Please join us to discuss topics such as:
• recent developments in the project including the addition of static lineage support and the OpenLineage Airflow Provider,
• the project's history and architecture,
• opportunities to contribute,
• resources for getting started,
• + more.
Please visit
i saw OpenLineage was built into Airflow recently as a provider but the documentation seems really light (https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html), is the documentation from openlineage the correct way I should proceed?
https://openlineage.io/docs/integrations/airflow/usage
*Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you'd be staying on the "old" integration. apache-airflow-providers-openlineage is the new package, maintained in the Airflow project, that can be used starting Airflow 2.7 and is the recommended package moving forward. It is compatible with the configuration of the old package described in that usage page. CC: @Maciej Obuchowski @Jakub DardziĆski It looks like this page needs improvement.
*Thread Reply:* Yeah, I'll fix that
*Thread Reply:* https://github.com/apache/airflow/pull/33610
fyi
do I label certain raw data sources as a dataset, for example SFTP/FTP sites, 0365 emails, etc? I extract that data into a bucket for the client in a "folder" called "raw" which I know will be an OL Dataset. Would this GCS folder (after extracting the data with Airflow) be the first Dataset OL is aware of?
<gcs://client-bucket/source-system-lob/raw>
I then process that data into partitioned parquet datasets which would also be OL Datasets:
<gcs://client-bucket/source-system-lob/staging>
<gcs://client-bucket/source-system-lob/analytics>
*Thread Reply:* that really depends on the use case IMHO. If you consider a whole directory/folder as a dataset (meaning that each file inside folds into a larger whole), you should label the dataset as the directory.
You might as well have a directory where each file is something different - in that case it would be best to set each file separately as a dataset
*Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936
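For reference, attaching a symlink identifier from the Python client looks roughly like the sketch below; the facet class names come from openlineage-python and may differ between client versions, and the bucket, path and logical name are placeholders:
```python
# Sketch: a directory-level dataset with an alternative name attached via the
# symlinks facet. Class names are from openlineage-python and may differ by version.
from openlineage.client.facet import SymlinksDatasetFacet, SymlinksDatasetFacetIdentifiers
from openlineage.client.run import Dataset

raw_folder = Dataset(
    namespace="gs://client-bucket",
    name="source-system-lob/raw",
    facets={
        "symlinks": SymlinksDatasetFacet(
            identifiers=[
                SymlinksDatasetFacetIdentifiers(
                    namespace="gs://client-bucket",
                    name="client_x.raw_daily_dump",  # hypothetical logical/table-style name
                    type="TABLE",
                )
            ]
        )
    },
)
```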
*Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files
*Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside
*Thread Reply:* I tested a dataset for each raw file versus the folder and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)
from 2022, this particular source had 6 raw schema changes (client controlled, no warning). what should I do to make that as obvious as possible if I track the dataset at a folder level?
*Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset
*Thread Reply:* not sure what the best practice would be in this scenario though
• also saw the docs reference URI = gs://{bucket name}{path} and I wondered if the path would include the filename, or if it was just the base path like I showed above
Has anyone managed to get the OL Airflow integration to work on AWS MWAA? We've tried pretty much every trick but still ended up with the following error:
Broken plugin: [openlineage.airflow.plugin] No module named 'openlineage.airflow'; 'openlineage' is not a package
*Thread Reply:* Which version are you trying to use?
*Thread Reply:* Both OL and MWAA/Airflow đ
*Thread Reply:* 'openlineage' is not a package
suggests that something went wrong with import process, for example cycle in import path
*Thread Reply:* MWAA: 2.6.3 OL: 1.0.0
I can see from the log that OL has been successfully installed to the webserver:
Successfully installed openlineage-airflow-1.0.0 openlineage-integration-common-1.0.0 openlineage-python-1.0.0 openlineage-sql-1.0.0
This is the full stacktrace:
```
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/plugins_manager.py", line 229, in load_entrypoint_plugins
    plugin_class = entry_point.load()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 209, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1001, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'openlineage.airflow'; 'openlineage' is not a package
```
*Thread Reply:* It's taking long to update the MWAA environment, but I tested the 2.6.3 version with the following requirements.txt:
openlineage-airflow
and
openlineage-airflow==1.0.0
is there any step that might lead to some unexpected results?
*Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.
*Thread Reply:* The thing is that I don't see any error messages. I wrote a simple DAG to test too:
```
from __future__ import annotations

from datetime import datetime
from airflow.models import DAG

try:
    from airflow.operators.empty import EmptyOperator
except ModuleNotFoundError:
    from airflow.operators.dummy import DummyOperator as EmptyOperator  # type: ignore

from openlineage.airflow.adapter import OpenLineageAdapter
from openlineage.client.client import OpenLineageClient

from airflow.operators.python import PythonOperator

DAG_ID = "example_ol"


def callable():
    client = OpenLineageClient()
    adapter = OpenLineageAdapter()
    print(client, adapter)


with DAG(
    dag_id=DAG_ID,
    start_date=datetime(2021, 1, 1),
    schedule="@once",
    catchup=False,
) as dag:
    begin = EmptyOperator(task_id="begin")

    test = PythonOperator(task_id="print_client", python_callable=callable)
```
and it gives expected results as well
*Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 🙂
*Thread Reply:* ohh, I see you probably followed this guide: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/?
*Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?
*Thread Reply:* tbh I donât know
*Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?
*Thread Reply:* I think it's still a plugin that sets env vars
*Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.
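For reference, the env-var plugin pattern mentioned above looks roughly like this; the values are placeholders (in practice they would come from Secrets Manager or another secret store), and the file name is arbitrary:
```python
# plugins/openlineage_env_plugin.py - sketch of an MWAA plugin that sets the
# OpenLineage env vars read by the openlineage-airflow integration.
import os

from airflow.plugins_manager import AirflowPlugin

os.environ["OPENLINEAGE_URL"] = "https://your-openlineage-endpoint.example.com"
os.environ["OPENLINEAGE_API_KEY"] = "your-api-key"
os.environ["OPENLINEAGE_NAMESPACE"] = "mwaa"


class OpenLineageEnvPlugin(AirflowPlugin):
    name = "openlineage_env_plugin"
```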
*Thread Reply:* Alas after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind to share a screenshot of your cluster's settings so I can compare?
*Thread Reply:* Are you maybe importing some top level OpenLineage code anywhere? This error is most likely circular import
*Thread Reply:* Let me try removing all the dags to see if it helps.
*Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.
*Thread Reply:* Could this be the issue?
from airflow.lineage.entities import File, Table
How could I declare lineage manually if I can't import these classes?
*Thread Reply:* @Mars Lan (Metaphor) I'll look in more details next week, as I'm in transit now
*Thread Reply:* but if you could narrow down a problem to single dag that I or @Jakub DardziĆski could reproduce, ideally locally, it would help a lot
*Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.
Hi All, I want to capture source and target table details as lineage information with OpenLineage for Amazon Redshift. Please let me know if anyone has done it
*Thread Reply:* are you using Airflow to connect to Redshift?
*Thread Reply:* Hi @Jakub DardziĆski, thank you for your reply. No, we are not using Airflow. We are using load/unload commands with PySpark and also Pandas with a JDBC connection
*Thread Reply:* @PaweĆ LeszczyĆski might know the answer to whether the Spark<->OL integration works with Redshift. Eventually JDBC is supported with sqlparser; for Pandas I think there wasn't too much work done
*Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.
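For context, this is the shape of a Spark JDBC read that goes through the SQL parser; the connection details and table names below are placeholders, and the Redshift/Postgres JDBC driver is assumed to be on the classpath:
```python
# Sketch: a Spark JDBC read against Redshift. The SQL in the "query" option is what
# the sqlparser-based path extracts input tables from.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-jdbc-lineage-example").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster.example.com:5439/dev")
    .option("query", "SELECT order_id, amount FROM sales.orders")
    .option("user", "example_user")
    .option("password", "example_password")
    .load()
)

orders.write.mode("overwrite").parquet("s3a://example-bucket/lineage-demo/orders/")
```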
*Thread Reply:* Hi @Jakub DardziĆski / @PaweĆ LeszczyĆski, thank you for taking out time to reply on my query. We need to capture only load and unload query lineage which we are running using Spark.
If you have any sample implementation for reference, it will be indeed helpful
*Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8
*Thread Reply:* Yeah! Any way you can think of, we can accommodate it, especially the load and unload statements. Also, we would like to capture lineage information where our endpoints are SageMaker and Redis
*Thread Reply:* @PaweĆ LeszczyĆski can we use this code base integration/common/openlineage/common/provider/redshift_data.py for redshift lineage capture
*Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser
*Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py
Approve a new release please:
• Fix spark integration filtering Databricks events.
*Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.
*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.
Hey folks! Do we have clear step-by-step documentation on how we can leverage the ServiceLoader-based approach for injecting specific OpenLineage customisations, e.g. tweaking the transport type with defaults / tweaking column level lineage?
*Thread Reply:* For a custom transport, you have to provide an implementation of the interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in a META-INF file
*Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo
*Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in META_INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?
*Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with the fully qualified class names of your custom TransportBuilders there - like openlineage-spark has
io.openlineage.client.transports.HttpTransportBuilder
io.openlineage.client.transports.KafkaTransportBuilder
io.openlineage.client.transports.ConsoleTransportBuilder
io.openlineage.client.transports.FileTransportBuilder
io.openlineage.client.transports.KinesisTransportBuilder
*Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? https://github.com/OpenLineage/OpenLineage/issues/2007#issuecomment-1690350630
*Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit
@channel
We released OpenLineage 1.1.0, including:
Additions:
• Flink: create Openlineage configuration based on Flink configuration #2033 @pawel-big-lebowski
• Java: add Javadocs to the Java client #2004 @julienledem
• Spark: append output dataset name to a job name #2036 @pawel-big-lebowski
• Spark: support Spark 3.4.1 #2057 @pawel-big-lebowski
Fixes:
• Flink: fix a bug when getting schema for KafkaSink #2042 @pentium3
• Spark: fix ignored event adaptive_spark_plan in Databricks #2061 @algorithmy1
Plus additional bug fixes, doc changes and more.
Thanks to all the contributors, especially new contributors @pentium3 and @Abdallah!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.1.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.0.0...1.1.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
@channel
Friendly reminder: our next in-person meetup is next Wednesday, August 30th in San Francisco at Astronomer's offices in the Financial District. You can sign up and find the details on the
hi OpenLineage team, we would like to join one of your meetups (me, @Madhav Kakumani and @Phil Rolph) and we're wondering if you are hosting any meetups after the 18/9? We are trying to join this one but air tickets are quite expensive
*Thread Reply:* there will certainly be more meetups, don't worry about that!
*Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.
*Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!
*Thread Reply:* This is awesome, thanks @George Polychronopoulos. I'll start a channel and invite you
hi folks, I'm looking into exporting static metadata, and found that DatasetEvent requires an eventTime, which in my mind doesn't make sense for static events. I'm setting it to None and the Python client seems to work, but wanted to ask if I'm missing something.
*Thread Reply:* Although you emit a DatasetEvent, you still emit an event, and eventTime is a valid marker.
*Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?
*Thread Reply:* yes, that should be it
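Putting that together, emitting a static DatasetEvent with the emission time as eventTime looks roughly like the sketch below; import paths and constructor arguments may differ slightly between openlineage-python versions, and the URL and names are placeholders:
```python
# Sketch: emit a static DatasetEvent, using the emission time as eventTime.
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, DatasetEvent

client = OpenLineageClient(url="http://localhost:5000")

event = DatasetEvent(
    eventTime=datetime.now(timezone.utc).isoformat(),
    producer="https://example.com/static-metadata-exporter",
    dataset=Dataset(namespace="example_namespace", name="example_dataset"),
)
client.emit(event)
```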
and something else: I understand that Marquez does not yet support the 2.0 spec, hence it's incompatible with static metadata, right? I tried to emit a list of DatasetEvents and got HTTPError: 422 Client Error: Unprocessable Entity for url: <http://localhost:3000/api/v1/lineage> (I'm using a FileTransport for now)
*Thread Reply:* marquez is not capable of reflecting DatasetEvents in the DB but it should respond with Unsupported event type
*Thread Reply:* and return 200 instead of 201 created
*Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @PaweĆ LeszczyĆski
Hi folks. I have some pure golang jobs from which I need to emit OL events to Marquez. Is the right way to go about this to generate a Golang client from the Marquez OpenAPI spec and use that client from my go jobs?
*Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)
*Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!
*Thread Reply:* Correct! You're also very welcome to contribute a Golang client (currently we have Python & Java clients) if you manage to send events using Golang
@channel
The agenda for the
New on the OpenLineage blog: a close look at the new OpenLineage Airflow Provider, including:
• the critical improvements it brings to the integration
• the high-level design
• implementation details
• an example operator
• planned enhancements
• a list of supported operators
• more.
The post, by @Maciej Obuchowski, @Julien Le Dem and myself, is live now on the OpenLineage blog.
Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screen shots.
*Thread Reply:* is port 5000 taken by any other application? or does ./docker/up.sh have any errors in its logs?
*Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?
hi folks, for now I'm producing .jsonl (or .ndjson) files with one event per line, do you know if there's any way to validate those? would standard JSON Schema tools work?
*Thread Reply:* reply by @Julian LaNeve: yes
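For example, a per-line check with the jsonschema package could look like the sketch below; it assumes the spec schema has been downloaded to a local file first (e.g. from https://openlineage.io/spec/2-0-2/OpenLineage.json, pick the version you target) and validates each line independently:
```python
# Sketch: validate each line of a .jsonl/.ndjson file against the OpenLineage spec.
import json

from jsonschema import validate

with open("OpenLineage.json") as schema_file:
    schema = json.load(schema_file)

with open("events.ndjson") as events_file:
    for line_number, line in enumerate(events_file, start=1):
        if not line.strip():
            continue  # skip blank lines
        validate(instance=json.loads(line), schema=schema)
        print(f"line {line_number}: ok")
```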
for namespaces, if my data is moving between sources (SFTP -> GCS -> Azure Blob (synapse connects to parquet datasets) then should my namespace be based on the client I am working with? my current namespace has been
*Thread Reply:* > then should my namespace be based on the client I am working with? I think each of those sources should be a different namespace?
*Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization
*Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?
*Thread Reply:* Yes, unless you name it the same as one of the Airflow facets đ
Hi, Will really appreciate if someone can guide me or provide me any pointer - if they have been able to implement authentication/authorization for access to Marquez. Have not seen much info around it. Any pointers greatly appreciated. Thanks in advance.
*Thread Reply:* I've seen people do this through the ingress controller in Kubernetes. Unfortunately I don't have documentation besides the k8s-specific ones you would find for the ingress controller you're using. You'd redirect any unauthenticated request to your identity provider
@channel Friendly reminder: there's a meetup tonight at Astronomer's offices in SF!
*Thread Reply:* I'll be there and looking forward to seeing @John Lukenoff's presentation
Can anyone let 3 people stuck downstairs into the 7th floor?
hello everyone, I can run OpenLineage Spark code in my notebook with Python, but when I use IDEA to execute Scala code like this:
```
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}
import sun.java2d.marlin.MarlinUtils.logInfo

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("test")
      .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.12.0")
      .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
      .config("spark.openlineage.transport.type", "console")
      .getOrCreate()

    spark.sparkContext.setLogLevel("INFO")
    //spark.sparkContext.addSparkListener(new MySparkAppListener)
    import spark.implicits._
    val input = Seq((1, "zs", 2020), (2, "ls", 2023)).toDF("id", "name", "year")

    input.select("id", "name").orderBy("id").show()
  }
}
```
there is something wrong: Exception in thread "spark-listener-group-shared" java.lang.NoSuchMethodError: io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml(Ljava/io/InputStream;)Lio/openlineage/client/OpenLineageYaml; at io.openlineage.spark.agent.ArgumentParser.extractOpenlineageConfFromSparkConf(ArgumentParser.java:114) at io.openlineage.spark.agent.ArgumentParser.parse(ArgumentParser.java:78) at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:277) at io.openlineage.spark.agent.OpenLineageSparkListener.onApplicationStart(OpenLineageSparkListener.java:267) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:55) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
I want to know how I can set up the IDEA Scala environment correctly
*Thread Reply:* io.openlineage:openlineage-spark:0.12.0 -> could you repeat the steps with a newer version?
ok, it's my first time using this lineage tool. First, I added dependencies in my pom.xml like this:
```
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-java</artifactId>
    <version>0.12.0</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark</artifactId>
    <version>0.30.1</version>
</dependency>
```
my spark version is 3.3.1 and the version can not change
second, in the folder OpenLineage/integration/spark I entered the command docker-compose up and followed the steps in this doc: https://openlineage.io/docs/integrations/spark/quickstart_local. There is no error when I use the notebook to execute PySpark for OpenLineage and I could get the JSON messages. But after I enter "docker-compose up" and try to use IDEA to execute Scala code like the above, the error above happens. It seems that I have not configured the environment correctly, so how can I fix the problem?
*Thread Reply:* please use the latest io.openlineage:openlineage-spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.
Will the August meeting be put up at https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting soon? (usually it's up in a few days)
*Thread Reply:* @Michael Robinson
*Thread Reply:* The recording is on the youtube channel here. I'll update the wiki ASAP
It sounds like there have been a few announcements at Google Next: https://cloud.google.com/data-catalog/docs/how-to/open-lineage https://cloud.google.com/dataproc/docs/guides/lineage
*Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s
@channel The latest issue of OpenLineage News is out now! Please subscribe to get it directly in your inbox each month.
Hi guys, I'd like to capture the spark.databricks.clusterUsageTags.clusterAllTags property from Databricks. However, the value of this is a list of keys, and therefore cannot be supported by the custom environment facet builder.
I was thinking that capturing this property might be useful for most databricks workloads, and whether it might make sense to auto-capture it along with other databricks variables, similar to how we capture mount points for the databricks jobs.
Does this sound okay? If so, then I can help to contribute this functionality
*Thread Reply:* Sounds good to me
*Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099
Also, another small clarification is that when using MergeIntoCommand, I'm receiving the lineage events on the backend, but I cannot seem to find any logging of the payload when I enable debug mode in openlineage. I remember there was a similar issue reported by another user in the past. May I check if it might be possible to help with this? It's making debugging quite hard for these cases. Thanks!
*Thread Reply:* I think it only depends on log4j configuration
*Thread Reply:*
```
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.logger.io.openlineage.spark=DEBUG
```
this is what we have in log4j.properties in the test environment and it works
*Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs
*Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.
The possible problem could be filtering delta events (which we do bcz of delta being noisy)
*Thread Reply:* Recently, we've closed https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for createOrReplaceTempView
*Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files
*Thread Reply:* Hmm I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc. because its noisy right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that its not being logged anymore... I'm using 0.23.0 openlineage version.
*Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests
it looks like my dynamic task mapping in Airflow has the same run ID in marquez, so even if I am processing 100 files, there is only one version of the data. is there a way to have a separate version of each dynamic task so I can track the filename etc?
*Thread Reply:* map_index should indeed be included when calculating the run ID (it's deterministic in the Airflow integration)
what version of Airflow are you using btw?
*Thread Reply:* 2.7.0
I do see this error log in all of my dynamic tasks which might explain it:
[2023-09-05, 00:31:57 UTC] {manager.py:200} ERROR - Extractor returns non-valid metadata: None
[2023-09-05, 00:31:57 UTC] {utils.py:401} ERROR - cannot import name 'get_operator_class' from 'airflow.providers.openlineage.utils' (/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/__init__.py)
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/utils.py", line 399, in wrapper
    return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/plugins/listener.py", line 93, in on_running
    **get_custom_facets(task_instance),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/utils.py", line 148, in get_custom_facets
custom_facets["airflow_mappedTask"] = AirflowMappedTaskRunFacet.from_task_instance(task_instance)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/plugins/facets.py", line 36, in from_task_instance
from airflow.providers.openlineage.utils import get_operator_class
ImportError: cannot import name 'get_operator_class' from 'airflow.providers.openlineage.utils' (/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/__init__.py)
*Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example
*Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?
*Thread Reply:* seems like some circular import
*Thread Reply:* I just tested it manually, itâs a bug in OL provider. let me fix that
*Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there
*Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?
*Thread Reply:* generally speaking anything related to OL-Airflow should be placed to Airflow repo, important changes/bug fixes would be implemented in OL repo as well
*Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?
i was going to try to install from Airflow main branch but didnt want to mess anything up
*Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being
*Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages building the provider package on your own could be best idea probably? that depends on how you manage your Airflow instance
*Thread Reply:* there's 1.1.0rc1 btw
*Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha
*Thread Reply:* The dynamic task mapping error is gone, I did run into this:
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/extractors/base.py", line 70, in disabled_operators
    operator.strip() for operator in conf.get("openlineage", "disabled_for_operators").split(";")
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/configuration.py", line 1065, in get
    raise AirflowConfigException(f"section/key [{section}/{key}] not found in config")
I am redeploying now with that option added to my config. I guess it did not use the default which should be ""
*Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)
openlineage: disabledforoperators: "" ...
this is so much better and makes a lot more sense. most of my tasks are dynamic so I was missing a lot of metadata before the fix, thanks!
Hello Everyone,
I've been diving into the Marquez codebase and found a performance bottleneck in JobDao.java for the query related to namespaceName=MyNameSpace limit=10, and 12s with limit=25. I managed to optimize it using CTEs, and the execution times dropped dramatically to 300ms (for limit=100) and under 100ms (for limit=25) on the same cluster.
Issue link : https://github.com/MarquezProject/marquez/issues/2608
I believe there's even more room for optimization, especially if we adjust the job_facets_view to include the namespace_name column.
Would the team be open to a PR where I share the optimized query and discuss potential further refinements? I believe these changes could significantly enhance the Marquez web UI experience.
PR link : https://github.com/MarquezProject/marquez/pull/2609
Looking forward to your feedback.
*Thread Reply:* @Willy Lulciuc wdyt?
Has there been any conversation on the extensibility of facets/concepts? E.g.:
• how does one extend the list of run states https://openlineage.io/docs/spec/run-cycle to add a paused/resumed state?
• how does one extend https://openlineage.io/docs/spec/facets/run-facets/nominal_time to add a created-at field?
*Thread Reply:* Hello Bernat,
The primary mechanism to extend the model is through facets. You can either:
• create new standard facets in the spec: https://github.com/OpenLineage/OpenLineage/tree/main/spec/facets
• create custom facets defined somewhere else, with a prefix in their name: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
• update existing facets with a backward-compatible change (for example, adding an optional field).
The core spec can also be modified; here is an example of adding a state. That being said, I think more granular states like pause/resume are probably better suited to a run facet. There was an issue opened for that particular one a while ago: https://github.com/OpenLineage/OpenLineage/issues/9 - maybe that particular discussion can continue there.
For the nominal time facet, you could open an issue describing the use case and, on community agreement, follow up with a PR on the facet itself: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/NominalTimeRunFacet.json (adding an optional field is backwards compatible)
*Thread Reply:* I see, so in general one is best copying a standard facet and maintaining it under a different name. That way it can be made mandatory, and one does not need to be blocked for a long time until there's community agreement
*Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. If the custom facet is very specific to a third party project/product then it makes sense for it to stay a custom facet. If it is more generic then it makes sense to add it to the core facets as part of the spec. Hopefully community agreement can be achieved relatively quickly. Unless someone is strongly against something, it can be added without too much red tape. Typically with support in at least one of the integrations to validate the model.
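For illustration, a minimal hedged sketch of a custom run facet with the Python client. The class and field names below are made up; only the BaseFacet base class and the prefixed-naming convention come from the client/spec, and the serialized facet key should carry a producer-specific prefix (e.g. something like "myCompany_pauseResume"):
```
# Hedged sketch only: a custom run facet defined with the attrs-based BaseFacet
# from openlineage-python. Fields are illustrative, not part of the spec.
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class PauseResumeRunFacet(BaseFacet):
    state: str = attr.ib()      # e.g. "PAUSED" or "RESUMED" (hypothetical values)
    stateTime: str = attr.ib()  # ISO-8601 timestamp
```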
@channel This month's TSC meeting is next Thursday the 14th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• demo: Spark integration tests in Databricks runtime
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.
@channel The first Toronto OpenLineage Meetup, featuring a presentation by recent adopter Metaphor, is just one week away. On the agenda:
I'm seeing some odd behavior with my http transport when upgrading airflow/openlineage-airflow from 2.3.2 -> 2.6.3 and 0.24.0 -> 0.28.0. Previously I had a config like this that let me provide my own auth tokens. However, after upgrading I'm getting a 401 from the endpoint, and further debugging seems to reveal that we're not using the token provided in my TokenProvider. Does anyone know if something changed between these versions that could be causing this? (more details in 🧵)
transport:
type: http
url: <https://my.fake-marquez-endpoint.com>
auth:
type: some.fully.qualified.classpath
*Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55
*Thread Reply:* However, if I log the token_provider here, I get the base TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154
*Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57
*Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100
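For context, a hedged sketch of what the custom token provider referenced by the `auth.type: some.fully.qualified.classpath` setting above might look like. It assumes the Python client's TokenProvider interface exposes a get_bearer hook that the HTTP transport calls when building the Authorization header; the class name and config key are illustrative:
```
# Hedged sketch only; not the author's actual implementation.
from openlineage.client.transport.http import TokenProvider


class MyTokenProvider(TokenProvider):  # hypothetical class name
    def __init__(self, config: dict):
        super().__init__(config)
        self.token = config.get("token")  # hypothetical config key

    def get_bearer(self):
        # returned value is sent as the HTTP Authorization header
        return f"Bearer {self.token}"
```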
This particular code in docker-compose exits with code 1 because it is unable to find the wait-for-it.sh file in the container. I have checked the mounting path from the local machine - it is correct - and the path in the container for Marquez is also correct, i.e. /usr/src/app, but it is unable to mount wait-for-it.sh. Does anyone know why this is? This code exists in the OpenLineage repository as well: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/docker-compose.yml
# Marquez as an OpenLineage Client
api:
image: marquezproject/marquez
container_name: marquez-api
ports:
- "5000:5000"
- "5001:5001"
volumes:
- ./docker/wait-for-it.sh:/usr/src/app/wait-for-it.sh
links:
- "db:postgres"
depends_on:
- db
entrypoint: [ "./wait-for-it.sh", "db:5432", "--", "./entrypoint.sh" ]
*Thread Reply:* no permissions?
I am trying to run Google Cloud Composer where I have added the openlineage-airflow PyPI package as a dependency and have set the env var OPENLINEAGE_EXTRACTORS to point to my custom extractor. I have added a folder named dependencies and inside that I have placed my extractor file, and the path given to OPENLINEAGE_EXTRACTORS is dependencies.<filename>.<extractor_class_name>... still it fails with an exception saying No module named 'dependencies'. Can anyone kindly help me out on correcting my mistake
*Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you're using?
*Thread Reply:* airflow ---> 2.5.3, openlineage-airflow ---> 1.1.0
*Thread Reply:* ```
import traceback
import uuid
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.airflow.utils import get_job_name


class BigQueryInsertJobExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['BigQueryInsertJobOperator']

    def extract(self) -> Optional[TaskMetadata]:
        return None

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        self.log.debug(f"JEEVAN ---> extract_on_complete({task_instance})")
        random_uuid = str(uuid.uuid4())
        self.log.debug(f"JEEVAN ---> Randomly Generated UUID --> {random_uuid}")
        self.operator.job_id = random_uuid
        return TaskMetadata(
            name=get_job_name(task=self.operator)
        )
```
*Thread Reply:* this is the custom extractor code that im trying with
*Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow
*Thread Reply:* No module named 'dependencies'.
This sounds like general Python problem
*Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer
*Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too
*Thread Reply:* The problem is, in GCS Composer there is a component called Triggerer, which they say is used for deferrable operators... I have logged into that pod and I could see that the GCS bucket is not mounted on it, but I am unable to understand why the initialisation is happening inside the triggerer pod
*Thread Reply:* > The problem is, in GCS Composer there is a component called Triggerer, which they say is used for deferrable operators... I have logged into that pod and I could see that the GCS bucket is not mounted on it, but I am unable to understand why the initialisation is happening inside the triggerer pod
OL integration is not running on the triggerer, only on the worker and scheduler pods
*Thread Reply:* As you can see in this screenshot i am seeing the logs of the triggerer and it says clearly unable to import plugin openlineage
*Thread Reply:* I see. There are a few possible things to do here - Composer could mount the user files, Airflow could avoid starting plugins on the triggerer, or we could detect we're on the triggerer and not import anything there. However, does it impact OL or Airflow operation in any other way than this log?
*Thread Reply:* Probably we'd have to do something if that really bothers you, as there won't be further changes to Airflow 2.5
*Thread Reply:* The problem is it is actually not registering this custom extractor written by me; hence I am just receiving the DefaultExtractor output and my piece of extractor code is not even getting triggered
*Thread Reply:* any suggestions to try @Maciej Obuchowski
*Thread Reply:* Could you share worker logs?
*Thread Reply:* and check if module is importable from your dag code?
*Thread Reply:* these are the worker pod logs... where there is no log of the openlineage plugin
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one
*Thread Reply:* {
"textPayload": "Traceback (most recent call last): File \"/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py\", line 427, in import_from_string module = importlib.import_module(module_path) File \"/opt/python3.8/lib/python3.8/importlib/__init__.py\", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 961, in _find_and_load_unlocked File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 961, in _find_and_load_unlocked File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 973, in _find_and_load_unlockedModuleNotFoundError: No module named 'airflow.gcs'",
"insertId": "pt2eu6fl9z5vw",
"resource": {
"type": "cloud_composer_environment",
"labels": {
"environment_name": "openlineage",
"location": "us-west1",
"project_id": "acceldata-acm"
}
},
"timestamp": "2023-09-13T06:20:44.131577764Z",
"severity": "ERROR",
"labels": {
"worker_id": "airflow-worker-xttt8"
},
"logName": "projects/acceldata-acm/logs/airflow-worker",
"receiveTimestamp": "2023-09-13T06:20:48.847319607Z"
},
it doesn't see No module named 'airflow.gcs'
that is part of your extractor path airflow.gcs.dags.big_query_insert_job_extractor.BigQueryInsertJobExtractor
however, is it necessary? I generally see people using imports directly from dags folder
*Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?
*Thread Reply:* maybe __init__.py is missing for the top-level dag path?
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean to make the dags folder as well like a module, by adding the __init__.py?
*Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem
*Thread Reply:* and would be nice if you could set
AIRFLOW__LOGGING__LOGGING_LEVEL="DEBUG"
*Thread Reply:* ```
Starting the process, got command: triggerer
Initializing airflow.cfg.
airflow.cfg initialization is done.
[2023-09-13T13:11:46.620+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 8)
[2023-09-13T13:11:46.622+0000] {settings.py:372} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=570, pid=8
[2023-09-13T13:11:50.527+0000] {plugins_manager.py:300} DEBUG - Loading plugins
[2023-09-13T13:11:50.580+0000] {plugins_manager.py:244} DEBUG - Loading plugins from directory: /home/airflow/gcs/plugins
[2023-09-13T13:11:50.581+0000] {plugins_manager.py:224} DEBUG - Loading plugins from entrypoints
[2023-09-13T13:11:50.587+0000] {plugins_manager.py:227} DEBUG - Importing entry_point plugin OpenLineagePlugin
[2023-09-13T13:11:50.740+0000] {utils.py:430} WARNING - No module named 'boto3'
[2023-09-13T13:11:50.743+0000] {utils.py:430} WARNING - No module named 'botocore'
[2023-09-13T13:11:50.833+0000] {utils.py:430} WARNING - No module named 'airflow.providers.sftp'
[2023-09-13T13:11:51.144+0000] {utils.py:430} WARNING - No module named 'big_query_insert_job_extractor'
[2023-09-13T13:11:51.145+0000] {plugins_manager.py:237} ERROR - Failed to import plugin OpenLineagePlugin
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py", line 427, in import_from_string
    module = importlib.import_module(module_path)
  [... importlib frames omitted ...]
ModuleNotFoundError: No module named 'big_query_insert_job_extractor'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/plugins_manager.py", line 229, in load_entrypoint_plugins
    plugin_class = entry_point.load()
  [... import machinery frames omitted ...]
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/plugin.py", line 32, in <module>
    from openlineage.airflow import listener
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/listener.py", line 75, in <module>
    extractor_manager = ExtractorManager()
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 16, in __init__
    self.task_to_extractor = Extractors()
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/extractors/extractors.py", line 122, in __init__
    extractor = import_from_string(extractor.strip())
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py", line 431, in import_from_string
    raise ImportError(f"Failed to import {path}") from e
ImportError: Failed to import big_query_insert_job_extractor.BigQueryInsertJobExtractor
[2023-09-13T13:11:51.235+0000] {plugins_manager.py:227} DEBUG - Importing entry_point plugin composer_menu_plugin
[2023-09-13T13:11:51.719+0000] {plugins_manager.py:316} DEBUG - Loading 1 plugin(s) took 1.14 seconds
[2023-09-13T13:11:51.733+0000] {triggerer_job.py:101} INFO - Starting the triggerer
[... repeated {base_job.py:240} DEBUG - [heartbeat] lines omitted ...]
```
*Thread Reply:* still the same error in the triggerer pod
*Thread Reply:* > still the same error in the triggerer pod it won't change, we're not trying to fix the triggerer import but worker, and should look only at worker pod at this point
*Thread Reply:* ```extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'bigqueryinsertjobextractor.BigQueryInsertJobExtractor'
Using extractor BigQueryInsertJobExtractor tasktype=BigQueryInsertJobOperator airflowdagid=dataanalyticsdag taskid=joinbqdatasets.bqjoinholidaysweatherdata2021 airflowrunid=manual_2023-09-13T13:24:08.946947+00:00
fatal: not a git repository (or any parent up to mount point /home/airflow) Stopping at filesystem boundary (GITDISCOVERYACROSSFILESYSTEM not set). fatal: not a git repository (or any parent up to mount point /home/airflow) Stopping at filesystem boundary (GITDISCOVERYACROSSFILESYSTEM not set).```
*Thread Reply:* able to see these logs in the worker pod... so what you said is right, that it is able to get the extractor, but I get the below error immediately, where it says not a git repository
*Thread Reply:* seems like we are almost there... am I missing something obvious?
*Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hm, this could be the actual bug?
*Thread Reply:* that's a casual log in composer
*Thread Reply:* extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'
thatâs actually class from your custom module, right?
*Thread Reply:* I have this extractor detected as expected
*Thread Reply:* seen as <class 'dependencies.bq.BigQueryInsertJobExtractor'>
*Thread Reply:* no __init__.py in the base dags folder
*Thread Reply:* I also checked that the triggerer pod indeed has no gcsfuse set up - tbh no idea why, maybe some kind of optimization. The only effect is that when loading plugins in the triggerer it throws some errors in logs; we don't do anything there at the moment
*Thread Reply:* okk... got it @Jakub Dardziński... so the __init__.py at the top level of dags is not required either, got it. Just one more doubt: there is a requirement where I want to change the operator's property in the extractor inside the extract function - will that be taken into account, and the operator's execute be called with the property that I have populated in my extractor?
*Thread Reply:* for example I want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator operator I want to intercept that and add this job_id property to the operator... will that work?
*Thread Reply:* I'm not sure if using OL for such a thing is the best choice. Wouldn't it be better to subclass the operator?
*Thread Reply:* but the answer is: it depends on the airflow version; in 2.3+ I'm pretty sure the changed property stays in the execute method
*Thread Reply:* yeah, ideally that is how we should have done this, but the problem is our client has around 1000+ DAGs in different google cloud projects, which are owned by multiple teams... so they are not willing to change anything in their dags. Thankfully they are using airflow 2.4.3
*Thread Reply:* task_policy might be a better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html
*Thread Reply:* btw I double-checked - the execute method runs in a different process, so this would not change the task's attribute there
*Thread Reply:* @Jakub Dardziński any idea how we can achieve this one ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709
@here has anyone succeeded in getting a custom extractor to work in GCP Cloud Composer or AWS MWAA? Seems like there is no way
*Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.
I am exploring Spark - OpenLineage integration (using the latest PySpark and OL versions). I tested a simple pipeline which:
âą Reads JSON data into PySpark DataFrame
âą Apply data transformations
âą Write transformed data to MySQL database
Observed that we receive 4 events (2 START and 2 COMPLETE) for the same job name. The events are almost identical, with a small diff in the facets. All the events share the same runId, and we don't get any parentRunId.
Team, can you please confirm if this behaviour is expected? Seems to be different from the Airflow integration where we relate jobs to Parent Jobs.
*Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:
spark.openlineage.parentJobName
spark.openlineage.parentRunId
You can find the list of parameters here:
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md
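For anyone wanting to try it, a hedged sketch of passing those parent parameters when building a PySpark session. The package version, URLs, and parent job/run values below are placeholders, not taken from the thread:
```
# Hedged sketch: wiring the Spark integration plus the parent job/run via Spark conf.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol_parent_demo")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.2.2")      # placeholder version
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")           # placeholder backend
    .config("spark.openlineage.parentJobName", "my_dag.my_task")                  # placeholder
    .config("spark.openlineage.parentRunId", "01890ffb-0000-0000-0000-000000000000")  # placeholder
    .getOrCreate()
)
```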
*Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.
*Thread Reply:* Can you share the job and events? Also @Paweł Leszczyński
*Thread Reply:* Hi @Suraj Gupta,
Thanks for providing such a detailed description of the problem.
It is not expected behaviour; it's an issue. The events correspond to the same logical plan, which for some reason leads to sending two OL events. Is it reproducible, i.e. does it occur each time? If yes, please feel free to raise an issue for that.
We have added several tests in recent months to verify the amount of OL events being generated, but we haven't tested it that way with JDBC. BTW, will the same happen if you write your data df_transformed to a file (like a parquet file)?
*Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to a file and get back.
*Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.
*Thread Reply:* even if you write onto a file?
*Thread Reply:* Yes, even when I write to a parquet file.
*Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files
*Thread Reply:* Opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2104
*Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?
*Thread Reply:* @Suraj Gupta put a comment within your issue. It's a bug we need to solve, but I cannot bring any estimates today.
*Thread Reply:* Thanks for the update @Paweł Leszczyński, also please look into this comment. It might be related and I'm not sure if it's expected behaviour.
@channel This month's TSC meeting, open to all, is tomorrow: https://openlineage.slack.com/archives/C01CK9T7HKR/p1694113940400549
Context:
We use Spark with YARN, running on Hadoop 2.x (I can't remember the exact minor version) with Hive support.
Problem:
I've noticed that CreateDataSourceAsSelectCommand objects are always transformed to an OutputDataset with a namespace value set to file - which is curious, because the inputs always have a (correct) namespace of hdfs://<name-node>. Is this a known issue? A flaw with Apache Spark? A bug in the resolution logic?
For reference:
```
public class CreateDataSourceTableCommandVisitor
    extends QueryPlanVisitor<CreateDataSourceTableCommand, OpenLineage.OutputDataset> {

  public CreateDataSourceTableCommandVisitor(OpenLineageContext context) {
    super(context);
  }

  @Override
  public List<OpenLineage.OutputDataset> apply(LogicalPlan x) {
    CreateDataSourceTableCommand command = (CreateDataSourceTableCommand) x;
    CatalogTable catalogTable = command.table();

    return Collections.singletonList(
        outputDataset()
            .getDataset(
                PathUtils.fromCatalogTable(catalogTable),
                catalogTable.schema(),
                OpenLineage.LifecycleStateChangeDatasetFacet.LifecycleStateChange.CREATE));
  }
}
```
Running this:
cat events.log | jq '{eventTime: .eventTime, eventType: .eventType, runId: .run.runId, jobNamespace: .job.namespace, jobName: .job.name, outputs: .outputs[] | {namespace: .namespace, name: .name}, inputs: .inputs[] | {namespace: .namespace, name: .name}}'
This is an output:
{
"eventTime": "2023-09-13T16:01:27.059Z",
"eventType": "START",
"runId": "bbbb5763-3615-46c0-95ca-1fc398c91d5d",
"jobNamespace": "spark.cluster-1",
"jobName": "ol_hadoop_test.execute_create_data_source_table_as_select_command.dhawes_db_ol_test_hadoop_tgt",
"outputs": {
"namespace": "file",
"name": "/user/hive/warehouse/dhawes.db/ol_test_hadoop_tgt"
},
"inputs": {
"namespace": "<hdfs://nn1>",
"name": "/user/hive/warehouse/dhawes.db/ol_test_hadoop_src"
}
}
*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently
*Thread Reply:* Hmm, I'll have to do explain plan to see what exactly it is.
However my sample job uses spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src")
which itself is created using
spark.sql("SELECT 1 AS id").write.format("orc").mode("overwrite").saveAsTable("dhawes.ol_test_hadoop_src")
*Thread Reply:* ``>>> spark.sql("SELECT ** FROM dhawes.ol_test_hadoop_src").explain(True)
== Parsed Logical Plan ==
'Project [**]
+- 'UnresolvedRelation
dhawes.
oltesthadoop_src`
== Analyzed Logical Plan ==
id: int
Project [id#3]
+- SubqueryAlias dhawes
.ol_test_hadoop_src
+- Relation[id#3] orc
== Optimized Logical Plan == Relation[id#3] orc
== Physical Plan ==
**(1) FileScan orc dhawes.oltesthadoop_src[id#3] Batched: true, Format: ORC, Location: InMemoryFileIndex[
Hey everyone, any chance we could have an openlineage-integration-common 1.1.1 release with the following changes..?
• https://github.com/OpenLineage/OpenLineage/pull/2106
• https://github.com/OpenLineage/OpenLineage/pull/2108
*Thread Reply:* Especially the first PR is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533
*Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize
*Thread Reply:* The release is authorized and will be initiated within two business days.
*Thread Reply:* Thanks a lot, @Michael Robinson!
Per discussion in the OpenLineage sync today, here is a very early strawman proposal for an OpenLineage registry that producers and consumers could be registered in. Feedback or alternate proposals welcome: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit Once this is sufficiently fleshed out, I'll create an actual proposal on github
*Thread Reply:* I have cleaned up the registry proposal.
https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit
In particular:
• I clarified that option 2 is preferred at this point.
• I moved discussion notes to the bottom; they will go away at some point.
• Once it is stable, I'll create a proposal with the preferred option.
• We need a good proposal for the core facets prefix. My suggestion is to move core facets to core in the registry. The drawback is that the prefix would be inconsistent.
*Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I'll turn it into a md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu https://github.com/OpenLineage/OpenLineage/issues/2161
@channel Friendly reminder: the next OpenLineage meetup, our first in Toronto, is happening this coming Monday at 5 PM ET https://openlineage.slack.com/archives/C01CK9T7HKR/p1694441261486759
@here we have a dataproc operator getting called from a dag which submits a spark job, and we wanted to maintain that continuity of the parent job in the spark job. According to the documentation we can achieve that by using a macro called lineage_run_id that requires task and task_instance as parameters. The problem we are facing is that our clients have 1000's of dags, so asking them to change this everywhere it is used is not feasible, so we thought of using the task_policy feature in airflow... but the problem is that task_policy gives you access only to the task/operator, and we don't have access to the task instance, which is required as a parameter to the lineage_run_id function. Can anyone kindly help us on how we should go about this one
t1 = DataProcPySparkOperator(
    task_id=job_name,
    # required pyspark configuration,
    job_name=job_name,
    dataproc_pyspark_properties={
        'spark.driver.extraJavaOptions':
            f"-javaagent:{jar}={os.environ.get('OPENLINEAGE_URL')}/api/v1/namespaces/{os.getenv('OPENLINEAGE_NAMESPACE', 'default')}/jobs/{job_name}/runs/{{{{macros.OpenLineagePlugin.lineage_run_id(task, task_instance)}}}}?api_key={os.environ.get('OPENLINEAGE_API_KEY')}"
    },
    dag=dag)
*Thread Reply:* you don't need the actual task instance to do that. You only need to set the additional argument as a Jinja template, same as above
*Thread Reply:* task_instance in this case is just part of a string which is evaluated when the Jinja render happens
*Thread Reply:* ohh... then we could use the same example as above inside task_policy to intercept the operator and add the openlineage-specific properties?
*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones
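To make the idea concrete, a hedged sketch of such a cluster policy in airflow_local_settings.py. The jar path is a placeholder, the dataproc_pyspark_properties attribute name is assumed from the operator call above, and the macro is left as a Jinja string so task_instance is resolved at render time:
```
# Hedged sketch only; attribute names and paths are assumptions.
import os


def task_policy(task):
    if task.task_type != "DataProcPySparkOperator":
        return
    jar = "/path/to/openlineage-spark.jar"  # placeholder
    props = getattr(task, "dataproc_pyspark_properties", None) or {}
    props["spark.driver.extraJavaOptions"] = (
        f"-javaagent:{jar}={os.environ.get('OPENLINEAGE_URL')}"
        f"/api/v1/namespaces/{os.getenv('OPENLINEAGE_NAMESPACE', 'default')}"
        f"/jobs/{task.task_id}/runs/"
        "{{ macros.OpenLineagePlugin.lineage_run_id(task, task_instance) }}"
        f"?api_key={os.environ.get('OPENLINEAGE_API_KEY')}"
    )
    task.dataproc_pyspark_properties = props
```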
*Thread Reply:* yeah sure... thank you so much @Jakub Dardziński, will try this out and keep you posted
*Thread Reply:* We want to automate setting those options at some point inside the operator itself
@here is there a way by which we could add custom headers to the openlineage client in airflow? I see that provision is there for the spark integration via properties like spark.openlineage.transport.headers.xyz --> abcdef
*Thread Reply:* there's no out-of-the-box possibility to do that yet, you're very welcome to create an issue in GitHub and maybe contribute as well!
It doesn't seem like there's a way to override the OL endpoint from the default (/api/v1/lineage) in Airflow? I tried setting the OPENLINEAGE_ENDPOINT environment variable to no avail. Based on this statement, it seems that only OPENLINEAGE_URL was used to construct HttpConfig?
*Thread Reply:* That's correct. For now there's no way to configure the endpoint via env var. You can do that by using a config file
*Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.
*Thread Reply:* historical I guess? go for the PR, of course
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151
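Until that PR lands, a hedged sketch of overriding the endpoint when constructing the Python client directly. This assumes HttpConfig accepts an endpoint field that is appended to url; URL and endpoint values are placeholders:
```
# Hedged sketch only, not Airflow-provider configuration.
from openlineage.client import OpenLineageClient
from openlineage.client.transport.http import HttpConfig, HttpTransport

config = HttpConfig(
    url="https://my-backend.example.com",   # placeholder
    endpoint="api/v2/custom-lineage",        # placeholder; default is api/v1/lineage
)
client = OpenLineageClient(transport=HttpTransport(config))
```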
Hi! I'm in need of help with wrapping my head around OpenLineage. My team has the goal of collecting metadata from the Airflow operators GreatExpectationsOperator, PythonOperator, MsSqlOperator and BashOperator (for dbt). Where can I see the source code for what is collected for each operator, and is there support for these in the new provider apache-airflow-providers-openlineage? I am super confused and feel lost in the docs. We are using MSSQL/ODBC to connect to our db, and this data does not seem to appear as datasets in Marquez - do I need to configure this? If so, how and where?
Happy for any help, big or small!
*Thread Reply:* there's no single source listing which integrations are currently implemented in the openlineage Airflow provider. That's something we should work on so it's more visible
*Thread Reply:* answering this quickly - GE & MS SQL are not currently implemented yet in the provider
*Thread Reply:* but I also invite you to contribute if you're interested!
Hi I need help in extracting OpenLineage for PostgresOperator in json format. any suggestions or comments would be greatly appreciated
*Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html
*Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage
*Thread Reply:* Maciej, Thanks for sharing the link https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html this should address the issue
congrats folks! https://lfaidata.foundation/blog/2023/09/20/lf-ai-data-foundation-announces-graduation-of-openlineage-project
@channel
We released OpenLineage 1.2.2!
Added
• Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener #2089 @d-m-h
• Spark: capture and emit spark.databricks.clusterUsageTags.clusterAllTags variable from databricks environment #2099 @Anirudh181001
Fixed
• Common: support parsing dbt_project.yml without target-path #2106 @tatiana
• Proxy: fix Proxy chart #2091 @harels
• Python: fix serde filtering #2044 @xli-1026
• Python: use non-deprecated apiKey if loading it from env variables #2029 @mobuchowski
• Spark: Improve RDDs on S3 integration #2039 @pawel-big-lebowski
• Flink: prevent sending running events after job completes #2075 @pawel-big-lebowski
• Spark & Flink: Unify dataset naming from URI objects #2083 @pawel-big-lebowski
• Spark: Databricks improvements #2076 @pawel-big-lebowski
Removed
• SQL: remove sqlparser dependency from iface-java and iface-py #2090 @JDarDagran
Thanks to all the contributors, including new contributors @tati, @xli-1026, and @d-m-h!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.2.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.1.0...1.2.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you've done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn't find any solutions or official support either at OpenLineage or at DataHub, but I still want to continue using OpenLineage
*Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński
*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.
Hi, we're using Custom Operators in airflow(2.5) and are planning to expose lineage via default extractors: https://openlineage.io/docs/integrations/airflow/default-extractors/ Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible? Since OpenLineage has now moved inside airflow and I think there is no concept of extractors in the latest version.
*Thread Reply:* Also, do we have any docs on how OL works with the latest airflow version? A few questions:
• How is it replacing the concept of custom extractors and Manually Annotated Lineage in the latest version?
• Do we have any examples of setting up the integration to emit input/output datasets for non-supported Operators like PythonOperator?
*Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible?
It will be compatible, âdefault extractorsâ is generally the same concept as weâre using in the 2.7 integration.
One thing that might be good to update is import paths, from openlineage.airflow
to airflow.providers.openlineage
but should work both ways
> âą Do we have any code samples/docs of setting up the integration to emit input/output datasets for non supported Operators like PythonOperator? Our experience with that is currently lacking - this means, it works like in bare airflow, if you annotate your PythonOperator tasks with old Airflow lineage like in this doc.
We want to make this experience better - by doing few things âą instrumenting hooks, then collecting lineage from them âą integration with AIP-48 datasets âą allowing to emit lineage collected inside Airflow task by other means, by providing core Airflow API for that All those things require changing core Airflow in a couple of ways: âą tracking which hooks were used during PythonOperator execution âą just being able to emit datasets (airflow inlets/outlets) from inside of a task - they are now a static thing, so if you try that it does not work âą providing better API for emitting that lineage, preferably based on OpenLineage itself rather than us having to convert that later. As this requires core Airflow changes, it wonât be live until Airflow 2.8 at the earliest.
thanks to @Maciej Obuchowski for this response
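For reference, a hedged sketch of the "old Airflow lineage" annotation mentioned above, which the OpenLineage integration can pick up for otherwise unsupported operators. All dataset coordinates and names are placeholders:
```
# Hedged sketch: annotating a PythonOperator with Airflow's built-in lineage entities.
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import File, Table
from airflow.operators.python import PythonOperator


def transform():
    ...  # task logic


with DAG("lineage_annotation_demo", start_date=datetime(2023, 1, 1), schedule=None):
    PythonOperator(
        task_id="transform",
        python_callable=transform,
        inlets=[Table(database="analytics", cluster="postgres://db:5432", name="public.orders")],
        outlets=[File(url="s3://my-bucket/output/orders.parquet")],
    )
```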
I am using this accelerator that leverages OpenLineage on Databricks to publish lineage info to Purview, but it's using a rather old version of OpenLineage, aka 0.18. Has anybody tried it on a newer version of OpenLineage? I am facing some issues where the inputs and outputs for the same object have different json: https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/
I installed 1.2.2 on Databricks, followed the below init script: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh
my cluster config looks like this:
spark.openlineage.version v1
spark.openlineage.namespace adb-5445974573286168.8#default
spark.openlineage.endpoint v1/lineage
spark.openlineage.url.param.code 8kZl0bo2TJfnbpFxBv-R2v7xBDj-PgWMol3yUm5iP1vaAzFu9kIZGg==
spark.openlineage.url https://f77b-50-35-69-138.ngrok-free.app
But it is not calling the API, it works fine with 0.18 version
*Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929
*Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!
@here I'm presently addressing a particular scenario that pertains to Openlineage authentication, specifically involving the use of an access key and secret.
I've implemented a custom token provider called AccessKeySecretKeyTokenProvider, which extends the TokenProvider class. This token provider communicates with another service, obtaining a token and an expiration time based on the provided access key, secret, and client ID.
My goal is to retain this token in a cache prior to its expiration, thereby eliminating the need for network calls to the third-party service. Is this possible without relying on an external caching system?
*Thread Reply:* Hey @Sangeeta Mishra, I'm not sure that I fully understand your question here. What do you mean by OpenLineage authentication? What are you using to generate OL events? What's your OL receiving backend?
*Thread Reply:* Hey @Harel Shein, I wanted to clarify the previous message. I apologize for any confusion. When I mentioned "OpenLineage authentication," I was actually referring to the authentication process for the OpenLineage backend, specifically using HTTP transport. This involves using my custom token provider, which utilizes access keys and secrets for authentication. The OL backend is http based backend . I hope this clears things up!
*Thread Reply:* We are trying to leverage our own backend here.
*Thread Reply:* I see.. I'm not sure the OpenLineage community could help here. Which webserver framework are you using?
*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited time of expiry. Hence, wanted to cache this information inside the token provider.
*Thread Reply:* I see, I would ask this question here https://ktor.io/support/
*Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?
*Thread Reply:* @Paweł Leszczyński I am using the python client
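One hedged way to do the caching purely inside the provider instance, assuming the Python client's HTTP transport keeps a single TokenProvider per client and calls get_bearer on each emit. Class, method, and config names below are illustrative and the auth-service call is left as a stub:
```
# Hedged sketch: cache a short-lived token inside the custom provider, refreshing near expiry.
import time

from openlineage.client.transport.http import TokenProvider


class AccessKeySecretKeyTokenProvider(TokenProvider):
    def __init__(self, config: dict):
        super().__init__(config)
        self._token = None
        self._expires_at = 0.0

    def _fetch_token(self) -> tuple[str, float]:
        # call the external auth service with access key/secret/client id here (omitted)
        raise NotImplementedError

    def get_bearer(self):
        # refresh only when the token is missing or within 60s of expiry
        if self._token is None or time.time() > self._expires_at - 60:
            self._token, self._expires_at = self._fetch_token()
        return f"Bearer {self._token}"
```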
I'm using the Spark OpenLineage integration. In the outputStatistics output dataset facet we receive rowCount and size. The job performs a SQL insert into a MySQL table and I'm receiving the size as 0.
{
"outputStatistics":
{
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.1.0/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet>",
"rowCount": 1,
"size": 0
}
}
I'm not sure what the size means here. Does this mean number of bytes inserted/updated?
Also, do we have any documentation for Spark specific Job and Run facets?
*Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1
@here In the Airflow integration we send a lineage event for DAG start and complete, but that is not the case with the spark integration... we don't receive any event for the application start and complete in spark. Is this expected behaviour or am I missing something?
*Thread Reply:* For spark we do send start and complete for each spark action being run (a single operation that causes spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a spark job or a spark script.
*Thread Reply:* I think we need to look deeper into that, as there is a recurring need to capture such information
*Thread Reply:* and the spark listener event has methods like onApplicationStart and onApplicationEnd
*Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context
*Thread Reply:* yeah exactly. the way that it works with airflow integration
*Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105
*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings, raise that issue and convince the community of its importance
*Thread Reply:* yeah sure, would love to do that... how can I join them? Will that be posted here in this slack channel?
*Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community
Meetup recap: Toronto Meetup @ Airflow Summit, September 18, 2023
It was great to see so many members of our community at this event! I counted 32 total attendees, with all but a handful being first-timers. Topics included:
• Presentation on the history, architecture and roadmap of the project by @Julien Le Dem and @Harel Shein
• Discussion of OpenLineage support in Marquez by @Willy Lulciuc
• Presentation by Ye Liu and Ivan Perepelitca from Metaphor, the social platform for data, about their integration
• Presentation by @Paweł Leszczyński about the Spark integration
• Presentation by @Maciej Obuchowski about the Apache Airflow Provider
Thanks to all the presenters and attendees, with a shout out to @Harel Shein for the help with organizing and day-of logistics, @Jakub Dardziński for the help with set up/clean up, and @Sheeri Cabral (Collibra) for the crucial assist with the signup sheet.
This was our first meetup in Toronto, and we learned some valuable lessons about planning events in new cities - the first and foremost being to ask for a pic of the building! But it seemed like folks were undeterred, and the space itself lived up to expectations. For a recording and clips from the meetup, head over to our YouTube channel.
Upcoming events:
• October 5th in San Francisco: Marquez Meetup @ Astronomer (sign up here: https://www.meetup.com/meetup-group-bnfqymxe/events/295444209/)
• November: Warsaw meetup (details, date TBA)
• January: London meetup (details, date TBA)
Are you interested in hosting or co-hosting an OpenLineage or Marquez meetup? DM me!
Hi folks, am I correct in my observations that the Spark integration does not generate inputs and outputs for Kafka-to-Kafka pipelines?
EDIT: Removed the crazy wall of text. Relevant GitHub issue is here.
*Thread Reply:* responded within the issue
Hello community First time poster - bear with me :)
I am looking to make a minor PR on the airflow integration (fixing github #2130), and the code change is easy enough, but I fail to install the python environment. I have tried the simple ones
OpenLineage/integration/airflow > pip install -e .
or
OpenLineage/integration/airflow > pip install -r dev-requirements.txt
but they both fail on
ERROR: No matching distribution found for openlineage-sql==1.3.0
(which I think is an unreleased version in the git project)
How would I go about to install the requirements?
//Erik
PS. Sorry for posting this in general if there is a specific integration or contribution channel - I didnt find a better channel
*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not airflow integration expert, but what it looks to me, you're missing openlineage-sql library, which is a rust library used to extract lineage from sql queries. This is how we do that in circle ci: https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8080/workflows/aba53369-836c-48f5-a2dd-51bc0740a31c/jobs/140113
and subproject page with build instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql
*Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?
I was hoping for something more automagical, but that should work
*Thread Reply:* I think so. @Jakub DardziĆski am I right?
*Thread Reply:* https://openlineage.io/docs/development/developing/python/setup - there's a guide on how to set up the dev environment
> Typically, you first need to build openlineage-sql locally (see README). After each release you have to repeat this step in order to bump the local version of the package.
This might be exposed somewhat more in the GitHub repository README as well
*Thread Reply:* It didn't find the wheel in the cache, but if I used the line in sql/README.md
pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
it is installed and thus skipped/passed when pip later checks if it needs to be installed.
Now I have a second issue because it is expecting me to have mysqlclient-2.2.0, which seems to need a binary:
Command 'pkg-config --exists mysqlclient' returned non-zero exit status 127
and
Command 'pkg-config --exists mariadb' returned non-zero exit status 127
I am on Ubuntu 22.04 in WSL2. Should I go to apt and grab me a mysql client?
*Thread Reply:* > It didnt find the wheel in the cache, but if I used the line in the sql/README.md
> pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
> It is installed and thus skipped/passed when pip later checks if it needs to be installed.
Thatâs actually expected. You should build new wheel locally and then install it.
> Now I have a second issue because it is expecting me to have mysqlclient-2.2.0 which seems to need a binary
> Command 'pkg-config --exists mysqlclient' returned non-zero exit status 127
> and
> Command 'pkg-config --exists mariadb' returned non-zero exit status 127
> I am on Ubuntu 22.04 in WSL2. Should I go to apt and grab me a mysql client?
Weâve left some system specific configuration, e.g. mysqlclient, to users as itâs a bit aside from OpenLineage and more of general development task.
probably
sudo apt-get install python3-dev default-libmysqlclient-dev build-essential
should work
*Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared
*Thread Reply:* You could do that as well, but if you want to test your changes against many Airflow versions that wouldn't be possible, I think (run them with tox btw)
*Thread Reply:* This is starting to feel like a rabbit hole
When I run tox, I get a lot of build errors:
• the client needs to be built
• sql needs to be built to a different target than its readme says
• a lot of builds fail on cython_sources
*Thread Reply:* would you like to share some exact log lines? I've never seen such errors; they are probably system specific
*Thread Reply:* Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
    /tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in `setup.cfg`
    !!
    The license_file parameter is deprecated, use license_files instead.
    By 2023-Oct-30, you need to update your project and remove deprecated calls
    or your builds will no longer be supported.
    See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
    !!
      parsed = self.parsers.get(option_name, lambda x: x)(value)
    running egg_info
    writing lib3/PyYAML.egg-info/PKG-INFO
    writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
    writing top-level names to lib3/PyYAML.egg-info/top_level.txt
    Traceback (most recent call last):
      File "/home/obr_erikal/projects/OpenLineage/integration/airflow/.tox/py3-airflow-2.1.4/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
        main()
      [... intermediate pip/setuptools/distutils frames omitted ...]
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
        self.filelist.extend(build_ext.get_source_files())
      File "<string>", line 201, in get_source_files
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
        raise AttributeError(attr)
    AttributeError: cython_sources
    [end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
py3-airflow-2.1.4: exit 1 (7.85 seconds) /home/obr_erikal/projects/OpenLineage/integration/airflow> python -m pip install --find-links target/wheels/ --find-links ../sql/iface-py/target/wheels --use-deprecated=legacy-resolver --constraint=https://raw.githubusercontent.com/apache/airflow/constraints-2.1.4/constraints-3.8.txt apache-airflow==2.1.4 'mypy>=0.9.6' pytest pytest-mock -r dev-requirements.txt pid=368621
py3-airflow-2.1.4: FAIL in 7.92 seconds
*Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?
*Thread Reply:* for the error - I think there's a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?
*Thread Reply:* we're using ruff; tox runs it as one of the commands
*Thread Reply:* Not in the airflow folder?
OpenLineage/integration/airflow$ maturin build --out target/wheels
maturin failed
  Caused by: pyproject.toml at /home/obr_erikal/projects/OpenLineage/integration/airflow/pyproject.toml is invalid
  Caused by: TOML parse error at line 1, column 1
    |
  1 | [tool.ruff]
    | ^
  missing field `build-system`
*Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md
so
cd iface-py
python -m pip install maturin
maturin build --out ../target/wheels
becomes
cd iface-py
python -m pip install maturin
maturin build --out target/wheels
tox runs
install_command = python -m pip install {opts} --find-links target/wheels/ \
--find-links ../sql/iface-py/target/wheels
but it should be
install_command = python -m pip install {opts} --find-links target/wheels/ \
--find-links ../sql/target/wheels
actually, and I'm posting a PR to fix that
*Thread Reply:* yes, that part I actually worked out myself, but I fail to understand the cause of the cython_sources error. I have python3-dev installed on WSL Ubuntu with Python version 3.10.12 in a virtualenv. Anything in that that could cause issues?
*Thread Reply:* looks like it has something to do with the latest release of Cython? Maybe pip install "Cython&lt;3" solves the issue?
*Thread Reply:* I didn't have any Cython before the install. Also no change. Could it be some update to setuptools itself? Seems like the deprecation notice and the error are coming from inside setuptools.
*Thread Reply:* (I.e. I tried the pip install "Cython&lt;3" command without any change in the output)
*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)
If the issue persists on my own computer, I'll dig a bit further
*Thread Reply:* It's a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well
*Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.
*Thread Reply:* Is there an official release cycle?
or, more specifically, given that the PRs are approved, how soon can they reach openlineage-dbt and apache-airflow-providers-openlineage?
*Thread Reply:* we need to differentiate some things:
we have control over releases (obviously) in the OL repo - it's a monthly cycle, so that should happen at the beginning of next week. There's also the possibility to ask for an ad-hoc release in the #general Slack channel, and with committers' approvals the new version is released
For Airflow providers - the cycle is monthly as well
*Thread Reply:* it's a bit complex with this split but needed temporarily
*Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?
*Thread Reply:* it's not, two separate places ~and we haven't even added the whole thing with converting old lineage objects to OL specific~
editing, that's not true
*Thread Reply:* the code's here: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/extractors/manager.py#L154
*Thread Reply:* sorry I did not mention this earlier. We definitely need to add some guidance on how to proceed with contributions to OL and the Airflow OL provider
*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our parquet file ingestion.
*Thread Reply:* may I ask if you use some custom operator / python operator there?
*Thread Reply:* yeah, taskflow with inlets/outlets
*Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables
*Thread Reply:* awesome! we have plans to integrate more with the Python operator as well, but not earlier than Airflow 2.8
*Thread Reply:* I guess writing a generic extractor for the Python operator is quite hard, but if you could support some inlet/outlet type for tabular file formats / their Python libraries like pyarrow or maybe even pandas, and document it, I think a lot of people would understand how to use them
Are you located in the Brussels area or within commutable distance? Interested in attending a meetup between October 16-20? If so, please DM @Sheeri Cabral (Collibra) or myself. TIA
@channel Hello all, I'd like to open a vote to release OpenLineage 1.3.0, including:
• support for Spark 3.5 in the Spark integration
• scheme preservation bug fix in the Spark integration
• find-links path in tox bug fix in the Airflow integration
• more graceful logging when no OL provider is installed in the Airflow integration
• columns as schema facet for airflow.lineage.Table addition
• SQLSERVER to supported dbt profile types addition
Three +1s from committers will authorize. Thanks in advance.
*Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.
*Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+, sometimes there are no inputs/outputs, hope that is fixed?
*Thread Reply:* @Jason Yip if it isn't fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix
*Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124
*Thread Reply:* and of course this issue still exists
*Thread Reply:* thanks for posting, we'll continue looking into this... if you find any clues that might help, please let us know.
*Thread Reply:* are there any instructions on how to hook up a debugger to OL?
*Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147
*Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!
*Thread Reply:* we'll ask for a release once it's reviewed and merged
@channel The September issue of OpenLineage News is here! This issue covers the big news about OpenLineage coming out of Airflow Summit, progress on the Airflow Provider, highlights from our meetup in Toronto, and much more. To get the newsletter directly in your inbox each month, sign up here.
Hi folks - I'm wondering if it's just me, but does io.openlineage:openlineage_sql_java:1.2.2 ship with the arm64.dylib binary? When I try to run code that uses the Java package on an Apple M1, the binary isn't found. The workaround is to check out 1.2.2 and then build and publish it locally.
*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script, new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - that is run and modifies the codebase.
So, if you pull the code, it contains an OL version that has not been released yet, and in the case of dependencies, one needs to build them on their own.
For example, the Preparation section here https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#preparation describes how to build openlineage-java and openlineage-sql in order to build openlineage-spark.
*Thread Reply:* Hmm. Let's elaborate my use case a bit.
We run Apache Hive on-premise. Hive provides query execution hooks for pre-query, post-query, and I think failed query.
Anyway, as part of the hook, you're given the query string.
So I, naturally, tried to pass the query string into OpenLineageSql.parse(Collections.singletonList(hookContext.getQueryPlan().getQueryStr()), "hive") in order to test this out.
I was using openlineage-sql-java:1.2.2 at that time, and no matter what query string I gave it, nothing was returned.
I then stepped through the code and noticed that it was looking for the arm64 lib, and I noticed that that package (downloaded from Maven Central) lacked that particular native binary.
*Thread Reply:* I hope that helps.
*Thread Reply:* I get it now. In CircleCI we do have 3 build steps:
- build-integration-sql-x86
- build-integration-sql-arm
- build-integration-sql-macos
but no Mac M1. I think at that time CircleCI did not have a proper resource class in the free plan. Additionally, @Maciej Obuchowski would prefer to migrate this to GitHub Actions, as he claims this can be achieved there in a cleaner way (https://github.com/OpenLineage/OpenLineage/issues/1624).
Feel free to create an issue for this. Others would be able to upvote it in case they have similar experience.
*Thread Reply:* It still doesn't have the free resource class. We're blocked on that, unfortunately. The other solution would be to migrate to GH Actions, where most of our solution could be replaced by something like https://github.com/PyO3/maturin-action
@channel
We released OpenLineage 1.3.1!
Added:
• Airflow: add some basic stats to the Airflow integration #1845 @harels
• Airflow: add columns as schema facet for airflow.lineage.Table (if defined) #2138 @erikalfthan
• DBT: add SQLSERVER to supported dbt profile types #2136 @erikalfthan
• Spark: support for latest 3.5 #2118 @pawel-big-lebowski
Fixed:
• Airflow: fix find-links path in tox #2139 @JDarDagran
• Airflow: add more graceful logging when no OpenLineage provider installed #2141 @JDarDagran
• Spark: fix bug in PathUtils' prepareDatasetIdentifierFromDefaultTablePath (CatalogTable) to correctly preserve scheme from CatalogTable's location #2142 @d-m-h
Thanks to all the contributors, including new contributor @Erik Alfthan!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.3.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.2.2...1.3.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
*Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?
Hey everyone - does anyone have a good mechanism for alerting on issues with OpenLineage? For example, maybe alerting when an event times out - perhaps to Prometheus or some other kind of generic endpoint? Not sure of the best approach here (or if the meta-inf extension mechanism would be able to achieve it)
*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.
I would try using the FluentD proxy we have (https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd) to copy the event stream (alerting is just one of the use cases for lineage events) and write a fluentd plugin to send it asynchronously further to an alerting service like PagerDuty.
It looks cool to me but I never had enough time to test this approach.
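For illustration, here is a minimal sketch (not an official OpenLineage component) of what the alerting side could look like: a tiny HTTP sink that receives a copy of the event stream (e.g. forwarded by the fluentd proxy) and posts failed runs to an alerting webhook. The webhook URL and port below are placeholders.
```
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

ALERT_WEBHOOK = "https://alerting.example.com/hook"  # placeholder alerting endpoint

class OLEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Each POST body is assumed to be a single OpenLineage run event (JSON).
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or "{}")
        # FAIL and ABORT are standard OpenLineage run states worth alerting on.
        if event.get("eventType") in ("FAIL", "ABORT"):
            payload = json.dumps({
                "job": event.get("job", {}).get("name"),
                "runId": event.get("run", {}).get("runId"),
                "eventTime": event.get("eventTime"),
            }).encode()
            req = request.Request(ALERT_WEBHOOK, data=payload,
                                  headers={"Content-Type": "application/json"})
            request.urlopen(req)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Point the copied event stream at this port.
    HTTPServer(("0.0.0.0", 9090), OLEventHandler).serve_forever()
```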
@channel This month's TSC meeting is next Thursday the 12th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• Airflow Summit recap
• tutorial: migrating to the Airflow Provider
• discussion topic: observability for OpenLineage/Marquez
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you'd like to demo? DM me to be added to the agenda.
The Marquez meetup in San Francisco is happening right now! https://www.meetup.com/meetup-group-bnfqymxe/events/295444209/
@Michael Robinson can we cut a new release to include this change? âą https://github.com/OpenLineage/OpenLineage/pull/2151
*Thread Reply:* Thanks for requesting a release, @Mars Lan (Metaphor). It has been approved and will be initiated within 2 business days of next Monday.
@here I am trying out the OpenLineage integration of Spark on Databricks. There is no event getting emitted from OpenLineage; I see logs saying OpenLineage Event Skipped. I am attaching the notebook that I am trying to run and the cluster logs. Can someone kindly help me with this?
*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above the events will show up once in a blue moon
*Thread Reply:* ohh, thanks for the information @Jason Yip, I am trying with Databricks version 13.3 and Spark 3.4.1; will try using a lower version as you suggested. Is there any issue tracking this bug, @Jason Yip?
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124
*Thread Reply:* tried with Databricks 12.2 --> Spark 3.3.2, still the same behaviour: no event getting emitted
*Thread Reply:* you can do 11.3, it's the most stable one I know
*Thread Reply:* sure, let me try that out
*Thread Reply:* still the same problem... the jar that I am using is the latest openlineage-spark-1.3.1.jar, do you think that can be the problem?
*Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue, seems like they are skipping some events
*Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs
*Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure openlineage and what is your job doing?
We do have a bunch of integration tests on the Databricks platform available here, and they're passing on Databricks runtime 13.0.x-scala2.12.
Could you also try running the same code as our test does (this one)? If you run it and see OL events, this will make us sure your config is OK and we can continue further debugging.
Looking at your Spark script: could you save your dataset and see if you still don't see any events?
*Thread Reply:* babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
*Thread Reply:* this is the script that I am running, @Paweł Leszczyński... kindly let me know if I'm making any mistake. I have added the init script at the cluster level, and from the logs I can see that OpenLineage is configured, as I see a log statement
*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation
*Thread Reply:* this is also a potential reason why you can't see any events
*Thread Reply:* ohh... ok, will try out the test script that you have mentioned above. Kindly correct me if my understanding is wrong: so if there are a few transformations and finally a write somewhere, that is where the OL events are expected to be emitted?
*Thread Reply:* yes. The main purpose of lineage is to track dependencies between datasets, when a job reads from dataset A and writes to dataset B. In the case of Databricks notebooks that just do a show or collect and print some query result on the screen, there may be no reason to track it in the sense of lineage.
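A minimal sketch of the kind of notebook cell that does produce lineage under this rule (assuming the OpenLineage Spark listener is already configured on the cluster; the path and table name are only examples): reading one dataset and writing another gives the listener an input and an output to report, unlike display()/collect().
```
# `spark` is the notebook's SparkSession on Databricks.
babynames = (spark.read.format("csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load("dbfs:/FileStore/babynames.csv"))

# Persisting the result (saveAsTable, or any write to a path/table) is what
# produces an output dataset in the OpenLineage events for this job.
(babynames.filter(babynames.Year == 2014)
 .write.mode("overwrite")
 .saveAsTable("default.babynames_2014"))
```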
@channel We released OpenLineage 1.4.1!
Additions:
• Client: allow setting client's endpoint via environment variable 2151 @Mars Lan (Metaphor)
• Flink: expand Iceberg source types 2149 @Peter Huang
• Spark: add debug facet 2147 @Paweł Leszczyński
• Spark: enable Nessie REST catalog 2165 @julwin
Thanks to all the contributors, especially new contributors @Peter Huang and @julwin!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.4.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.3.1...1.4.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/
Hello. I am getting started with OL and Marquez with dbt. I am using dbt-ol. The namespace of the dataset showing up in Marquez is not the namespace I provide using OPENLINEAGE_NAMESPACE. It happens to be the same as the source in Marquez, which is the Snowflake account URI. It's obviously picking up the other env variable OPENLINEAGE_URL, so I am pretty sure it's not the environment. Is this expected?
*Thread Reply:* Hi Drew, thank you for using OpenLineage! I don't know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.
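To illustrate the point (the values below are made up, not taken from the setup above): in a dbt-ol run the job namespace comes from OPENLINEAGE_NAMESPACE, while each dataset's namespace is derived from its datasource, such as the Snowflake account, so the two are expected to differ.
```
# Rough shape of one run event, illustrative values only.
event_fragment = {
    "job": {
        "namespace": "my_team",                        # from OPENLINEAGE_NAMESPACE
        "name": "dbt-run-my_project.orders",
    },
    "outputs": [{
        "namespace": "snowflake://xy12345.us-east-1",  # derived from the datasource
        "name": "ANALYTICS.PUBLIC.ORDERS",
    }],
}
```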
Any idea why "environment-properties" is gone in Spark 3.4+ in StartEvent?
example:
{"environment_properties":{"spark.databricks.clusterUsageTags.clusterName":"<a href="mailto:jason.yip@tredence.com">jason.yip@tredence.com</a>'s Cluster","spark.databricks.job.runId":"","spark.databricks.job.type":"","spark.databricks.clusterUsageTags.azureSubscriptionId":"a4f54399_8db8_4849_adcc_a42aed1fb97f","spark.databricks.notebook.path":"/Repos/jason.yip@tredence.com/segmentation/01_Data Prep","spark.databricks.clusterUsageTags.clusterOwnerOrgId":"4679476628690204","MountPoints":[{"MountPoint":"/databricks-datasets","Source":"databricks_datasets"},{"MountPoint":"/Volumes","Source":"UnityCatalogVolumes"},{"MountPoint":"/databricks/mlflow-tracking","Source":"databricks/mlflow-tracking"},{"MountPoint":"/databricks-results","Source":"databricks_results"},{"MountPoint":"/databricks/mlflow-registry","Source":"databricks/mlflow-registry"},{"MountPoint":"/Volume","Source":"DbfsReserved"},{"MountPoint":"/volumes","Source":"DbfsReserved"},{"MountPoint":"/","Source":"DatabricksRoot"},{"MountPoint":"/volume","Source":"DbfsReserved"}],"User":"<a href="mailto:jason.yip@tredence.com">jason.yip@tredence.com</a>","UserId":"4768657035718622","OrgId":"4679476628690204"}}
*Thread Reply:* Is this related to any OL version? In OL 1.2.2 we've added an extra variable spark.databricks.clusterUsageTags.clusterAllTags to be captured, but this should not break things.
I think we're facing some issues on recent databricks runtime versions. Here is an issue for this: https://github.com/OpenLineage/OpenLineage/issues/2131
Is the problem you describe specific to some databricks runtime versions?
*Thread Reply:* Btw I don't understand the code flow entirely. If we are talking about a different classpath only, I see there's a Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.
I am happy to jump on a call to show you if needed
*Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?
/**
 * We get exact copies of OL events for org.apache.spark.scheduler.SparkListenerJobStart and
 * org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart. The same happens for end
 * events.
 *
 * @return
 */
private boolean isOnJobStartOrEnd(SparkListenerEvent event) {
  return event instanceof SparkListenerJobStart || event instanceof SparkListenerJobEnd;
}
@here I am trying out the Databricks Spark integration, and in one of the events I am getting an OpenLineage event where the output dataset has a facet called symlinks. The statement that generated this event is this SQL:
CREATE TABLE IF NOT EXISTS covid_research.covid_data
USING CSV
LOCATION '<abfss://oltptestdata@jeevanacceldata.dfs.core.windows.net/testdata/johns-hopkins-covid-19-daily-dashboard-cases-by-states.csv>'
OPTIONS (header "true", inferSchema "true");
Can someone kindly let me know what this symlinks facet is? I tried reading the spec but did not understand it completely.
*Thread Reply:* I use it to get the table with database name
*Thread Reply:* so can I think of it like: if there is a symlink, then that table is kind of a reference to the original dataset?
@here When I am running this SQL as part of a Databricks notebook, I am receiving an OL event where I see only an output dataset, and there is no input dataset or a symlinks facet inside the dataset to map it to the underlying Azure storage object. Can anyone kindly help with this?
spark.sql(f"CREATE TABLE IF NOT EXISTS covid_research.uscoviddata USING delta LOCATION '<abfss://oltptestdata@jeevanacceldata.dfs.core.windows.net/testdata/modified-delta>'")
{
"eventTime": "2023-10-11T10:47:36.296Z",
"producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"schemaURL": "<https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent>",
"eventType": "COMPLETE",
"run": {
"runId": "d0f40be9-b921-4c84-ac9f-f14a86c29ff7",
"facets": {
"spark.logicalPlan": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet>",
"plan": [
{
"class": "org.apache.spark.sql.catalyst.plans.logical.CreateTable",
"num-children": 1,
"name": 0,
"tableSchema": [],
"partitioning": [],
"tableSpec": null,
"ignoreIfExists": true
},
{
"class": "org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier",
"num-children": 0,
"catalog": null,
"identifier": null
}
]
},
"spark_version": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet>",
"spark-version": "3.3.0",
"openlineage-spark-version": "1.2.2"
},
"processing_engine": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-1-0/ProcessingEngineRunFacet.json#/$defs/ProcessingEngineRunFacet>",
"version": "3.3.0",
"name": "spark",
"openlineageAdapterVersion": "1.2.2"
}
}
},
"job": {
"namespace": "default",
"name": "adb-3942203504488904.4.azuredatabricks.net.create_table.covid_research_db_uscoviddata",
"facets": {}
},
"inputs": [],
"outputs": [
{
"namespace": "dbfs",
"name": "/user/hive/warehouse/covid_research.db/uscoviddata",
"facets": {
"dataSource": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>",
"name": "dbfs",
"uri": "dbfs"
},
"schema": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>",
"fields": []
},
"storage": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet>",
"storageLayer": "unity",
"fileFormat": "parquet"
},
"symlinks": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
"identifiers": [
{
"namespace": "/user/hive/warehouse/covid_research.db",
"name": "covid_research.uscoviddata",
"type": "TABLE"
}
]
},
"lifecycleStateChange": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark>",
"_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>",
"lifecycleStateChange": "CREATE"
}
},
"outputFacets": {}
}
]
}
*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. It's incredibly distracting to have Slack notify me of a message that isn't pertinent to me.
*Thread Reply:* sure noted @Damien Hawes
Hi there, I am trying to make an API call to get column-lineage information; could you please let me know the URL construct to retrieve the same? As per the API documentation I am passing the following URL to GET column-lineage: http://localhost:5000/api/v1/column-lineage but getting error code 400. Thanks
*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you've seeded Marquez with test metadata, you can use:
curl -XGET "<http://localhost:5002/api/v1/column-lineage?nodeId=datasetField%3Afood_delivery%3Apublic.delivery_7_days%3Acustomer_email>"
You can view the API docs for column lineage here!
*Thread Reply:* Thanks Willy. The documentation says 'namespace', so I constructed the API call like this: 'http://marquez-web:3000/api/v1/column-lineage/nodeId=datasetField:file:/home/jovyan/Downloads/event_attribute.csv:eventType' but it is still not working
*Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>
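A small sketch of putting that together in Python (stdlib only; the namespace, dataset, and field names below are the seeded Marquez example values, and the host/port depend on how Marquez is deployed):
```
from urllib.parse import quote
from urllib.request import urlopen

namespace = "food_delivery"
dataset = "public.delivery_7_days"
field = "customer_email"

# nodeId format: datasetField:<namespace>:<dataset>:<field name>
node_id = f"datasetField:{namespace}:{dataset}:{field}"
url = f"http://localhost:5000/api/v1/column-lineage?nodeId={quote(node_id, safe='')}"

with urlopen(url) as resp:
    print(resp.read().decode())
```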
@channel Friendly reminder: this month's TSC meeting, open to all, is tomorrow at 10 am PT: https://openlineage.slack.com/archives/C01CK9T7HKR/p1696531454431629
*Thread Reply:* Newly added discussion topics:
• a proposal to add a Registry of Consumers and Producers
• a dbt issue to add OpenLineage Dataset names to the Manifest
• a proposal to add Dataset support in Spark LogicalPlan Nodes
• a proposal to institute a certification process for new integrations
This might be a dumb question, but I guess I need to set up local Spark in order for the Spark tests to run successfully?
*Thread Reply:* just follow these instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#build
*Thread Reply:* when trying to install openlineage-java locally via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, I am receiving this error:
```> Task :signMavenJavaPublication FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':signMavenJavaPublication'.
> Cannot perform signing task ':signMavenJavaPublication' because it has no configured signatory```
*Thread Reply:* which Java are you using? What is your operating system (is it Windows)?
*Thread Reply:* yes it is Windows, I downloaded Java 8 but I can try to build it with the Linux subsystem or Mac
*Thread Reply:* * Where: Build file '/mnt/c/Users/jason/Downloads/github/OpenLineage/integration/spark/build.gradle' line: 9
* What went wrong: An exception occurred applying plugin request [id: 'com.adarshr.test-logger', version: '3.2.0']
> Failed to apply plugin [id 'com.adarshr.test-logger']
> Could not generate a proxy class for class com.adarshr.gradle.testlogger.TestLoggerExtension.
* Try:
*Thread Reply:* we don't have any restrictions on Windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on CircleCI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea
*Thread Reply:* ... 111 more
Caused by: java.lang.ClassNotFoundException: org.gradle.api.provider.HasMultipleValues
... 117 more
*Thread Reply:* @Paweł Leszczyński now I am using gradlew instead of gradle on Windows because the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop, and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
*Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem
*Thread Reply:* Now getting class not found despite build and test succeeding
*Thread Reply:* I uploaded the wrong jar... there are so many jars; only the jar in the spark folder works, not the ones in subfolders
Hi team, I am running the following pyspark code in a cell: ```print("SELECTING 100 RECORDS FROM METADATA TABLE") df = spark.sql("""select * from