Julien Le Dem (julien@apache.org)
2020-10-20 21:01:02

@Julien Le Dem has joined the channel

Mars Lan (mars.th.lan@gmail.com)
2020-10-21 08:23:39

@Mars Lan has joined the channel

Wes McKinney (wesmckinn@gmail.com)
2020-10-21 11:39:13

@Wes McKinney has joined the channel

Ryan Blue (rblue@netflix.com)
2020-10-21 12:46:39

@Ryan Blue has joined the channel

Drew Banin (drew@fishtownanalytics.com)
2020-10-21 12:53:42

@Drew Banin has joined the channel

Willy Lulciuc (willy@datakin.com)
2020-10-21 13:29:49

@Willy Lulciuc has joined the channel

Lewis Hemens (lewis@dataform.co)
2020-10-21 13:52:50

@Lewis Hemens has joined the channel

Julien Le Dem (julien@apache.org)
2020-10-21 14:15:41

This is the official start of the OpenLineage initiative. Thank you all for joining. First item is to provide feedback on the doc: https://docs.google.com/document/d/1qL_mkd9lFfe_FMoLTyPIn80-fpvZUAdEIfrabn8bfLE/edit

🎉 Willy Lulciuc, Abe Gong
Abe Gong (abe@superconductive.com)
2020-10-21 23:22:03

@Abe Gong has joined the channel

Shirshanka Das (sdas@linkedin.com)
2020-10-22 13:50:35

@Shirshanka Das has joined the channel

deleted_profile (fengtao04@gmail.com)
2020-10-23 15:03:44

@deleted_profile has joined the channel

Chris White (chris@prefect.io)
2020-10-23 19:30:36

@Chris White has joined the channel

Julien Le Dem (julien@apache.org)
2020-10-24 19:29:04

Thanks all for joining. In addition to the Google doc, I have opened a pull request with an initial OpenAPI spec: https://github.com/OpenLineage/OpenLineage/pull/1 The goal is to specify the initial model (just plain lineage) that will be extended with various facets. It is not intended to restrict the spec to HTTP: the same PUT calls, which return no output, can be translated to any async protocol.
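For readers following along, here is a minimal sketch of what a run-state event in that model looks like, written as a Python dict. The field names reflect where the spec eventually landed and are illustrative, not quoted from the PR:

```python
# A minimal OpenLineage run-state event, sketched as a Python dict.
# Field names follow the shape the spec converged on; values are made up.
run_event = {
    "eventType": "START",                 # START / COMPLETE / ABORT / FAIL
    "eventTime": "2020-10-24T19:30:00Z",  # when this state change was observed
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [{"namespace": "my-namespace", "name": "input-table"}],
    "outputs": [{"namespace": "my-namespace", "name": "output-table"}],
    "producer": "https://github.com/my-org/my-scheduler",
}
```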

Wes McKinney (wesmckinn@gmail.com)
2020-10-25 12:13:26

Am I the only weirdo that would prefer a Google Group mailing list to Slack for communicating?

👍 Ryan Blue
Julien Le Dem (julien@apache.org)
2020-10-25 17:22:09

*Thread Reply:* slack is the new email?

Wes McKinney (wesmckinn@gmail.com)
2020-10-25 17:40:19

*Thread Reply:* :(

Ryan Blue (rblue@netflix.com)
2020-10-27 12:27:04

*Thread Reply:* I'd prefer a google group as well

Ryan Blue (rblue@netflix.com)
2020-10-27 12:27:25

*Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through

Ryan Blue (rblue@netflix.com)
2020-10-27 12:27:38

*Thread Reply:* And I think it is also better for having thoughtful design discussions

Julien Le Dem (julien@apache.org)
2020-10-29 15:40:14

*Thread Reply:* I’m happy to create a google group if that would help.

Julien Le Dem (julien@apache.org)
2020-10-29 15:45:23

*Thread Reply:* Here it is: https://groups.google.com/g/openlineage

Julien Le Dem (julien@apache.org)
2020-10-29 15:46:34

*Thread Reply:* Slack is more of a way to nudge discussions along; we can use GitHub issues or the mailing list to discuss specific points

Julien Le Dem (julien@apache.org)
2020-11-03 17:34:53

*Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending GitHub issue updates to that list?

Ryan Blue (rblue@netflix.com)
2020-11-03 17:35:34

*Thread Reply:* I don't really know how to do that

Ravi Suhag (suhag.ravi@gmail.com)
2021-04-02 07:18:25

*Thread Reply:* @Julien Le Dem How about using GitHub Discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from the repository settings. One positive side I see is that it will be really easy to follow, and it gives one separate place to go and look for the discussions and ideas being discussed.

Julien Le Dem (julien@apache.org)
2021-04-02 19:51:55

*Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions

🙌 Ravi Suhag
Wes McKinney (wesmckinn@gmail.com)
2020-10-25 12:14:06

Or GitHub Issues

Julien Le Dem (julien@apache.org)
2020-10-25 17:21:44

*Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement

Laurent Paris (laurent@datakin.com)
2020-10-26 19:28:17

@Laurent Paris has joined the channel

Josh Benamram (josh@databand.ai)
2020-10-27 21:17:30

@Josh Benamram has joined the channel

Victor Shafran (victor.shafran@databand.ai)
2020-10-28 04:07:27

@Victor Shafran has joined the channel

Victor Shafran (victor.shafran@databand.ai)
2020-10-28 04:09:00

👋 Hi everyone!

👋 Willy Lulciuc, Abe Gong, Drew Banin, Julien Le Dem
Zhamak Dehghani (zdehghan@thoughtworks.com)
2020-10-29 17:59:31

@Zhamak Dehghani has joined the channel

Julien Le Dem (julien@apache.org)
2020-11-02 18:30:51

I’ve opened a github issue to propose OpenAPI as the way to define the lineage metadata: https://github.com/OpenLineage/OpenLineage/issues/2 I have also started a thread on the OpenLineage group: https://groups.google.com/g/openlineage/c/2i7ogPl1IP4 Discussion should happen there: ^

Evgeny Shulman (evgeny.shulman@databand.ai)
2020-11-04 10:56:00

@Evgeny Shulman has joined the channel

Julien Le Dem (julien@apache.org)
2020-11-05 20:51:22

FYI I have updated the PR with a simple generator: https://github.com/OpenLineage/OpenLineage/pull/1

Daniel Henneberger (danny@datakin.com)
2020-11-11 15:05:46

@Daniel Henneberger has joined the channel

Julien Le Dem (julien@apache.org)
2020-12-08 17:27:57

Please send me your github ids if you wish to be added to the github repo

👍 Willy Lulciuc
Fabrice Etanchaud (fabrice.etanchaud@netc.fr)
2020-12-10 02:10:35

@Fabrice Etanchaud has joined the channel

Julien Le Dem (julien@apache.org)
2020-12-10 17:04:29

As mentioned on the mailing list, the initial spec is ready for a final review. Thanks to all who gave feedback so far.

Julien Le Dem (julien@apache.org)
2020-12-10 17:04:39

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

Julien Le Dem (julien@apache.org)
2020-12-10 17:04:51

The next step will be to define individual facets

Julien Le Dem (julien@apache.org)
2020-12-13 00:28:11

I have opened a PR to update the README: https://openlineage.slack.com/archives/C01EB6DCLHX/p1607835827000100

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2020-12-14 17:55:46

*Thread Reply:* Looks great!

Maxime Beauchemin (max@preset.io)
2020-12-13 17:45:49

👋

👋 Shirshanka Das, Julien Le Dem, Willy Lulciuc, Arthur Wiedmer, Mario Measic
Julien Le Dem (julien@apache.org)
2020-12-14 20:19:57

I’m planning to merge https://github.com/OpenLineage/OpenLineage/pull/1 soon. That will be the base that we can iterate on and will enable starting the discussion on individual facets

Julien Le Dem (julien@apache.org)
2020-12-16 21:40:52

Thank you all for the feedback. I have made an update to the initial spec addressing the final comments

Julien Le Dem (julien@apache.org)
2020-12-16 21:41:16

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

Julien Le Dem (julien@apache.org)
2020-12-19 11:21:27

The contributing guide is available here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md Here is an example proposal for adding a new facet: https://github.com/OpenLineage/OpenLineage/issues/9

👍 Josh Benamram, Victor Shafran
Julien Le Dem (julien@apache.org)
2020-12-19 18:27:36

Welcome to the newly joined members 🙂 👋

👋 Chris Lambert, Ananth Packkildurai, Arthur Wiedmer, Abe Gong, ale, James Le, Ha Pham, David Krevitt, Harel Shein
Ash Berlin-Taylor (ash@apache.org)
2020-12-21 05:23:21

Hello! Airflow PMC member here. Super interested in this effort

👋 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2020-12-21 12:15:42

*Thread Reply:* Welcome!

Ash Berlin-Taylor (ash@apache.org)
2020-12-21 05:25:07

I'm joining this slack now, but I'm basically done for the year, so will investigate proposals etc next year

🙌 Willy Lulciuc
Zachary Friedman (zafriedman@gmail.com)
2020-12-21 10:02:37

Hey all 👋 Super curious what people's thoughts are on the best way for data quality tools i.e. Great Expectations to integrate with OpenLineage. Probably a Dataset level facet of some sort (from the 25 minutes of deep spec knowledge I have 😆), but curious if that's something being worked on? @Abe Gong

👋 Abe Gong, Willy Lulciuc
Abe Gong (abe@superconductive.com)
2020-12-21 10:30:51

*Thread Reply:* Yes, that’s about right.

Abe Gong (abe@superconductive.com)
2020-12-21 10:31:45

*Thread Reply:* There’s some subtlety here.

Abe Gong (abe@superconductive.com)
2020-12-21 10:32:02

*Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations

✅ Zachary Friedman
Abe Gong (abe@superconductive.com)
2020-12-21 10:32:57

*Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)

✅ Zachary Friedman
Abe Gong (abe@superconductive.com)
2020-12-21 10:33:20

*Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)

Abe Gong (abe@superconductive.com)
2020-12-21 10:33:56

*Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.

✅ Zachary Friedman
Abe Gong (abe@superconductive.com)
2020-12-21 10:34:47

*Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic

Julien Le Dem (julien@apache.org)
2020-12-21 12:20:53

*Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a descriptive dataset facet for a dataset profile (or dataset update profile). There are other aspects to clarify as well, as @Abe Gong explains above.

✅ James Campbell
Zachary Friedman (zafriedman@gmail.com)
2020-12-21 10:08:24

Also interesting to see where this would hook into Dagster. Because one of the many great features of Dagster IMO is it lets you do stuff like this (albeit without a formal spec). An OpenLineageMaterialization could be interesting

Julien Le Dem (julien@apache.org)
2020-12-21 12:23:41

*Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.

Harikiran Nayak (hari@streamsets.com)
2020-12-21 14:35:11

Congrats @Julien Le Dem @Willy Lulciuc and team on launching OpenLineage!

🙌 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2020-12-21 14:48:11

*Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.

Harikiran Nayak (hari@streamsets.com)
2020-12-21 15:03:29

*Thread Reply:* Yep! It's a validation that the problem is real!

Kriti (kathuriakritihp@gmail.com)
2020-12-22 02:05:45

Hey folks! Worked on a variety of lineage problems across domains. Super excited about this initiative!

👋 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2020-12-22 13:23:43

*Thread Reply:* Welcome!

👋 Kriti
Julien Le Dem (julien@apache.org)
2020-12-30 22:30:23

*Thread Reply:* What are your current use cases for lineage?

Julien Le Dem (julien@apache.org)
2020-12-22 19:54:33

(for review) Proposal issue template: https://github.com/OpenLineage/OpenLineage/pull/11

Julien Le Dem (julien@apache.org)
2020-12-22 19:55:16

for people interested, #github-notifications has the github integration that will notify of new PRs …

Martin Charrel (martin.charrel@datadoghq.com)
2020-12-29 09:39:46

👋 Hello! I'm currently working on lineage systems @ Datadog. Super excited to learn more about this effort

👋 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2020-12-30 22:28:54

*Thread Reply:* Welcome!

Julien Le Dem (julien@apache.org)
2020-12-30 22:29:43

*Thread Reply:* Would you mind sharing your main use cases for collecting lineage?

Marko Jamedzija (marko@popcore.com)
2021-01-03 05:54:34

Hi! I’m also working on a similar topic for some time. Really looking forward to having these ideas standardized 🙂

👋 Willy Lulciuc
Alexander Gilfillan (agilfillan@dealerinspire.com)
2021-01-05 11:29:31

I would be interested to see how to extend this to dashboards/visualizations, if that still falls within the scope of this project.

Julien Le Dem (julien@apache.org)
2021-01-05 12:55:01

*Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness)? Is the data correct (data quality)? Observing changes upstream of the dashboard will provide insight into what’s happening when freshness or quality suffer

Alexander Gilfillan (agilfillan@dealerinspire.com)
2021-01-05 13:20:41

*Thread Reply:* 100%. On a granular scale, the difference between a visualization and a dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool; Redash would be an example in this case.

Julien Le Dem (julien@apache.org)
2021-01-05 15:15:23

*Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.

Alexander Gilfillan (agilfillan@dealerinspire.com)
2021-01-06 18:20:06

*Thread Reply:* It could be. It's interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize - pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.

That's the hard part of this: how do you model a visualization/dashboard across all the possible ways they can be created, since it differs depending on how the tool you use abstracts away creating a visualization.

Jason Reid (reid.david.jason@gmail.com)
2021-01-05 17:06:02

👋 Hi everyone!

🙌 Willy Lulciuc, Arthur Wiedmer
👋 Abe Gong
Jason Reid (reid.david.jason@gmail.com)
2021-01-05 17:10:22

*Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success

Julien Le Dem (julien@apache.org)
2021-01-05 18:12:48

*Thread Reply:* Hi Jason and welcome

Julien Le Dem (julien@apache.org)
2021-01-05 18:15:12

A reference implementation of the OpenLineage initial spec is in progress in Marquez: https://github.com/MarquezProject/marquez/pull/880

Julien Le Dem (julien@apache.org)
2021-01-07 12:46:19

*Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.

When: Thursday, January 7th at 10AM PST
Where: https://us02web.zoom.us/j/89344845719?pwd=Y09RZkxMZHc2U3pOTGZ6SnVMUUVoQT09

Julien Le Dem (julien@apache.org)
2021-01-07 12:46:44

*Thread Reply:* that’s in 15 min

Julien Le Dem (julien@apache.org)
2021-01-12 17:10:23

*Thread Reply:* And it’s merged!

Julien Le Dem (julien@apache.org)
2021-01-12 17:10:53

*Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec

Jon Loyens (jon@data.world)
2021-01-06 17:43:02

👋 Hi everyone! I'm one of the co-founders at data.world and looking forward to hanging out here

👋 Julien Le Dem, Willy Lulciuc
Elena Goydina (egoydina@provectus.com)
2021-01-11 11:39:20

👋 Hi everyone! I was looking for the roadmap and don't see any. Does it exist?

Julien Le Dem (julien@apache.org)
2021-01-13 19:06:34

*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation done, the next steps are to define more facets (for example, data shape, dataset size, etc.), provide clients to facilitate integrations (Java, Python, …), and implement more integrations (Spark is in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of facets is to enable numerous independent parallel efforts

Julien Le Dem (julien@apache.org)
2021-01-13 19:06:48

*Thread Reply:* Is there something you are interested about in particular?

Julien Le Dem (julien@apache.org)
2021-01-13 19:09:42

I have opened a proposal to move the spec to JSON Schema; this will make it more focused and decouple it from HTTP: https://github.com/OpenLineage/OpenLineage/issues/15

👍 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-01-19 12:26:39

Here is a PR with the corresponding change: https://github.com/OpenLineage/OpenLineage/pull/17

Xinbin Huang (bin.huangxb@gmail.com)
2021-02-01 17:07:50

Really excited to see this project! I am curious what's the current state and the roadmap of it?

Julien Le Dem (julien@apache.org)
2021-02-01 17:55:59

*Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md The process to contribute to the model is described here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md In particular, now we’d want to contribute more facets and integrations. Marquez has a reference implementation: https://github.com/MarquezProject/marquez/pull/880 On the roadmap:
• define more facets: data profile, etc.
• more integrations
• java/python client
You can see current discussions here: https://github.com/OpenLineage/OpenLineage/issues

✅ Xinbin Huang
Julien Le Dem (julien@apache.org)
2021-02-01 17:56:43

For people curious about following github activity you can subscribe to: #github-notifications

Julien Le Dem (julien@apache.org)
2021-02-01 17:57:05

*Thread Reply:* It is not on general, as it can be a bit noisy

Zachary Friedman (zafriedman@gmail.com)
2021-02-09 13:50:17

Random-ish question: why are producer and schemaURL nested under the nominalTime facet in the spec for postRunStateUpdate? It seems like the producer of this metadata isn’t related to the time of the lineage event?

Julien Le Dem (julien@apache.org)
2021-02-09 20:02:48

*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900

Julien Le Dem (julien@apache.org)
2021-02-09 20:01:49

producer and schemaURL are defined in the BaseFacet type and therefore all facets (including nominalTime) have them.
• The producer is an identifier for the code that produced the metadata. The idea is that different facets in the same event can be produced by different libraries. For example, in a Spark integration, Iceberg could emit its own facet in addition to other facets. The producer identifies what produced what.
• The _schemaURL is the identifier of the version of the schema for a given facet. Similarly, an event could contain a mixture of core facets from the spec as well as custom facets. This makes explicit what the definition for this facet is.

👍 Zachary Friedman
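To make the layering concrete, here is a hedged sketch of a run carrying a nominalTime facet with the two base-facet fields described above; the URLs are placeholders, not the canonical schema locations:

```python
# Every facet carries the BaseFacet fields: _producer (who emitted this
# particular facet) and _schemaURL (which schema version defines it).
run = {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {
        "nominalTime": {
            "_producer": "https://github.com/my-org/my-scheduler",        # placeholder
            "_schemaURL": "https://example.com/NominalTimeRunFacet.json",  # placeholder
            "nominalStartTime": "2021-02-09T08:00:00Z",
        }
    },
}
```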
Julien Le Dem (julien@apache.org)
2021-02-09 21:27:05

As discussed previously, I have separated a Json Schema spec for the OpenLineage events from the OpenAPI spec defining a HTTP endpoint: https://github.com/OpenLineage/OpenLineage/pull/17

Julien Le Dem (julien@apache.org)
2021-02-09 21:27:26

*Thread Reply:* Feel free to comment, this is ready to merge

Willy Lulciuc (willy@datakin.com)
2021-02-11 20:12:18

*Thread Reply:* Thanks, Julien. The new spec format looks great 👍

Julien Le Dem (julien@apache.org)
2021-02-09 21:34:31

And the corresponding code generator to start the java (and other languages) client: https://github.com/OpenLineage/OpenLineage/pull/18

👍 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-02-11 22:25:24

Those are merged: we now have a JSON Schema, an OpenAPI spec that extends it, and a generated Java model

🎉 Willy Lulciuc
🙌 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-02-17 19:39:55

Following up on a previous discussion: This proposal and the accompanying PR add the notion of InputFacets and OutputFacets: https://github.com/OpenLineage/OpenLineage/issues/20 In summary, we are collecting metadata about jobs and datasets. At the Job level, when it’s fairly static metadata (not changing every run, like the current code version of the job) it goes in a JobFacet. When it is dynamic and changes every run (like the schedule time of the run), it goes in a RunFacet. This proposal is adding the same notion at the Dataset level: when it is static and doesn’t change every run (like the dataset schema) it goes in a Dataset facet. When it is dynamic and changes every run (like the input time interval of the dataset being read, or the statistics of the dataset being written) it goes in an inputFacet or an outputFacet. This enables Job and Dataset versioning logic, to keep track of what changes in the definition of something vs runtime changes

👍 Kevin Mellott, Petr Šimeček
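A hedged sketch of the distinction the proposal draws, as a fragment of an event; the schema facet is from the spec, while the timeInterval input facet is a made-up name for illustration:

```python
# Dataset facets describe the dataset itself (static across runs);
# inputFacets/outputFacets describe this run's interaction with it.
event_fragment = {
    "inputs": [{
        "namespace": "my-namespace",
        "name": "transactions",
        "facets": {   # dataset-level: changes only when the dataset definition changes
            "schema": {"fields": [{"name": "amount", "type": "DECIMAL"}]}
        },
        "inputFacets": {   # run-level: changes every run ("timeInterval" is illustrative)
            "timeInterval": {
                "start": "2021-02-17T00:00:00Z",
                "end": "2021-02-17T01:00:00Z",
            }
        },
    }]
}
```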
Julien Le Dem (julien@apache.org)
2021-02-19 14:27:23

*Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 Thank you.

Julien Le Dem (julien@apache.org)
2021-02-19 14:27:46

*Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23

Weixi Li (ashlee.happy@gmail.com)
2021-02-19 04:14:59

Hi, I am really interested in this project and Marquez, but I am a bit unclear about the differences and relationship between those two projects. As I understand it, OpenLineage provides an API specification for tools running jobs (e.g. Spark, Airflow) to send out an event to update the run state of the job; then, for example, Marquez can be the destination for those events and show the data lineage from those run state updates. When you say there is a reference implementation of the OpenLineage spec in Marquez, do you mean there is a /lineage endpoint implemented in the Marquez API https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/api/OpenLineageResource.java? Then my question is: what is the next step after Marquez has this API? How does Marquez use that endpoint to integrate with Airflow, for example? I did not find the usage of that endpoint in the Marquez project. The library marquez-airflow, which integrates Airflow with Marquez, seems to only use the other Marquez APIs to build the data lineage. Or did I misunderstand something? Thank you very much!

Weixi Li (ashlee.happy@gmail.com)
2021-02-19 05:03:21

*Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?

Julien Le Dem (julien@apache.org)
2021-02-19 12:41:23

*Thread Reply:* Just restating some of my answers from the Marquez Slack for the benefit of folks here.

• OpenLineage defines the schema to collect metadata
• Marquez has a /lineage endpoint implementing the OpenLineage spec to receive this metadata, implemented by the OpenLineageResource you pointed out
• In the future, other projects will also have OpenLineage endpoints to receive this metadata
• The Marquez Spark integration produces OpenLineage events: https://github.com/MarquezProject/marquez/tree/main/integrations/spark
• The Marquez Airflow integration still uses the original Marquez API but will be migrated to OpenLineage
• All new integrations will use OpenLineage metadata

Weixi Li (ashlee.happy@gmail.com)
2021-02-22 03:55:18

*Thread Reply:* thank you! very clear answer🙂

Ernie Ostic (ernie.ostic@getmanta.com)
2021-03-02 13:49:04

Hi Everyone. Just got started with the Marquez REST API and a little bit into the Open Lineage aspects. Very easy to use. Great work on the curl examples for getting started. I'm working with Postman and am happy to share a collection I have once I finish testing. A question about tags --- are there plans for a "post new tag" call in the API? ...or maybe I missed it. Thx. --ernie

Julien Le Dem (julien@apache.org)
2021-03-02 17:51:29

*Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300

Julien Le Dem (julien@apache.org)
2021-03-02 17:51:02

OpenLineage doesn’t have a Tag facet yet (but tags are defined in the Marquez api). Feel free to open a proposal on the github repo. https://github.com/OpenLineage/OpenLineage/issues/new/choose

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-03-16 11:21:37

Hey everyone. What's the story for stream processing (like Flink jobs) in OpenLineage? It does not fit cleanly with the runEvent model, which requires issuing 1 START event and 1 of [ COMPLETE, ABORT, FAIL ] event per run, as unbounded stream jobs usually do not complete.

I'd imagine a few "workarounds" that work for some cases - for example, imagine a job calculating hourly aggregations of transactions and dumping them into Parquet files for further analysis. The job could issue an OTHER event type adding an additional output dataset every hour. Another option would be to create a new "run" every hour, just indicating the added data.
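A hypothetical sketch of that first workaround in Python; emit() and the job/dataset names stand in for whatever client and naming you actually use:

```python
from datetime import datetime, timezone

def now_iso8601() -> str:
    return datetime.now(timezone.utc).isoformat()

def hourly_checkpoint(emit, run_id: str, hour_path: str) -> None:
    # The streaming job emitted START once; each hour it emits an OTHER
    # event adding the newly closed Parquet folder as one more output.
    emit({
        "eventType": "OTHER",        # the run is still in progress
        "eventTime": now_iso8601(),
        "run": {"runId": run_id},
        "job": {"namespace": "streaming", "name": "hourly_aggregation"},
        "outputs": [{"namespace": "s3://warehouse", "name": hour_path}],
    })
```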

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-16 15:07:04

*Thread Reply:* Ha, I signed up just to ask this precise question!

😀 Maciej Obuchowski
Adam Bellemare (adam.bellemare@shopify.com)
2021-03-16 15:07:44

*Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?

Ravi Suhag (suhag.ravi@gmail.com)
2021-04-02 07:24:39

*Thread Reply:* A run event can be emitted when the job starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send events periodically with state RUNNING to inform the system that the job is healthy.

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-16 15:09:31

Similar to @Maciej Obuchowski question about Flink / Streaming jobs - what about Streaming sources (eg: a Kafka topic)? It does fit into the dataset model, more or less. But, has anyone used this yet for a set of streaming sources? Particularly with schema changes over time?

Julien Le Dem (julien@apache.org)
2021-03-16 18:30:46

Hi @Maciej Obuchowski and @Adam Bellemare, streaming jobs are meant to be covered by the spec but I agree there are a few details to iron out.

Julien Le Dem (julien@apache.org)
2021-03-16 18:31:55

In particular, streaming jobs still have runs. Even if they run continuously, they do not run forever: you want to track that a job was started at a point in time with a given version of the code, then stopped and started again after being upgraded, for example.

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2021-03-16 18:32:23

I agree with @Maciej Obuchowski that we would also send OTHER events to keep track of progress.

Julien Le Dem (julien@apache.org)
2021-03-16 18:32:46

For example one could track checkpointing this way.

Julien Le Dem (julien@apache.org)
2021-03-16 18:35:35

For a Kafka topic you could have streaming dataset specific facets or even Kafka specific facets (ex: list of offsets we stopped reading at, schema id, etc )

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-03-17 10:05:53

*Thread Reply:* That's a good idea.

Now I'm wondering - let's say we want to track at which offset a checkpoint ended processing. That would mean we want to expose the checkpoint id, time, and offset. I suppose we don't want to overwrite the previous checkpoint info, so we want to have some collection of data in this facet.

Something like appendable facets would be nice: we could just add new checkpoint info to the collection, instead of having to push all the checkpoint info every time we want to add a new data point.
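For illustration, a purely hypothetical kafkaCheckpoints run facet along those lines; note that because facets replace rather than append, each event re-sends the whole list, which is exactly the pain point raised above:

```python
# Hypothetical custom run facet; no such facet exists in the spec.
checkpoints_facet = {
    "kafkaCheckpoints": {
        "_producer": "https://github.com/my-org/flink-lineage",            # placeholder
        "_schemaURL": "https://example.com/KafkaCheckpointsRunFacet.json",  # placeholder
        "checkpoints": [   # full list, re-sent on every event
            {"id": 41, "time": "2021-03-17T09:00:00Z", "offset": 105000},
            {"id": 42, "time": "2021-03-17T09:05:00Z", "offset": 106200},
        ],
    }
}
```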

Julien Le Dem (julien@apache.org)
2021-03-16 18:45:23

Let me know if you have more thoughts

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-17 09:18:49

*Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals

Julien Le Dem (julien@apache.org)
2021-03-17 13:43:29

*Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose

Carlos Zubieta (carlos.zubieta@wizeline.com)
2021-03-16 18:49:00

Hi everyone, I just heard about OpenLineage and would like to learn more about it. The talks in the repo explain nicely the purpose and general ideas, but I have a couple of questions. Are there any working implementations that produce/consume the spec? Also, are there any discussions/guides on standard information, naming conventions, etc. in the facets?

Julien Le Dem (julien@apache.org)
2021-03-16 20:05:06

Hi @Carlos Zubieta here are some pointers ^

Julien Le Dem (julien@apache.org)
2021-03-16 20:06:51

Marquez has a reference implementation of an OpenLineage endpoint. The Spark integration emits OpenLineage events.

Carlos Zubieta (carlos.zubieta@wizeline.com)
2021-03-16 20:56:37

Thank you @Julien Le Dem!!! Will take a close look

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-17 15:41:50

Q related to People/Teams/Stakeholders/Owners with regards to Jobs and Datasets (didn’t find anything in search): Let’s say I have a dataset, and there are a number of other downstream jobs that ingest from it. In the case that the dataset is mutated in some way (or deleted, archived, etc), how would I go about notifying the stakeholders of that set about the changes?

Just to be clear, I’m not concerned about the mechanics of doing this, just that there is someone who needs to be notified, who has self-registered on this set. Similarly, I want to manage the datasets I am concerned about, where I can grab a list of all the datasets I tagged myself on.

This seems to suggest that we could do with additional entities outside of Dataset, Run, Job. However, at the same time, I can see how this can lead to an explosion of other entities. Any thoughts on this particular domain? I think I could achieve something similar with aspects, but this would require that I update the aspect on each entity if I want to wholesale update the user contact, say their email address.

Has anyone else run into something like this? Have you any advice? Or is this something that may be upcoming in the spec?

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-17 16:42:24

*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and Datasets, Jobs, and Runs, as part of an integration element?”

Julien Le Dem (julien@apache.org)
2021-03-18 16:03:55

*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as aspects above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.

Julien Le Dem (julien@apache.org)
2021-03-18 17:52:13

*Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/… ? What else are you thinking about?

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-22 09:15:19

*Thread Reply:* Let me pull in my colleagues

Adam Bellemare (adam.bellemare@shopify.com)
2021-03-22 09:15:24

*Thread Reply:* Standby

Olessia D'Souza (olessia.dsouza@shopify.com)
2021-03-22 10:59:57

*Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thought on this:

Olessia D'Souza (olessia.dsouza@shopify.com)
2021-03-22 11:00:45

*Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:

  • If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
  • According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list? This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I a) add individuals to the list b) track changes to the list over time. Let me know what I’m missing, because based on what you said above tracking facet changes over time is possible.
  • Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
  • I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facets on a Dataset, and potentially even additional end points to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?

Curious to hear your thoughts on all of this!

Julien Le Dem (julien@apache.org)
2021-03-24 17:06:50

*Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:
> - If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
Correct: the spec defines what facets look like (and how you can make your own custom facets) but it does not make statements about whether facets are required. However, you can have your own validation and make certain things required, if you wish, on the client side.
> - According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list?
Yes, I would indeed consider such a facet on the dataset with the stakeholders.

> This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I
> a) add individuals to the list
You would provide the new list of stakeholders. OpenLineage standardizes lineage collection and defines a format for expressing metadata. Marquez will keep track of how metadata has evolved over time.

> b) track changes to the list over time. Let me know what I’m missing, because based on what you said above tracking facet changes over time is possible. Each event is an observation at a point in time. In a sense they are each immutable. There’s a “current” version but also all the previous ones stored in Marquez. Marquez stores each version of a dataset it received through OpenLineage and exposes an API to see how that evolved over time.

> - Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
The scheduler emits the information that it knows about. For example: “I started this job and it’s reading from this dataset and is writing to this other dataset.” It may or may not be in the domain of the scheduler to know the list of stakeholders. If not, then you could emit different types of events to add a stakeholder facet to a dataset. We may want to refine the spec for that. Actually, I would be curious to hear what you think should be the source of truth for stakeholders. It is not the intent to force everything to come from the scheduler.

  • example 1: stakeholders are people on call for the job, they are defined as part of the job and that also enables alerting
  • example 2: stakeholders are consumers of the jobs: they may be defined somewhere else

> - I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facets on a Dataset, and potentially even additional end points to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?

Julien Le Dem (julien@apache.org)
2021-03-24 17:06:50

*Thread Reply:* Marquez existed before OpenLineage. In particular, the /run endpoint to create and update runs will be deprecated as the OpenLineage /lineage endpoint replaces it. At the moment we are mapping OpenLineage metadata to Marquez. Soon Marquez will have all the facets exposed in the Marquez API. (See: https://github.com/MarquezProject/marquez/pull/894/files) We could make Marquez configurable or pluggable for validation purposes. There is already a notion of LineageListener, for example. Although Marquez collects the metadata, I feel like this validation would be better upstream or with some other mechanism. The question is: when do you create a dataset vs when do you become a stakeholder? What are the various stakeholders, and what is the responsibility of the minimum one stakeholder? I would probably make it required, in order to deploy the job, that a stakeholder is defined. This would apply to the output dataset and would be collected in Marquez.

In general, you are very welcome to make suggestions on additional endpoints for Marquez, and I’m happy to discuss this further as those ideas are progressing.

> Curious to hear your thoughts on all of this! Thanks for taking the time!
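To ground the discussion, a hypothetical stakeholders dataset facet along the lines sketched above; the facet name and fields are invented for illustration:

```python
# Hypothetical dataset facet: the whole stakeholder list is emitted each
# time (facets are replaced, not patched); Marquez keeps prior versions.
stakeholders_facet = {
    "stakeholders": {
        "_producer": "https://github.com/my-org/metadata-publisher",        # placeholder
        "_schemaURL": "https://example.com/StakeholdersDatasetFacet.json",  # placeholder
        "stakeholders": [
            {"id": "team-data-platform", "type": "owner"},
            {"id": "analytics-guild", "type": "consumer"},
        ],
    }
}
```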

Julien Le Dem (julien@apache.org)
2021-05-24 16:27:03

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200

Julien Le Dem (julien@apache.org)
2021-03-24 18:58:00

Thanks for the Python client submission @Maciej Obuchowski https://github.com/OpenLineage/OpenLineage/pull/34

🙌 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-03-24 18:59:50

I also have added a spec to define a standard naming policy. Please review: https://github.com/OpenLineage/OpenLineage/pull/31/files

Julien Le Dem (julien@apache.org)
2021-03-31 23:45:35

We now have a python client! Thanks @Maciej Obuchowski

👍 Maciej Obuchowski, Kevin Mellott, Ravi Suhag, Ross Turk, Willy Lulciuc, Mirko Raca
Zachary Friedman (zafriedman@gmail.com)
2021-04-02 19:37:36

Question, what do you folks see as the canonical mechanism for receiving OpenLineage events? Do you see an agent like statsd? Or do you see this as purely an API spec that services could implement? Do you see producers of lineage data writing code to send formatted OpenLineage payloads to arbitrary servers that implement receipt of these events? Curious what the long-term vision is here related to how an ecosystem of producers and consumers of payloads would interact?

Julien Le Dem (julien@apache.org)
2021-04-02 19:54:52

*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using OpenLineage events to sync between systems)

Julien Le Dem (julien@apache.org)
2021-04-02 19:55:32

*Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting OpenLineage events

Zachary Friedman (zafriedman@gmail.com)
2021-04-03 18:03:01

*Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.

Michael Collado (collado.mike@gmail.com)
2021-04-05 17:51:57

*Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.

Willy Lulciuc (willy@datakin.com)
2021-04-06 20:33:51

*Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.
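Concretely, the hand-off described here can be as simple as an HTTP POST; a hedged sketch against a local Marquez (default API port 5000), with an illustrative dbt-style payload:

```python
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-06T20:00:00Z",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "dbt", "name": "models.orders"},   # illustrative names
    "inputs": [],
    "outputs": [{"namespace": "warehouse", "name": "analytics.orders"}],
    "producer": "https://github.com/my-org/dbt-openlineage",  # placeholder
}

# Marquez exposes its OpenLineage endpoint at /api/v1/lineage.
resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()
```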

Zachary Friedman (zafriedman@gmail.com)
2021-04-07 10:47:06

*Thread Reply:* Thanks both!

Julien Le Dem (julien@apache.org)
2021-04-13 20:52:53

*Thread Reply:* sorry, I was away last week. Yes that sounds right.

Jakub Moravec (jkb.moravec@gmail.com)
2021-04-07 09:41:09

Hi everyone, I just started discovering OpenLineage and Marquez, and it looks great - the quick-start tutorial is very helpful! One question though: I pushed some metadata to Marquez using the Lineage POST endpoint, and when I try to confirm that everything was created using the Marquez REST API, everything is there ... but I don't see these new objects in the Marquez UI... what is the best way to investigate where the issue is?

Willy Lulciuc (willy@datakin.com)
2021-04-14 13:12:31

*Thread Reply:* Welcome, @Jakub Moravec 👋 . Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner in the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.

Jakob Külzer (jakob.kulzer@shopify.com)
2021-04-08 11:23:18

Hi friends, we're exploring OpenLineage and while building out integration for existing systems we realized there is no obvious way for an input to specify what "version" of that dataset is being consumed. For example, we have a job that rolls up a variable number of what OpenLineage calls dataset versions. By specifying only that dataset, we can't represent the specific instances of it that are actually rolled up. We think that would be a very important part of the lineage graph.

Are there any thoughts on how to address specific dataset versions? Is this where custom input facets would come to play?

Furthermore, based on the spec, it appears that events can provide dataset facets for both inputs and outputs and this seems to open the door to race conditions in which two runs concurrently create dataset versions of a dataset. Is this where the eventTime field is supposed to be used?

Julien Le Dem (julien@apache.org)
2021-04-13 20:56:42

*Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.

Julien Le Dem (julien@apache.org)
2021-04-13 20:57:58

*Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.

Julien Le Dem (julien@apache.org)
2021-04-13 21:01:34

*Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model
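No such facet existed yet at this point; a hypothetical sketch of what Julien suggests, with the version read attached to the run via inputFacets rather than to the dataset itself:

```python
# Hypothetical "version" input facet; field names invented for illustration.
input_dataset = {
    "namespace": "warehouse",
    "name": "daily_rollup_source",
    "inputFacets": {   # run-level: records what this run actually read
        "version": {
            "_producer": "https://github.com/my-org/rollup-job",                # placeholder
            "_schemaURL": "https://example.com/DatasetVersionInputFacet.json",  # placeholder
            "datasetVersion": "snapshot-2021-04-08-0042",
        }
    },
}
```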

Stephen Pimentel (stephenpiment@gmail.com)
2021-04-14 18:20:42

Hi everyone! I’m exploring what existing, open-source integrations are available, specifically for Spark, Airflow, and Trino (PrestoSQL). My team is looking both to use and contribute to these integrations. I’m aware of the integration in the Marquez repo: • Spark: https://github.com/MarquezProject/marquez/tree/main/integrations/spark • Airflow: https://github.com/MarquezProject/marquez/tree/main/integrations/airflow Are there other efforts I should be aware of, whether for these two or for Trino? Thanks for any information!

👋 Arthur Wiedmer, Maciej Obuchowski, Peter Hicks
Zachary Friedman (zafriedman@gmail.com)
2021-04-19 16:17:06

*Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?

Zachary Friedman (zafriedman@gmail.com)
2021-04-19 16:17:23

*Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage

Zachary Friedman (zafriedman@gmail.com)
2021-04-19 16:19:00

*Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.

Antonio Moctezuma (antoniomoctezuma@northwesternmutual.com)
2021-04-16 15:37:24

Hey all! Right now I am working on getting OpenLineage integrated with some microservices here at Northwestern Mutual and was looking for some advice. The current service I am trying to integrate it with moves files from one AWS S3 bucket to another, so I was hoping to track that movement with OpenLineage. However, by my understanding, the inputs that would be passed along in a runEvent are meant to be datasets that have a schema and other properties. But I wanted to have that input represent the file being moved. Is this a proper usage of OpenLineage? Or is this a use case that is still being developed? Any and all help is appreciated!

Julien Le Dem (julien@apache.org)
2021-04-19 21:42:14

*Thread Reply:* This is a proper usage. That schema is optional if it’s not available.

Julien Le Dem (julien@apache.org)
2021-04-19 21:43:27

*Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket

Julien Le Dem (julien@apache.org)
2021-04-19 21:43:58

*Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)
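A hedged sketch of that modeling, assuming the S3 naming convention from Naming.md (namespace = s3://bucket, dataset name = path); the bucket, path, and job names are made up, and the schema facet is simply omitted:

```python
# One run of the file-mover job: source folder in, destination folder out.
move_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-19T21:45:00Z",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "file-mover", "name": "copy_landing_to_curated"},
    "inputs": [{"namespace": "s3://landing-bucket", "name": "/incoming/files"}],
    "outputs": [{"namespace": "s3://curated-bucket", "name": "/incoming/files"}],
    "producer": "https://github.com/my-org/file-mover",  # placeholder
}
```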

Antonio Moctezuma (antoniomoctezuma@northwesternmutual.com)
2021-04-20 11:11:38

*Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!

Antonio Moctezuma (antoniomoctezuma@northwesternmutual.com)
2021-04-20 17:39:24

*Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.

Julien Le Dem (julien@apache.org)
2021-04-20 17:49:15

*Thread Reply:* thanks

Josh Quintus (josh.quintus@gmail.com)
2021-04-28 09:41:45

Hello all. I was reading through the OpenLineage documentation on GitHub and noticed a very minor typo (an instance where “and” should have been “an”). I was just about to create a PR for it but wanted to check with someone to see if that would be something that the team is interested in.

Thanks for the tool, I'm looking forward to learning more about it.

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2021-04-28 20:56:53

*Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.

Josh Quintus (josh.quintus@gmail.com)
2021-04-28 23:21:44

*Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47

Josh Quintus (josh.quintus@gmail.com)
2021-04-28 23:22:41

*Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case... " and found a few additional ones.

Ross Turk (ross@datakin.com)
2021-05-20 16:30:05

For your enjoyment, @Julien Le Dem was on the Data Engineering Podcast talking about OpenLineage!

https://www.dataengineeringpodcast.com/openlineage-data-lineage-specification-episode-187/

🙌 Willy Lulciuc, Maciej Obuchowski, Peter Hicks, Mario Measic
❤️ Willy Lulciuc, Maciej Obuchowski, Peter Hicks, Rogier Werschkull, A Pospiech, Kedar Rajwade, James Le
Ross Turk (ross@datakin.com)
2021-05-20 16:30:09

share and enjoy 🙂

Julien Le Dem (julien@apache.org)
2021-05-21 18:21:23

Also happened yesterday: OpenLineage was accepted by the LF AI & Data Foundation.

🎉 Abe Gong, Willy Lulciuc, Peter Hicks, Maciej Obuchowski, Daniel Henneberger, Harel Shein, Antonio Moctezuma, Josh Quintus, Mariusz Górski, James Le
👏 Matt Turck
Willy Lulciuc (willy@datakin.com)
2021-05-21 19:20:55

*Thread Reply:* Huge milestone! 🙌💯🎊

Julien Le Dem (julien@apache.org)
2021-05-24 16:24:55

I have created a channel to discuss #user-generated-metadata since this came up in a few discussions.

🙌 Willy Lulciuc
Jonathon Mitchal (bigmit83@gmail.com)
2021-05-31 01:28:35

hey guys, does anyone have any sample OpenLineage schemas for S3 please? potentially including facets for attributes in a Parquet file? that would help heaps, thanks. I am trying to slowly bring in a common metadata interface and this will help shape some of the conversations 🙂 with a move to Marquez/DataHub et al over time

🙌 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-06-01 17:56:16

*Thread Reply:* We don’t have S3 (or distributed-filesystem-specific) facets at the moment, but such support would be a great addition! @Julien Le Dem would be best placed to answer whether any work has been done in this area 🙂

Willy Lulciuc (willy@datakin.com)
2021-06-01 17:57:19

*Thread Reply:* Also, happy to answer any Marquez specific questions, @Jonathon Mitchal when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌

Julien Le Dem (julien@apache.org)
2021-06-01 19:58:21

*Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3

Julien Le Dem (julien@apache.org)
2021-06-01 19:59:30

*Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes

Julien Le Dem (julien@apache.org)
2021-06-01 20:00:50

*Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the JSON would look like

Julien Le Dem (julien@apache.org)
2021-06-01 20:01:54

*Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
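As a starting point for such a proposal, a purely hypothetical parquet dataset facet lifting a few fields from FileMetaData in parquet.thrift; nothing like this exists in the spec yet:

```python
# Hypothetical dataset facet for Parquet file attributes.
parquet_facet = {
    "parquet": {
        "_producer": "https://github.com/my-org/s3-crawler",           # placeholder
        "_schemaURL": "https://example.com/ParquetDatasetFacet.json",  # placeholder
        "createdBy": "parquet-mr version 1.12.0",  # from FileMetaData.created_by
        "numRows": 1_200_000,                      # from FileMetaData.num_rows
        "numRowGroups": 12,                        # len(FileMetaData.row_groups)
    }
}
```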

Jonathon Mitchal (bigmit83@gmail.com)
2021-06-01 23:20:50

*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on

🙏 Willy Lulciuc
Pedram (pedram@hightouch.io)
2021-06-01 17:52:08

hi all! just wanted to introduce myself. I'm the Head of Data at Hightouch.io; we build reverse ETL pipelines from the warehouse into various destinations. I've been following OpenLineage for a while now and thought it would be nice to build and expose our runs via the standard, and potentially save that back to the warehouse for analysis/alerting. Really interesting concept, looking forward to playing around with it

👋 Willy Lulciuc, Ross Turk
Julien Le Dem (julien@apache.org)
2021-06-01 20:02:34

*Thread Reply:* Welcome! Let us know if you have any questions

Leo (leorobinovitch@gmail.com)
2021-06-03 19:22:10

Hi all! I have a noob question. As I understand it, one of the main purposes of OpenLineage is to avoid runaway proliferation of bespoke connectors for each data lineage/cataloging/provenance tool to each data source/job scheduler/query engine etc. as illustrated in the problem diagram from the main repo below.

My understanding is that instead, things push to OpenLineage which provides pollable endpoints for metadata tools.

I’m looking at Amundsen, and it seems to have bespoke connectors, but these are pull-based - I don’t need to instrument my data resources to push to Amundsen, I just need to configure Amundsen to poll my data resources (e.g. the Postgres metadata extractor here).

Can OpenLineage do something similar where I can just point it at something to extract metadata from it, rather than instrumenting that thing to push metadata to OpenLineage? If not, I’m wondering why?

Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-06-04 04:45:15

*Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push?
Yes, at its core OpenLineage just enforces the format of the event. We also aim to provide clients - REST, later Kafka, etc. - and some reference implementations, which are now in the Marquez repo. https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/doc/Scope.png

There are several differences between push and poll models. The most important one is that with a push model, the latency between your job and emitting OpenLineage events is very low. With some systems, an internal, push-based model gives you more runtime metadata than observing from the outside. Another one is that a naive poll implementation would need to "rebuild the world" on each change. There are also disadvantages, such as that it's usually easier to write a plugin that extracts data from outside the system than to hook into its internals.

Integration with Amundsen specifically is planned. Although, right now it seems to me that the way to do it is to bypass the databuilder framework and push directly to the underlying database, such as Neo4j, or make Marquez the backend for the Metadata Service: https://raw.githubusercontent.com/amundsen-io/amundsen/master/docs/img/Amundsen_Architecture.png

❤️ Julien Le Dem
Leo (leorobinovitch@gmail.com)
2021-06-04 10:39:51

*Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!

Leo (leorobinovitch@gmail.com)
2021-06-04 10:40:59

*Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday: > Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.

Julien Le Dem (julien@apache.org)
2021-06-04 21:59:59

*Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.

Mariusz Górski (gorskimariusz13@gmail.com)
2021-06-17 10:01:37

*Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and going directly to Neo4j? By design, databuilder is supposed to be very abstract, so any kind of backend can be used with Amundsen. Currently there are at least 4, and Neo4j is just one of them.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-06-17 10:28:52

*Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.

So, how would you see using it? Just to proxy the events to concrete search and metadata backend?

I'm definitely not an Amundsen expert, so feel free to correct me if I'm getting it wrong.

Julien Le Dem (julien@apache.org)
2021-07-07 19:59:28

*Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86

Mariusz Górski (gorskimariusz13@gmail.com)
2021-07-09 02:22:06

*Thread Reply:* thanks Julien! will take a look

Kedar Rajwade (kedar@cloudzealous.com)
2021-06-08 10:00:47

@here Hello, my name is Kedar Rajwade. I happened to come across the OpenLineage project and it looks quite interesting. Is there some kind of getting started guide that I can follow? Also, are there any weekly/bi-weekly calls that I can attend to learn about current/future plans?

Julien Le Dem (julien@apache.org)
2021-06-08 14:16:42

*Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md

Julien Le Dem (julien@apache.org)
2021-06-08 14:17:19

*Thread Reply:* We’re starting a monthly call, I will publish more details here

Julien Le Dem (julien@apache.org)
2021-06-08 14:17:48

*Thread Reply:* Do you have a specific use case in mind?

Kedar Rajwade (kedar@cloudzealous.com)
2021-06-08 21:32:02

*Thread Reply:* Nothing specific yet

Julien Le Dem (julien@apache.org)
2021-06-09 00:49:09

The first instance of the OpenLineage Monthly meeting is tomorrow June 9 at 9am PT: https://calendar.google.com/event?action=TEMPLATE&tmeid=MDRubzk0cXAwZzA4bXRmY24yZjBkdTZzbDNfMjAyMTA2MDlUMTYwMDAwWiBqdWxpZW5AZGF0YWtpbi5jb20&tmsrc=julien%40datakin.com&scp=ALL|https://calendar.google.com/event?action=TEMPLATE&tmeid=MDRubzk0cXAwZzA4bXRmY24yZjBkdT[…]qdWxpZW5AZGF0YWtpbi5jb20&tmsrc=julien%40datakin.com&scp=ALL

🎉 Willy Lulciuc, Maciej Obuchowski
Victor Shafran (victor.shafran@databand.ai)
2021-06-09 08:33:45

*Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?

Leo (leorobinovitch@gmail.com)
2021-06-09 11:00:05

*Thread Reply:* Same!

Julien Le Dem (julien@apache.org)
2021-06-09 11:01:45

*Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite

Kedar Rajwade (kedar@cloudzealous.com)
2021-06-09 12:00:30

*Thread Reply:* @Julien Le Dem Can't access the calendar.

Kedar Rajwade (kedar@cloudzealous.com)
2021-06-09 12:00:43

*Thread Reply:* Can you please share the meeting details

Julien Le Dem (julien@apache.org)
2021-06-09 12:01:12

*Thread Reply:*

Julien Le Dem (julien@apache.org)
2021-06-09 12:01:24

*Thread Reply:*

Michael Collado (collado.mike@gmail.com)
2021-06-09 12:01:55

*Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?

Kedar Rajwade (kedar@cloudzealous.com)
2021-06-09 12:01:58

*Thread Reply:* Thanks

Julien Le Dem (julien@apache.org)
2021-06-09 13:25:13

*Thread Reply:* it is 9am, thanks

Julien Le Dem (julien@apache.org)
2021-06-09 18:37:02

*Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive

🙌 Willy Lulciuc, Victor Shafran
Pedram (pedram@hightouch.io)
2021-06-10 13:53:18

Hi! Are there some 'close-to-real' sample events available to build off and compare to? I'd like to make sure what I'm outputting makes sense but it's hard when only comparing to very synthetic data.

👋 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-06-10 13:55:51

*Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help make usage a bit clearer. @Ross Turk / @Julien Le Dem might know when that will become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events

Pedram (pedram@hightouch.io)
2021-06-10 13:58:58

*Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.

{
  eventType: 'START',
  eventTime: '2021-06-09T08:45:00.395+00:00',
  run: { runId: '2821819' },
  job: {
    namespace: 'hightouch://my-workspace',
    name: 'hightouch://my-workspace/sync/123'
  },
  inputs: [
    { namespace: 'snowflake://abc1234', name: 'snowflake://abc1234/my_source_table' }
  ],
  outputs: [
    { namespace: 'salesforce://mysf_instance.salesforce.com', name: 'accounts' }
  ],
  producer: 'hightouch-event-producer-v.0.0.1'
}
{
  eventType: 'COMPLETE',
  eventTime: '2021-06-09T08:45:30.519+00:00',
  run: { runId: '2821819' },
  job: {
    namespace: 'hightouch://my-workspace',
    name: 'hightouch://my-workspace/sync/123'
  },
  inputs: [
    { namespace: 'snowflake://abc1234', name: 'snowflake://abc1234/my_source_table' }
  ],
  outputs: [
    { namespace: 'salesforce://mysf_instance.salesforce.com', name: 'accounts' }
  ],
  producer: 'hightouch-event-producer-v.0.0.1'
}

Pedram (pedram@hightouch.io)
2021-06-10 14:02:59

*Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.

For example, if the data goes from S3 -> Snowflake via Airflow and then from Snowflake -> Salesforce via Hightouch, this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage?

Willy Lulciuc (willy@datakin.com)
2021-06-17 19:13:14

*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid and I only have a few suggestions (a sketch applying them follows below):

  1. I would use a valid UUID for the run ID as the spec will standardize on that type, see https://github.com/OpenLineage/OpenLineage/pull/65
  2. You don’t need to provide the input dataset again on the COMPLETE event as the input datasets have already been associated with the run ID
  3. For the producer, I’d recommend using a link to the producer source code version to link the producer version with the OL event that was emitted.
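Putting those three suggestions together, the revised START event might look roughly like this sketch (the UUID and source-code URL are placeholders):

```python
# Sketch of the START event with the suggestions applied (values are placeholders)
revised_start_event = {
    "eventType": "START",
    "eventTime": "2021-06-09T08:45:00.395+00:00",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},  # 1. a valid UUID
    "job": {
        "namespace": "hightouch://my-workspace",
        "name": "hightouch://my-workspace/sync/123",
    },
    "inputs": [
        {"namespace": "snowflake://abc1234", "name": "snowflake://abc1234/my_source_table"}
    ],
    "outputs": [
        {"namespace": "salesforce://mysf_instance.salesforce.com", "name": "accounts"}
    ],
    # 3. producer as a link to a source code version
    "producer": "https://github.com/hightouchio/event-producer/tree/v0.0.1",
}
# 2. The matching COMPLETE event reuses the same runId and can omit "inputs".
```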
Willy Lulciuc (willy@datakin.com)
2021-06-17 19:13:59

*Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started

Willy Lulciuc (willy@datakin.com)
2021-06-17 19:18:19

*Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? Yes, the dataset and the namespace that it was registered under would have to be the same to properly build the lineage graph. We’re working on defining unique dataset names and have made some good progress in this area. I’d suggest reviewing the OL naming conventions if you haven’t already: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

🙌 Pedram
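To illustrate the point about matching identifiers, here is a sketch based on the naming conventions doc (the account, database, and table names are placeholders):

```python
# Sketch: both producers must emit the identical dataset identifier (placeholder values).
# Per the naming conventions: namespace = snowflake://{account}, name = {database}.{schema}.{table}
shared_dataset = {
    "namespace": "snowflake://abc1234",
    "name": "analytics.public.my_source_table",
}
# The Airflow job would list this dataset under "outputs";
# the Hightouch sync would list the same object under "inputs".
```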
Pedram (pedram@hightouch.io)
2021-06-19 01:09:27

*Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂

Willy Lulciuc (willy@datakin.com)
2021-06-22 15:14:39

*Thread Reply:* 🙂

Antonio Moctezuma (antoniomoctezuma@northwesternmutual.com)
2021-06-11 09:53:39

Hey everyone! I've been running into a minor OpenLineage issue and I was curious if anyone had any advice. According to the OpenLineage spec, it's suggested that for a dataset coming from S3, its namespace be in the form s3://<bucket>. We have implemented our code to do so, and RunEvents are published without issue, but when trying to retrieve the information of this RunEvent (like the job) I am unable to retrieve it by namespace from both /api/v1/namespaces/s3%3A%2F%2F<bucket name> (encoded, since : and / are special characters in URLs) and the beta endpoint /api/v1-beta/lineage?nodeId=<dataset>:<namespace>:<name>, and instead get a 400 error with an "Ambiguous Segment in URI" message.

Any and all advice would be super helpful! Thank you so much!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-06-11 10:16:41

*Thread Reply:* Sounds like the problem is with Marquez - might be worth opening an issue here: https://github.com/MarquezProject/marquez/issues

Antonio Moctezuma (antoniomoctezuma@northwesternmutual.com)
2021-06-11 10:25:58

*Thread Reply:* Thank you! Will do.

Julien Le Dem (julien@apache.org)
2021-06-11 15:31:41

*Thread Reply:* Thanks for reporting Antonio

Julien Le Dem (julien@apache.org)
2021-06-16 19:01:52

I have opened a proposal for versioning and publishing the spec: https://github.com/OpenLineage/OpenLineage/issues/63

Julien Le Dem (julien@apache.org)
2021-06-18 15:00:20

We have a nice OpenLineage website now. https://openlineage.io/ Thank you to contributors: @Ross Turk @Willy Lulciuc @Michael Collado!

❤️ Ross Turk, Kevin Mellott, Leo, Peter Hicks, Willy Lulciuc, Edgar Ramírez Mondragón, Maciej Obuchowski, Supratim Mukherjee
👍 Kedar Rajwade, Mukund
Leo (leorobinovitch@gmail.com)
2021-06-18 15:09:18

*Thread Reply:* Very nice!

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:08:43

Hi everyone! I'm trying to run a Spark job with OpenLineage and Marquez, but I'm getting some errors

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:09:28

*Thread Reply:* Here is the error...

21/06/20 11:02:56 WARN ArgumentParser: missing jobs in [, api, v1, namespaces, spark_integration] at 5
21/06/20 11:02:56 WARN ArgumentParser: missing runs in [, api, v1, namespaces, spark_integration] at 7
21/06/20 11:03:01 ERROR AsyncEventQueue: Listener SparkListener threw an exception
java.lang.NullPointerException
	at marquez.spark.agent.SparkListener.onJobEnd(SparkListener.java:165)
	at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:10:41

*Thread Reply:* Here is my code ...

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .master('local[1]') \
    .config('spark.jars.packages', 'io.github.marquezproject:marquez_spark:0.15.2') \
    .config('spark.extraListeners', 'marquez.spark.agent.SparkListener') \
    .config('openlineage.url', 'http://localhost:5000/api/v1/namespaces/spark_integration/') \
    .config('openlineage.namespace', 'spark_integration') \
    .getOrCreate()

# Suppress success files
spark.sparkContext._jsc.hadoopConfiguration().set('mapreduce.fileoutputcommitter.marksuccessfuljobs', 'false')
spark.sparkContext._jsc.hadoopConfiguration().set('parquet.summary.metadata.level', 'NONE')

df_source_trip = spark.read \
    .option('inferSchema', True) \
    .option('header', True) \
    .option('delimiter', '|') \
    .csv('/Users/bcanal/Workspace/poc-marquez/poc_spark/resources/data/source/trip.csv') \
    .createOrReplaceTempView('source_trip')

df_drivers = spark.table('source_trip') \
    .select('driver') \
    .distinct() \
    .withColumn('driver_name', lit('Bruno')) \
    .withColumnRenamed('driver', 'driver_id') \
    .createOrReplaceTempView('source_driver')

df = spark.sql(
    """
    SELECT d.*, t.*
    FROM source_trip t, source_driver d
    WHERE t.driver = d.driver_id
    """
)

df.coalesce(1) \
    .drop('driver_id') \
    .write.mode('overwrite') \
    .option('path', '/Users/bcanal/Workspace/poc-marquez/poc_spark/resources/data/target') \
    .saveAsTable('trip')
```

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:12:27

*Thread Reply:* After this execution, I can see just the source from the first dataframe, df_source_trip...

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:13:04

*Thread Reply:*

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:13:45

*Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:14:35

*Thread Reply:* I'm running Spark locally on my laptop, and I followed the Marquez getting started guide to bring it up

Bruno Canal (bcanal@gmail.com)
2021-06-20 10:14:44

*Thread Reply:* Can anyone help me?

Michael Collado (collado.mike@gmail.com)
2021-06-22 14:42:03

*Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add spark.sparkContext.setLogLevel('info') to the setup code, everything works reliably. Also works if you remove the master('local[1]') - at least when running in a notebook

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 13:48:34

@here Hi everyone,

👋 Willy Lulciuc
anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 13:49:10

I need to implement export functionality for my data lineage project.

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 13:50:26

As part of this, I need to convert the information fetched from the graph db (Neo4j) to CSV format and send it in the response.

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 13:51:21

Can someone please direct me to the CSV format of OpenLineage data?

Willy Lulciuc (willy@datakin.com)
2021-06-22 15:26:55

*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the JSON Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, HTTP backends like Marquez that support OL can consume them

Willy Lulciuc (willy@datakin.com)
2021-06-22 15:27:29

*Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. Depending on the exported CSV, I would translate the CSV to an OL event, see https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
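As a rough sketch of that translation (the CSV column names here are hypothetical, since no CSV layout is standardized):

```python
# Rough sketch: translate exported CSV rows into OpenLineage RunEvents
# (the CSV columns used here are hypothetical)
import csv
import json

def row_to_event(row):
    return {
        "eventType": row["event_type"],
        "eventTime": row["event_time"],
        "run": {"runId": row["run_id"]},
        "job": {"namespace": row["job_namespace"], "name": row["job_name"]},
        "inputs": json.loads(row["inputs"]),    # e.g. '[{"namespace": "...", "name": "..."}]'
        "outputs": json.loads(row["outputs"]),
        "producer": row["producer"],
    }

with open("lineage_export.csv", newline="") as f:
    events = [row_to_event(row) for row in csv.DictReader(f)]
```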

Willy Lulciuc (willy@datakin.com)
2021-06-22 15:29:58

*Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 23:33:05

*Thread Reply:* So far, what I understood about my requirement is: 1. my service will receive OL events

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 23:33:24

*Thread Reply:* 2. store it in graph db (neo4j)

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 23:38:28

*Thread Reply:* 3. this lineage information will be displayed in the UI, based on the request.

  4. Now my part in that is to implement an export functionality, so that someone can download it from the UI. In the UI there will be an option to download the report.
  5. So I need to fetch data from storage, convert it into CSV format, and send it to the UI.
  6. Then they can download the report from the UI.

So my question is: I have never seen what that CSV report should look like, so how do I achieve that? When I asked my team what the CSV should look like, they directed me to your website.

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-07-01 19:18:35

*Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed

anup agrawal (anup.agrawal500@gmail.com)
2021-06-22 13:51:51

I am very new to this... sorry for any silly questions

Willy Lulciuc (willy@datakin.com)
2021-06-22 20:29:22

*Thread Reply:* There are no silly questions! 😉

Abdulmalik AN (lord.of.d1@gmail.com)
2021-06-29 11:46:33

Hello, I have read every topic, listened to 4 talks, and heard the podcast episode about OpenLineage and Marquez. Given my basic understanding of the data engineering field, I have a couple of questions:
1- What are events and facets, and what is their purpose?
2- Can I implement the OpenLineage API in any software, or does the software need to be integrated with the OpenLineage API?
3- Can I say that OpenLineage is about observability and Marquez is about collecting and storing the metadata?
Thank you all for being cooperative.

👍 Stephen Pimentel, Kedar Rajwade
Willy Lulciuc (willy@datakin.com)
2021-07-01 19:07:27

*Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:

> What are events and facets and what are their purpose? An OpenLineage event is used to capture the lineage metadata at a point in time for a given run in execution. That is, the run's state transition, the inputs and outputs consumed/produced, and the job associated with the run are part of the event. The metadata defined in the event can then be consumed by an HTTP backend (as well as other transport layers). Marquez is an HTTP backend implementation that consumes OL events via a REST API call. The OL core model only defines the metadata that should be captured in the context of a run, while the processing of the event is up to the backend implementation consuming the event (think consumer / producer model here). For Marquez, the end-to-end lineage metadata is stored for pipelines (composed of multiple jobs) with built-in metadata versioning support. Now, for the second part of your question: the OL core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. I’d recommend checking out the getting started guide for OL 🙂

> Can I implement the OpenLineage API to any software? or does the software needs to be integrated with the OpenLineage API? Do you mean HTTP vs other protocols? Currently, OL defines an API spec for HTTP backends, that Marquez has adopted to ingest OL events. But there are also plans to support Kafka and many others.

> Can I say that OpenLineage is about observability and Marquez is about collecting and storing the metadata? > Thank you all for being cooperative. Yep! OL defines the metadata to collect for running jobs / pipelines that can later be used for root cause analysis / troubleshooting failing jobs, while Marquez is a metadata service that implements the OL standard to both consume and store lineage metadata while also exposing a REST API to query dataset, job and run metadata.

👍 Kedar Rajwade
Nic Colley (nic.colley@alation.com)
2021-06-30 17:46:52

Hi OpenLineage team! Has anyone got this working on databricks yet? I’ve been working on this for a few days and can’t get it to register lineage. I’ve attached my notebook in this thread.

silly question - does the jar file need to be on the cluster? Which versions of Spark does OpenLineage support?

Nic Colley (nic.colley@alation.com)
2021-06-30 18:16:58

*Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800

Nic Colley (nic.colley@alation.com)
2021-06-30 18:36:59

*Thread Reply:*

Michael Collado (collado.mike@gmail.com)
2021-07-01 13:45:42

*Thread Reply:* In your first cell, you have
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark.sparkContext.setLogLevel('info')
Unfortunately, the reference to sparkContext in the third line forces the initialization of the SparkContext, so that in the next cell your new configuration is ignored. In pyspark, you must initialize your SparkSession before any references to the SparkContext. It works if you remove the setLogLevel call from the first cell and make your 2nd cell
spark = SparkSession.builder \
    .config('spark.jars.packages', 'io.github.marquezproject:marquez_spark:0.15.2') \
    .config('spark.extraListeners', 'marquez.spark.agent.SparkListener') \
    .config('openlineage.url', 'https://domain.com') \
    .config('openlineage.namespace', 'my-namespace') \
    .getOrCreate()
spark.sparkContext.setLogLevel('info')

Samia Rahman (srahman@thoughtworks.com)
2021-06-30 19:26:42

How would one capture lineage for a job that's processing streaming data? Is that in scope for OpenLineage?

➕ Josh Quintus, Maciej Obuchowski
Willy Lulciuc (willy@datakin.com)
2021-07-01 16:32:18

*Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.

Willy Lulciuc (willy@datakin.com)
2021-07-01 16:32:57

*Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?

mohamed chorfa (chorfa672@gmail.com)
2021-06-30 20:33:25

👋 Hello everyone!

Willy Lulciuc (willy@datakin.com)
2021-07-01 16:24:16

*Thread Reply:* Welcome, @mohamed chorfa 👋. Let us know if you have any questions!

👍 mohamed chorfa
mohamed chorfa (chorfa672@gmail.com)
2021-07-03 19:37:58

*Thread Reply:* Really looking forward to following the evolution of the specification from raw data to the ML model

❤️ Julien Le Dem, Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-07-02 16:53:01

Hello OpenLineage community, we have been working on fleshing out the OpenLineage roadmap. See the currently prioritized efforts on GitHub: https://github.com/OpenLineage/OpenLineage/projects Please add your feedback to the roadmap by either commenting on the GitHub issues or opening new issues.

Julien Le Dem (julien@apache.org)
2021-07-02 17:04:13

In particular, I have opened an issue to finalize our mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

❤️ Ross Turk, Maciej Obuchowski, Peter Hicks
Julien Le Dem (julien@apache.org)
2021-07-07 19:53:17

*Thread Reply:* Based on community feedback, the new proposed mission statement is: “to enable the industry at-large to collect real-time lineage metadata consistently across complex ecosystems, creating a deeper understanding of how data is produced and used”

Julien Le Dem (julien@apache.org)
2021-07-07 20:23:24

I have updated the proposal for the spec versioning: https://github.com/OpenLineage/OpenLineage/issues/63

🙌 Willy Lulciuc
Jorik (jorik.blaas-sigmond@nn.nl)
2021-07-08 07:06:53

Hi all. I'm trying to get my bearings on OpenLineage. Love the concept. In our data transformation pipelines, output datasets are explicitly versioned (we have an incrementing snapshot id). Our storage layer (Delta Lake) also allows us to ingest 'older' versions of the same dataset, etc. If I understand it correctly, I would have to add some inputFacets and outputFacets to the run to store the actual version being referenced. Is that something that is currently available or on the roadmap, or is it something I could extend myself?

Julien Le Dem (julien@apache.org)
2021-07-08 18:57:44

*Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation

Julien Le Dem (julien@apache.org)
2021-07-08 18:59:00

*Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35

Jorik (jorik.blaas-sigmond@nn.nl)
2021-07-08 07:07:29

TL;DR: our database supports time-travel, and runs can be set up to use a specific point-in-time of an input. How do we make sure to keep that information within openlineage

Mariusz Górski (gorskimariusz13@gmail.com)
2021-07-09 02:23:29

Hi, on the subject of Spark integrations - I know that there is spark-marquez, but I was curious whether you also considered https://github.com/AbsaOSS/spline-spark-agent? It seems like this and spark-marquez are doing a similar thing, and maybe it would make sense to add OpenLineage support to the Spline Spark agent?

Mariusz Górski (gorskimariusz13@gmail.com)
2021-07-09 02:23:42

*Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-07-09 04:28:38

*Thread Reply:* @Michael Collado

👀 Michael Collado
Julien Le Dem (julien@apache.org)
2021-07-12 21:17:12

The OpenLineage Technical Steering Committee meetings are monthly on the second Wednesday, 9:00am to 10:00am US Pacific, and the link to join the meeting is https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09
The next meeting is this Wednesday. All are welcome.
• Agenda:
  ◦ Finalize the OpenLineage Mission Statement
  ◦ Review OpenLineage 0.1 scope
  ◦ Roadmap
  ◦ Open discussion
• Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
Notes are posted here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

🙌 Willy Lulciuc, Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2021-07-12 21:18:04

*Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite

Julien Le Dem (julien@apache.org)
2021-07-14 12:03:31

*Thread Reply:* It is starting now

Jiří Sedláček (yirie.sedlahczech@gmail.com)
2021-07-13 08:22:40

Hello, is it possible to track lineage at the column level? For example, for SQL like this:
CREATE TABLE T2 AS SELECT c1, c2 FROM T1;
I would like to record this lineage:
T1.C1 -- job1 --> T2.C1
T1.C2 -- job1 --> T2.C2
Would that be possible to record in OL format?

Jiří Sedláček (yirie.sedlahczech@gmail.com)
2021-07-13 08:29:52

(the important thing for me is to be able to tell that T1.C1 has no effect on T2.C2)

Julien Le Dem (julien@apache.org)
2021-07-14 17:00:12

I have updated the notes and added the link to the recording of the meeting this morning: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Julien Le Dem (julien@apache.org)
2021-07-14 17:04:18

*Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63

Julien Le Dem (julien@apache.org)
2021-07-14 17:04:33

*Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

Julien Le Dem (julien@apache.org)
2021-07-14 17:05:02

*Thread Reply:* for this one, please give explicit approval in the ticket

👍 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-07-14 21:10:42

*Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^

Julien Le Dem (julien@apache.org)
2021-07-27 18:58:35

*Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit

🙌 Willy Lulciuc
Jarek Potiuk (jarek@potiuk.com)
2021-07-16 01:25:56

Hi Everyone. I am a PMC member and committer of Apache Airflow. I watched the talk at the summit https://airflowsummit.org/sessions/2021/data-lineage-with-apache-airflow-using-openlineage/ and thought I might help (after the Summit is gone 🙂) with making OpenLineage/Marquez more seamlessly integrated in Airflow

❤️ Abe Gong, WingCode, Maciej Obuchowski, Ross Turk, Julien Le Dem, Michael Collado, Samia Rahman, mohamed chorfa
🙌 Maciej Obuchowski
👍 Jorik
Samia Rahman (srahman@thoughtworks.com)
2021-07-20 16:38:38

*Thread Reply:* The demo in this does not really use the OpenLineage spec, does it?

Did I miss something? The API shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec?

Samia Rahman (srahman@thoughtworks.com)
2021-07-20 18:09:01

*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it a TransformationJobFacet or a ProcessJobFacet, such that any logic in the appropriate language can be described? Am I misinterpreting the intention of the SQLJobFacet - is it meant to capture the logic that runs for a job?

Willy Lulciuc (willy@datakin.com)
2021-07-26 19:06:43

*Thread Reply:* > The demo in this does not really use the openlineage spec does it? @Samia Rahman In our Airflow talk, the demo used the marquez-airflow lib that sends OpenLineage events to Marquez. You can check out how Airflow works with OpenLineage + Marquez here: https://openlineage.io/integration/apache-airflow/

Willy Lulciuc (willy@datakin.com)
2021-07-26 19:07:51

*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec? Yes, Marquez ingests OpenLineage events that conform to the spec via its REST API. Hope this helps!

Kenton (swiple.io) (kknoxparton@gmail.com)
2021-07-21 07:52:32

Hi all, does OpenLineage intend on creating lineage off of query logs?

From what I have read, there are a number of supported integrations but none that cater to regular SQL based ETL. Is this on the OpenLineage roadmap?

Willy Lulciuc (willy@datakin.com)
2021-07-26 18:54:46

*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture:
https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/

Willy Lulciuc (willy@datakin.com)
2021-07-26 18:57:59

*Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:

  1. Stream query logs to a topic using the JDBC source connector, then have a consumer read the query logs off the topic, parse the logs, then stream the result of the query parsing to another topic as an OpenLineage event (a rough sketch of this follows below)
  2. Add direct support for OpenLineage to the JDBC connector or any other application you plan to use to read the query logs.
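A rough sketch of option 1, assuming kafka-python and a JSON query-log topic; the topic names, log schema, and parse_sql helper are all hypothetical, and the SQL parsing itself (the hard part) is left as a placeholder:

```python
# Sketch of option 1: consume raw query logs, emit OpenLineage events to another topic
# (topic names, log schema, and parse_sql() are hypothetical)
import json
import uuid
from datetime import datetime, timezone
from kafka import KafkaConsumer, KafkaProducer  # kafka-python

consumer = KafkaConsumer("query-logs", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def parse_sql(sql):
    """Placeholder: a real implementation must parse the query into
    (input_datasets, output_datasets) lists of {"namespace", "name"} dicts."""
    return [], []

for message in consumer:
    log = json.loads(message.value)
    inputs, outputs = parse_sql(log["query_text"])
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "warehouse", "name": log["query_id"]},
        "inputs": inputs,
        "outputs": outputs,
        "producer": "https://github.com/my-org/query-log-parser",
    }
    producer.send("openlineage-events", json.dumps(event).encode("utf-8"))
```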
Willy Lulciuc (willy@datakin.com)
2021-07-26 19:01:31

*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.

Kenton (swiple.io) (kknoxparton@gmail.com)
2021-08-05 12:01:55

*Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/

Willy Lulciuc (willy@datakin.com)
2021-09-21 20:22:26

*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize this 💯

Samia Rahman (srahman@thoughtworks.com)
2021-07-21 08:49:23

The OpenLineage implementations for the Airflow and Spark integrations currently live in the Marquez repo. My understanding from the OpenLineage scope is that integration implementations are in scope for OpenLineage - are the Spark integrations going to be moved to OpenLineage?

Ross Turk (ross@datakin.com)
2021-07-21 11:35:12

@Samia Rahman Yes, that is the plan. For details you can see https://github.com/OpenLineage/OpenLineage/issues/73

🙌 Samia Rahman, Willy Lulciuc
Samia Rahman (srahman@thoughtworks.com)
2021-07-21 18:13:11

I have a question about the SQLJobFacet in the job schema - isn't it better to call it a TransformationJobFacet or a ProcessJobFacet, such that any logic in the appropriate language can be described? It could be Scala or Python code that runs in the job and processes streaming or batch data. Am I misinterpreting the intention of the SQLJobFacet - is it meant to capture the logic that runs for a job?

Willy Lulciuc (willy@datakin.com)
2021-07-21 18:22:01

*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it's not intended to capture the code being executed, but rather just the SQL if it's present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.

Willy Lulciuc (willy@datakin.com)
2021-07-21 18:23:03

*Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet, which builds the link to the source in version control.
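Concretely, the two facets side by side on a job might look like this sketch (values are placeholders, and the facet envelope fields _producer / _schemaURL required by the spec are omitted for brevity):

```python
# Sketch: a job carrying both facets (placeholder values)
job = {
    "namespace": "my-namespace",
    "name": "daily_revenue",
    "facets": {
        # SQLJobFacet: present only for SQL-based jobs, used for display
        "sql": {
            "query": "SELECT order_id, SUM(amount) FROM orders GROUP BY order_id"
        },
        # SourceCodeLocationJobFacet: links the job to its versioned source
        "sourceCodeLocation": {
            "type": "git",
            "url": "https://github.com/my-org/pipelines/blob/abc123/jobs/daily_revenue.sql",
        },
    },
}
```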

Julien Le Dem (julien@apache.org)
2021-07-22 17:56:41

The process started a few months back when the LF AI & Data voted to accept OpenLineage into the foundation. It is now official: OpenLineage has joined the LF AI & Data Foundation. https://lfaidata.foundation/blog/2021/07/22/openlineage-joins-lf-ai-data-as-new-sandbox-project/

🙌 Ross Turk, Luke Smith, Maciej Obuchowski, Gyan Kapur, Dr Daniel Smith, Jarek Potiuk, Peter Hicks, Kedar Rajwade, Abe Gong, Damian Warszawski, Willy Lulciuc
❤️ Ross Turk, Jarek Potiuk, Peter Hicks, Abe Gong, Willy Lulciuc
🎉 Laurent Paris, Rifa Achrinza, Minkyu Park, Peter Hicks, mohamed chorfa, Jarek Potiuk, Abe Gong, Damian Warszawski, Willy Lulciuc, James Le
👏 Matt Turck
Namron (ian.norman@avanade.com)
2021-07-29 11:20:17

Hi, I am trying to create lineage between two datasets. Following the spec, I can see the syntax for declaring the input and output datasets, and for creating the associated Job (which I take to be the process in the middle joining the two datasets together). What I can't see is where in the specification to relate the job to the inputs and outputs. Do you have an example of this?

Michael Collado (collado.mike@gmail.com)
2021-07-30 17:24:44

*Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143
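To make that concrete, here is a rough sketch of a single RunEvent: the job and its inputs/outputs travel together in one event, and the backend derives the relationship from that (all names below are placeholders):

```python
# Sketch: one RunEvent carries the job plus its inputs and outputs (placeholder values)
run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-07-29T10:00:00.000Z",
    "run": {"runId": "3f5e83fa-3480-44bd-80fa-20d8e3d37caa"},
    "job": {"namespace": "my-namespace", "name": "join_datasets"},
    "inputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "public.dataset_a"},
        {"namespace": "postgres://db.example.com:5432", "name": "public.dataset_b"},
    ],
    "outputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "public.joined"}
    ],
    "producer": "https://github.com/my-org/my-producer",
}
```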

Julien Le Dem (julien@apache.org)
2021-08-03 15:06:58

the OutputStatistics facet PR is updated based on your comments @Michael Collado https://github.com/OpenLineage/OpenLineage/pull/114

🙌 Michael Collado
Michael Collado (collado.mike@gmail.com)
2021-08-03 15:11:56

*Thread Reply:*
        /|~~~
      ///|
    /////|
  ///////|
/////////|
\==========|===/
~~~~~~~~~~~~~~~~~~~~~

Julien Le Dem (julien@apache.org)
2021-08-03 19:59:03

*Thread Reply:* ⛵

Julien Le Dem (julien@apache.org)
2021-08-03 19:59:38

I have updated the DataQuality metrics proposal and the corresponding PR: https://github.com/OpenLineage/OpenLineage/issues/101 https://github.com/OpenLineage/OpenLineage/pull/115

🙌 Willy Lulciuc, Bruno González
💯 Willy Lulciuc, Dominique Tipton
Oleksandr Dvornik (oleksandr.dvornik@getindata.com)
2021-08-04 10:42:48

Guys, I've merged the CircleCI publish-snapshot PR

Snapshots can be found below:
https://datakin.jfrog.io/artifactory/maven-public-libs-snapshot-local/io/openlineage/openlineage-java/0.0.1-SNAPSHOT/ openlineage-java-0.0.1-20210804.142910-6.jar
https://datakin.jfrog.io/artifactory/maven-public-libs-snapshot-local/io/openlineage/openlineage-spark/0.1.0-SNAPSHOT/ openlineage-spark-0.1.0-20210804.143452-5.jar

Build on main passed (edited)

🎉 Julien Le Dem
Julien Le Dem (julien@apache.org)
2021-08-04 23:08:08

I added a mechanism to enforce spec versioning per: https://github.com/OpenLineage/OpenLineage/issues/63 https://github.com/OpenLineage/OpenLineage/pull/140

Ben Teeuwen-Schuiringa (ben.teeuwen@booking.com)
2021-08-05 10:02:49

Hi all, at Booking.com we’re using Spline to extract granular lineage information from Spark jobs, to be able to trace lineage at the column level along with the operations in between. We wrote a custom Python parser to create a graph-like structure that is sent into ArangoDB. But tbh, the process is far from stable and is not able to quickly answer questions like ‘which root input columns are used to construct column x?’.

My impression of OpenLineage thus far is that it’s focusing on less granular, table-level input-output information. Is anyone here trying to accomplish something similar at the column level?

Luke Smith (luke.smith@kinandcarta.com)
2021-08-05 12:56:48

*Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.

Julien Le Dem (julien@apache.org)
2021-08-05 14:46:44

*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage. To capture the column level lineage, you would want to add a ColumnLineage facet to the Output dataset facets. Which is something that is needed in the spec. Here is a proposal, please chime in: https://github.com/OpenLineage/OpenLineage/issues/148 Is this something you would be interested to do?
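For reference, the shape being discussed in that proposal is roughly a per-output-field mapping back to input fields; something like this sketch (not a finalized schema):

```python
# Sketch of a possible column-level lineage facet on an output dataset
# (field layout follows the proposal in issue #148; not a finalized schema)
output_dataset_facets = {
    "columnLineage": {
        "fields": {
            "c1": {"inputFields": [{"namespace": "db", "name": "T1", "field": "c1"}]},
            "c2": {"inputFields": [{"namespace": "db", "name": "T1", "field": "c2"}]},
        }
    }
}
# With this, a consumer can answer "which input columns feed T2.c2" directly.
```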

Julien Le Dem (julien@apache.org)
2021-08-09 19:49:51

*Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.

Ben Teeuwen-Schuiringa (ben.teeuwen@booking.com)
2021-08-18 08:09:38

*Thread Reply:* Does anyone know if the Spline developers are in this slack group?

Ben Teeuwen-Schuiringa (ben.teeuwen@booking.com)
2022-08-03 03:07:56

*Thread Reply:* @Luke Smith how have things progressed on your side the past year?

Julien Le Dem (julien@apache.org)
2021-08-09 19:39:28

I have opened an issue to track the facet versioning discussion: https://github.com/OpenLineage/OpenLineage/issues/153

Julien Le Dem (julien@apache.org)
2021-08-09 20:16:18

I have updated the agenda for the OpenLineage monthly TSC meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting (meeting information below for reference; you can also DM me your email to get added to a google calendar invite)

The OpenLineage Technical Steering Committee meetings are monthly on the second Wednesday, 9:00am to 10:00am US Pacific, and the link to join the meeting is https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome.

Aug 11th 2021
• Agenda:
  ◦ Coming in OpenLineage 0.1
    ▪︎ OpenLineage spec versioning
    ▪︎ Clients
  ◦ Marquez integrations imported in OpenLineage
    ▪︎ Apache Airflow:
      • BigQuery
      • Postgres
      • Snowflake
      • Redshift
      • Great Expectations
    ▪︎ Apache Spark
    ▪︎ dbt
  ◦ OpenLineage 0.2 scope discussion
    ▪︎ Facet versioning mechanism
    ▪︎ OpenLineage Proxy Backend ()
    ▪︎ Kafka client
  ◦ Roadmap
  ◦ Open discussion
• Slides: https://docs.google.com/presentation/d/1Lxp2NB9xk8sTXOnT0_gTXicKX5FsktWa/edit#slide=id.ge80fbcb367_0_14

🙌 Willy Lulciuc, Maciej Obuchowski, Dr Daniel Smith
💯 Willy Lulciuc, Dr Daniel Smith
Julien Le Dem (julien@apache.org)
2021-08-11 10:05:27

*Thread Reply:* Just a reminder that this is in 2 hours

Julien Le Dem (julien@apache.org)
2021-08-11 18:50:32

*Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Julien Le Dem (julien@apache.org)
2021-08-11 18:51:19
Daniel Avancini (dpavancini@gmail.com)
2021-08-11 13:30:52

Hi guys, great discussion today. Something we are particularly interested in is the integration with Airflow 2. I've been searching the Marquez and OpenLineage repos and I couldn't find a clear answer on the status of that. I did some work locally to update the marquez-airflow package, but I would like to know if someone else is working on this - maybe we could help too.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-08-11 13:36:43

*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

Daniel Avancini (dpavancini@gmail.com)
2021-08-11 13:48:36

*Thread Reply:* Thank you Maciej. I'll take a look

Julien Le Dem (julien@apache.org)
2021-08-11 20:37:09

I have migrated the Marquez issues related to OpenLineage integrations to the OpenLineage repo

Julien Le Dem (julien@apache.org)
2021-08-13 19:02:54

And OpenLineage 0.1.0 is out ! https://github.com/OpenLineage/OpenLineage/releases/tag/0.1.0

🙌 Peter Hicks, Maciej Obuchowski, Willy Lulciuc, Oleksandr Dvornik, Luke Smith, Daniel Avancini, Matt Gee
❤️ Willy Lulciuc, Matt Gee
Oleksandr Dvornik (oleksandr.dvornik@getindata.com)
2021-08-16 11:42:24

PR ready for review

👍 Willy Lulciuc
Luke Smith (luke.smith@kinandcarta.com)
2021-08-20 13:54:08

Anyone have experience parsing spark's logical plan to generate column-level lineage and DAGs with more human readable operations? I assume I could recreate a graph like the one below using the spark.logicalPlan facet. The analysts writing the SQL / spark queries aren't familiar with ShuffledRowRDD , MapPartitionsRDD, etc... It'd be better if I could convert this plan into spark SQL (or capture spark SQL as a facet at runtime).

Michael Collado (collado.mike@gmail.com)
2021-08-26 16:46:53

*Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that
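For anyone wanting to poke at what Spark exposes here, a small sketch (this assumes an existing SparkSession named spark; explain prints the plans rather than returning SQL, and the _jdf access is PySpark-internal):

```python
# Sketch: inspect the plans Spark holds for a DataFrame (table names are placeholders)
df = spark.sql(
    "SELECT t.driver, d.driver_name "
    "FROM source_trip t JOIN source_driver d ON t.driver = d.driver_id"
)

# Prints the parsed/analyzed/optimized logical plans plus the physical plan
df.explain(extended=True)

# The JVM-side logical plan object, which is what the listener receives
print(df._jdf.queryExecution().logical().toString())
```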

Luke Smith (luke.smith@kinandcarta.com)
2021-08-20 13:57:41
Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 14:26:35

👋 Hi everyone!

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 14:27:00

Nice to e-meet you 🙂 I want to use the OpenLineage integration for Spark in my Azure Databricks clusters, but I am having problems with the configuration of the listener in the cluster. I was wondering if you could help me: if you know of any tutorial for integrating Spark with Azure Databricks, or a more specific guide for this scenario, I would really appreciate it.

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 14:27:33

I added this configuration to my cluster:

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 14:28:37

I receive this error message:

Willy Lulciuc (willy@datakin.com)
2021-08-31 14:30:00

*Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)

Luke Smith (luke.smith@kinandcarta.com)
2021-08-31 14:43:20

*Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we

  1. copy the OL JAR to a staging directory on DBFS (we use /dbfs/databricks/init/lineage)
  2. using an init script, copy the JAR from the staging directory to the default JAR location for the Databricks driver -- /mnt/driver-daemon/jars
  3. Within the same init script, write the spark config parameters to a .conf file in /databricks/driver/conf (we use open-lineage.conf). The .conf file will be read by the driver on initialization. It should follow this format (lineage_host_url should point to your API):
[driver] {
  "spark.jars" = "/mnt/driver-daemon/jars/openlineage-spark-0.1-SNAPSHOT.jar"
  "spark.extraListeners" = "com.databricks.backend.daemon.driver.DBCEventLoggingListener,openlineage.spark.agent.OpenLineageSparkListener"
  "spark.openlineage.url" = "$lineage_host_url"
}
Your cluster must be configured to call the init script (enabling lineage for the entire cluster). OL is not friendly to notebook-level init as far as we can tell.

@Willy Lulciuc -- I have some utils and init script templates that simplify this process (a rough sketch follows below). May be worth adding them to the OL repo along with a readme.

🙏 Erick Navarro
❤️ Erick Navarro
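As a rough sketch of those steps combined (paths, JAR name, and URL are placeholders; run once from a notebook to create a cluster-scoped init script):

```python
# Sketch: write the init script to DBFS from a notebook (placeholder paths/URL)
init_script = """#!/bin/bash
# copy the staged OL jar into the driver's default jar location
cp /dbfs/databricks/init/lineage/openlineage-spark-0.1-SNAPSHOT.jar /mnt/driver-daemon/jars/

# write the driver conf that is read on initialization
cat > /databricks/driver/conf/open-lineage.conf <<'EOF'
[driver] {
  "spark.jars" = "/mnt/driver-daemon/jars/openlineage-spark-0.1-SNAPSHOT.jar"
  "spark.extraListeners" = "com.databricks.backend.daemon.driver.DBCEventLoggingListener,openlineage.spark.agent.OpenLineageSparkListener"
  "spark.openlineage.url" = "https://my-lineage-host/api/v1/lineage"
}
EOF
"""
dbutils.fs.put("dbfs:/databricks/init/lineage/open-lineage-init.sh", init_script, True)
# Then point the cluster's init script setting at that DBFS path.
```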
Willy Lulciuc (willy@datakin.com)
2021-08-31 14:51:46

*Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?

Michael Collado (collado.mike@gmail.com)
2021-08-31 14:57:02

*Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener

Michael Collado (collado.mike@gmail.com)
2021-08-31 15:11:37

*Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞

Michael Collado (collado.mike@gmail.com)
2021-08-31 15:11:53
Luke Smith (luke.smith@kinandcarta.com)
2021-08-31 15:59:38

*Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url ) and install the library on the cluster, but then enable tracking at a notebook level with:

%scala
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()

In OL, it would be nice to install and configure OL at a cluster level, but to enable it at a notebook level. This way, users could control whether all notebooks run on a cluster emit lineage or just those with lineage explicitly enabled.

Michael Collado (collado.mike@gmail.com)
2021-08-31 16:01:00

*Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level

👍 Luke Smith
Luke Smith (luke.smith@kinandcarta.com)
2021-08-31 16:03:50

*Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.

Michael Collado (collado.mike@gmail.com)
2021-08-31 16:10:42

*Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have

👍 Luke Smith
Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 18:26:11

*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me. The cluster is running and apparently it fetches the configuration. This is my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!

Now I have this:

Luke Smith (luke.smith@kinandcarta.com)
2021-08-31 18:52:15

*Thread Reply:* Is this error thrown during init or job execution?

Michael Collado (collado.mike@gmail.com)
2021-08-31 18:55:30

*Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 19:59:15

*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario: the job that I executed was empty. Now the cluster is running OK and I don't have errors; I have run some jobs successfully, but I don't see any information in my Datakin explorer.

Willy Lulciuc (willy@datakin.com)
2021-08-31 20:00:46

*Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-08-31 20:01:17

*Thread Reply:* Yes Willy, thank you!

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-09-02 10:06:00

*Thread Reply:* Hi , @Luke Smith, thank you for your help, are you familiar with this error in azure databricks when you use OL?

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-09-02 10:07:07

*Thread Reply:*

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-09-02 10:17:17

*Thread Reply:* I found the solution here: https://docs.microsoft.com/en-us/answers/questions/170730/handshake-fails-trying-to-connect-from-azure-datab.html

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-09-02 10:17:28

*Thread Reply:* It works now! 😄

👍 Luke Smith, Maciej Obuchowski, Minkyu Park, Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-09-02 16:33:01

*Thread Reply:* @Erick Navarro This might be a helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂

Erick Navarro (Erick.Navarro@gt.ey.com)
2021-09-02 19:59:10

*Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂

Willy Lulciuc (willy@datakin.com)
2021-09-02 20:44:36

*Thread Reply:* Awesome. Thanks!

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-02 03:47:35

Hello everyone! I am currently evaluating OpenLineage and am finding it very interesting as Prefect is in the list of integrations. However, I am not seeing any documentation or code for this. How far are you from supporting Prefect?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-02 04:57:55

*Thread Reply:* Hey! If you mean this picture, it illustrates the concept of how OpenLineage works, not the current state of the integrations. We don't have Prefect support yet; however, it's on our roadmap.

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-02 05:22:15

*Thread Reply:* great, thanks 🙂

Julien Le Dem (julien@apache.org)
2021-09-02 11:49:48

*Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.

Luke Smith (luke.smith@kinandcarta.com)
2021-09-02 13:13:05

What's the timeline to support spark 3.0 within OL? One breaking change we've found is within DatasetSourceVisitor.java -- the DataSourceV2 is deprecated in spark 3.0. There may be other issues we haven't found yet. Is there a good feel for the scope of work required to make OL spark 3.0 compatible?

Julien Le Dem (julien@apache.org)
2021-09-02 14:28:11

*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test to the build so that we run tests for both Spark 2.4 and Spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we'd want to add test cases for.

Julien Le Dem (julien@apache.org)
2021-09-02 14:36:19

*Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.

🙌 Luke Smith
Luke Smith (luke.smith@kinandcarta.com)
2021-09-02 15:50:14

*Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.

Michael Collado (collado.mike@gmail.com)
2021-09-02 17:33:08

*Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?

Luke Smith (luke.smith@kinandcarta.com)
2021-09-03 12:36:20

*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.

Luke Smith (luke.smith@kinandcarta.com)
2021-09-03 13:42:43

*Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:

  1. When attempting to do delta I/O with Spark 3 on Databricks, e.g. INSERT INTO . . . VALUES . . ., we get an error related to DataSourceV2: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.source()Lorg/apache/spark/sql/sources/v2/DataSourceV2;
  2. Using Spline, which is Spark 3 compatible, we have issues with the way Databricks handles delta table io. This is related: https://github.com/AbsaOSS/spline-spark-agent/issues/96

So there are two stacked issues related to spark 3 on Databricks with delta IO, not just one. Hope this clears things up.

Michael Collado (collado.mike@gmail.com)
2021-09-03 13:44:54

*Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?

Luke Smith (luke.smith@kinandcarta.com)
2021-09-03 13:45:49

*Thread Reply:* Yes, that's my read of what I'm getting from others on the team.

Michael Collado (collado.mike@gmail.com)
2021-09-03 13:46:56

*Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO... ? Is it a data source defined in Databricks? a Hive table? a view on GCS?

Michael Collado (collado.mike@gmail.com)
2021-09-03 13:47:40

*Thread Reply:* oh, it's a Delta table?

Luke Smith (luke.smith@kinandcarta.com)
2021-09-03 14:48:15

*Thread Reply:* Yes, it's created via

CREATE TABLE . . . using DELTA location "/dbfs/mnt/ . . . "

Julien Le Dem (julien@apache.org)
2021-09-02 14:28:53

I have opened a PR to fix some outdated language in the spec: https://github.com/OpenLineage/OpenLineage/pull/241 Thank you @Mandy Chessell for the feedback

Julien Le Dem (julien@apache.org)
2021-09-02 14:37:27

The next OpenLineage monthly meeting is next week. Please chime in this thread if you’d like something added to the agenda

🙌 Willy Lulciuc
marko (marko.kristian.helin@gmail.com)
2021-09-04 12:53:54

*Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?

Julien Le Dem (julien@apache.org)
2021-09-05 13:06:19

*Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.

Julien Le Dem (julien@apache.org)
2021-09-07 21:04:09

*Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

Julien Le Dem (julien@apache.org)
2021-09-07 21:03:08

I have prepared slides for the OpenLineage meeting tomorrow morning: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

Julien Le Dem (julien@apache.org)
2021-09-07 21:03:32

*Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)

🙌 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-09-07 21:05:13

*Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Julien Le Dem (julien@apache.org)
2021-09-08 14:49:52

*Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Venkatesh Tadinada (venkat@mlacademy.io)
2021-09-08 21:58:09

*Thread Reply:* Good meeting today. @Julien Le Dem. Thanks

Shreyas Kaushik (shreyask@gmail.com)
2021-09-08 04:03:29

Hello, was looking to get some lineage out for BQ in my Airflow DAGs and saw that the BQ extractor here - https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/bigquery_extractor.py#L47 is using an operator that has been deprecated by Airflow - https://github.com/apache/airflow/blob/main/airflow/contrib/operators/bigquery_operator.py#L44 and most of my DAGs are using the operator BigQueryExecuteQueryOperator mentioned there. I presume that with this, lineage extraction wouldn’t work, and some work is needed to support both these operators with the same (or different) extractor. Is that correct or am I missing something?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-08 04:27:04

*Thread Reply:* We're working on updating our integration to Airflow 2. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

Shreyas Kaushik (shreyask@gmail.com)
2021-09-08 04:27:38

*Thread Reply:* Thanks @Maciej Obuchowski When is this expected to land in a release ?

Daniel Zagales (dzagales@gmail.com)
2021-11-11 06:35:24

*Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator

Julien Le Dem (julien@apache.org)
2021-09-11 22:30:31

The PR to separate facets in their own file (and allowing versioning them independently) is now available: https://github.com/OpenLineage/OpenLineage/pull/118

Jose Badeau (jose.badeau@gmail.com)
2021-09-13 03:46:20

Hi, new to the channel but I think OL is a great initiative. Currently we are focused on beam/spark/delta but are moving to beam/flink/iceberg and I’m happy to help where I can.

Willy Lulciuc (willy@datakin.com)
2021-09-13 15:40:01

*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear as we have Beam, Flink and Iceberg on our roadmap! You’re welcome to join the discussion :)

Julien Le Dem (julien@apache.org)
2021-09-13 20:56:11

Per the discussion last week, Ryan updated the metadata that would be available in Iceberg: https://github.com/OpenLineage/OpenLineage/issues/167#issuecomment-917237320

Julien Le Dem (julien@apache.org)
2021-09-13 21:00:54

I have also created tickets for follow up discussions: (#269 and #270): https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 04:50:22

Hello. I find OpenLineage an interesting tool however can someone help me with integration?

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 04:52:50

I am trying to capture lineage from Spark 3.1.1, but when executing I constantly get:
```
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2.writer()Lorg/apache/spark/sql/sources/v2/writer/DataSourceWriter;
  at openlineage.spark.agent.lifecycle.plan.DatasetSourceVisitor.findDatasetSource(DatasetSourceVisitor.java:57)
```
as if I were using openlineage on the wrong Spark version (2.4). I have also tried the Spark jar from the branch feature/itspark3. Is there any branch or release that works or can be tried with Spark 3+?

Oleksandr Dvornik (oleksandr.dvornik@getindata.com)
2021-09-14 05:03:45

*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us track and solve it. We are currently adding integration tests. The next step will be to fix the changed method signatures for v3 (which is exactly what you're hitting).

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 05:12:45

*Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272

👍 Oleksandr Dvornik
Tomas Satka (satka.tomas@gmail.com)
2021-09-14 08:47:39

I also tried downgrading Spark to 2.4.0 and retrying with 0.2.2, but I faced an issue there too.. so my preferred way would be to push for Spark 3.1.1, but that depends a bit on when you plan to release a version supporting it. As a backup plan I would try Spark 2.4.0, but this is blocking me there as well: https://github.com/OpenLineage/OpenLineage/issues/274

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 08:55:44

*Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 08:56:10

*Thread Reply:* Can you try a newer version in the 2.4.* line, like 2.4.7?

👀 Tomas Satka
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 08:57:30

*Thread Reply:* This might also be a Spark 2.4 with Scala 2.12 issue - I'd recommend the 2.11 versions.

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:04:26

*Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:04:27

*Thread Reply:*
```
21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: config$1
  at java.base/java.lang.Class.getDeclaredField(Class.java:2411)
```

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:04:48

*Thread Reply:* i can also try to switch to 2.11 scala

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:05:37

*Thread Reply:* or do you have some recommended setup that works for sure?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 09:09:58

*Thread Reply:* One more check - you're using Java 8 with this, right?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 09:10:17

*Thread Reply:* This is what works for me:
```
-> % cat tools/spark-2.4/RELEASE
Spark 2.4.8 (git revision 4be4064) built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-14 09:11:23

*Thread Reply:* spark-shell: Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:12:05

*Thread Reply:* awesome let me try 🙂

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 09:26:00

*Thread Reply:* data has been sent to marquez. coolio. however i noticed nullpointer being thrown:
```
21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
  at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:164)
  at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
```

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 10:59:45

*Thread Reply:* closed related issue #274

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 11:02:42

does openlineage capture streaming in spark? as this example is not showing me anything unless i replace readStream() with batch read() and writeStream() with write()
```
SparkSession.Builder builder = SparkSession.builder();
SparkSession session = builder
        .appName("quantweave")
        .master("local[*]")
        .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.2")
        .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
        .config("spark.openlineage.url", "http://localhost:5000/api/v1/namespaces/spark_integration/")
        .getOrCreate();

Dataset<Row> df = session
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "topic1")
        .option("startingOffsets", "earliest")
        .load();

Dataset<Row> dff = df
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as("data");

dff
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "topic2")
        .option("checkpointLocation", "/tmp/checkpoint")
        .start();
```
Julien Le Dem (julien@apache.org)
2021-09-14 13:38:09

*Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.

Oleksandr Dvornik (oleksandr.dvornik@getindata.com)
2021-09-14 15:12:01

*Thread Reply:* @Tomas Satka it would be great if you could add a containerized integration test for kafka with your test case. You can take this as an example here

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 18:02:05

*Thread Reply:* Hi @Oleksandr Dvornik i wrote a test for a simple read/write from a kafka topic using a kafka testcontainer. However i discovered a bug. When writing to a kafka topic I get: java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribePattern, assign. See the docs for more details.

• How would you like me to add the test? Fork openlineage and create a PR?

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 18:02:50

*Thread Reply:* • Shall I raise a bug for writing to kafka, which should only need "topic" instead of "subscribe"?

Tomas Satka (satka.tomas@gmail.com)
2021-09-14 18:03:42

*Thread Reply:* • Since I don't know the expected payload for the openlineage mock server, can somebody help me create it?

Oleksandr Dvornik (oleksandr.dvornik@getindata.com)
2021-09-14 19:06:41

*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at the contributing guide. Not sure about kafka, cause we don't have that integration yet. About the expected payload: as a first step, I would suggest leaving that test without assertions for now. The second step would be investigation (what we can get from that plan node). Third step - implementation and asserting on a payload. Basically we parse the Spark optimized plan, and get as much information as we can for the specific implementation. You can take a look at the recent PR for HIVE. We visit the root node and the leaves to get output datasets and input datasets accordingly.

Comments
1
Tomas Satka (satka.tomas@gmail.com)
2021-09-15 04:37:59

*Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279

👍 Oleksandr Dvornik
🙌 Julien Le Dem
Luke Smith (luke.smith@kinandcarta.com)
2021-09-14 15:52:41

There may not be an answer to these questions yet, but I'm curious about the plan for Tableau lineage.

• How will this integration be packaged and attached to Tableau instances?
  ◦ via Extensions API, REST API?
• What is the architecture?
https://github.com/OpenLineage/OpenLineage/issues/78

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-15 01:58:37

Hi everyone - Following up on my previous post on prefect. The technical integration does not seem very difficult, but I am wondering about how to structure the lineage logic. Is it the case that each prefect task should be mapped to a lineage job? If so, how do we connect the jobs together? Does there have to be a dataset between each job? I am using OpenLineage with Marquez, by the way

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 09:19:23

*Thread Reply:* Hey Thomas!

Following what we do with Airflow, yes, I think that each task should be mapped to a job.

You don't need datasets between all tasks. They're necessary only where you consume and produce datasets - and it does not matter where in your job graph you've produced them.

To map tasks together in Airflow, we use ParentRunFacet, and the same approach could be used here. In Prefect, I think using flow_run_id would work.
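For illustration, a minimal sketch of that idea with the openlineage-python client (the exact client API may differ between versions, and all the namespace/name/id values here are made up):
```python
from openlineage.client import OpenLineageClient
from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

flow_run_id = "3bce33cb-9495-4c58-b326-6aac71634ace"   # Prefect's flow_run_id
task_run_id = "8d2c5c9f-0f1a-4b5e-9c1a-1234567890ab"   # this task's run id

# Each task becomes a job; the flow run id ties them together as the parent run.
parent = ParentRunFacet.create(
    runId=flow_run_id,
    namespace="my-prefect-instance",
    name="my_flow",
)
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime="2021-09-15T13:00:00.000+00:00",
    run=Run(runId=task_run_id, facets={"parent": parent}),
    job=Job(namespace="my-prefect-instance", name="my_flow.my_task"),
    producer="https://github.com/OpenLineage/OpenLineage/tree/main/integration/prefect",
))
```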

👍 Julien Le Dem
Thomas Fredriksen (thomafred90@gmail.com)
2021-09-15 09:26:21

*Thread Reply:* this is very helpful, thank you

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-15 09:26:43

*Thread Reply:* what would be the namespace used in the Job definition of each task?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 09:31:34

*Thread Reply:* In contrast to dataset namespaces - which we try to standardize - job namespaces should be provided by the user, or the operator of a particular scheduler.

For example, it would be good if it helped you identify the Prefect instance where the job was run.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 09:32:23

*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor, or via the OPENLINEAGE_NAMESPACE env variable.

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-15 09:32:55

*Thread Reply:* awesome, thank you 🙂

Brad (bradley.mcelroy@live.com)
2021-09-15 17:03:07

*Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all

Julien Le Dem (julien@apache.org)
2021-09-15 17:27:20

*Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81

Brad (bradley.mcelroy@live.com)
2021-09-15 18:29:20

*Thread Reply:* Done!

❤️ Julien Le Dem
Brad (bradley.mcelroy@live.com)
2021-09-16 00:06:41

*Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-17 01:55:08

*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.

How are you dealing with Prefect-library tasks such as the included BigQuery-tasks and such? Is it necessary to create DatasetTask for them to show up in the lineage graph?

Brad (bradley.mcelroy@live.com)
2021-09-17 02:04:19

*Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis

Brad (bradley.mcelroy@live.com)
2021-09-17 02:05:21

*Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach

Brad (bradley.mcelroy@live.com)
2021-09-17 02:06:23

*Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-17 02:16:02

*Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?

Brad (bradley.mcelroy@live.com)
2021-09-17 02:21:57

*Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets

Brad (bradley.mcelroy@live.com)
2021-09-17 02:24:31

*Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task

Brad (bradley.mcelroy@live.com)
2021-09-17 02:28:28

*Thread Reply:* @davzucky

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-17 02:29:12

*Thread Reply:* I see. Is this supported by the airflow-integration?

Brad (bradley.mcelroy@live.com)
2021-09-17 02:29:32

*Thread Reply:* I think so, yes

Brad (bradley.mcelroy@live.com)
2021-09-17 02:31:54

*Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-17 03:18:27

*Thread Reply:* Interesting, I like how dynamic this is

Chris Baynes (chris@contiamo.com)
2021-09-15 09:09:21

Hi all, I have a clarification question about dataset namespaces. What's the difference between a dataset namespace (in the input/output) and a dataSource name (in the dataSource facet)? The dbt integration appears to set those to the same value (e.g. <snowflake://myprofile>), however it seems that Marquez assumes the dataset namespace to be a more generic concept (similar to a nice user provided name like the job namespace).

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 09:29:25

*Thread Reply:* Hey. Generally, the dataSource name should be the namespace of a particular dataset.

In some cases, like Postgres, the dataSource facet is used to provide an additional connection string, with info like the particular host and port that we're connected to.

In the case of Snowflake - or BigQuery, or S3, or other systems where there is only a "global" instance - the dataSource facet does not carry any additional information.
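As an illustration (values made up, and the facet's _producer/_schemaURL fields omitted), a Postgres dataset's dataSource facet might look roughly like:
```
"facets": {
  "dataSource": {
    "name": "postgres://db.foo.com:5432",
    "uri": "postgres://db.foo.com:5432"
  }
}
```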

Chris Baynes (chris@contiamo.com)
2021-09-15 10:11:19

*Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 10:18:59

*Thread Reply:* @Willy Lulciuc what do you think?

Chris Baynes (chris@contiamo.com)
2021-09-15 10:41:20

*Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, <snowflake://my-account1>, <snowflake://my-account2>. I think the new marquez UI with the nice namespace dropdown and job/dataset search is awesome, and I'd expect to be able to filter by job namespace everywhere, but how about being able to filter datasets by source (which would be populated by the OL dataset namespace) and not persist dataset namespaces in the global namespace table?

Julien Le Dem (julien@apache.org)
2021-09-15 18:38:03

The dbt integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt) is pretty awesome but there are still a few improvements we could make. Here are a few thoughts.
• In dbt-ol if the configuration is wrong or missing we will fail silently. This one seems like a good first thing to fix by logging the error to stdout
• We need to wait until the end to know if it worked at all. It would be nice if we checked the config at the beginning and displayed an error right away. Possibly by adding a parent job/run with a start event at the beginning and an end event at the end when all is done.
• While we are sending events at the end, the console will hang until it’s done. It’s not clear that progress is made. We could have a simple progress bar by printing a dot for every event sent. (ex: sending 10 OpenLineage events: .........)
• We could also write at the beginning that the OL events will be sent at the end so that the user knows what to expect.
What do you think? (@Maciej Obuchowski in particular, but anyone using dbt in general)

👀 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2021-09-15 18:43:18

*Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 18:49:21

*Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there

❤️ Julien Le Dem
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 18:51:42

*Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone wouldn't need lineage events, they wouldn't use our wrapper.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 18:56:36

*Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.

We could batch events if the network roundtrips are responsible for majority of the slowdown. However, we can't assume any particular environment.

Maybe just notifying about the progress is the best thing we can do right now.

👀 Mario Measic
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-15 18:58:22

*Thread Reply:* About the second point: I want to add recognizing whether we already have a parent run - for example, if running via Airflow. If not, creating a run for this purpose is a good idea.

Julien Le Dem (julien@apache.org)
2021-09-15 21:31:35

*Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 09:11:31

*Thread Reply:* Done

Ross Turk (ross@datakin.com)
2021-09-16 12:05:10

*Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 17:56:23

*Thread Reply:* Makes sense, also, all clients could use that config

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-18 04:47:08

*Thread Reply:* if dbt could actually stream the events, that would be great.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-18 09:59:12

*Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.

Julien Le Dem (julien@apache.org)
2021-09-15 22:52:09

The split of facets in their own schemas is ready to be merged: https://github.com/OpenLineage/OpenLineage/pull/118

Brad (bradley.mcelroy@live.com)
2021-09-16 00:12:02

Hey @Julien Le Dem I'm going to start a thread here for any issues I run into trying to build a prefect integration

Brad (bradley.mcelroy@live.com)
2021-09-16 00:16:44

*Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284

Brad (bradley.mcelroy@live.com)
2021-09-16 00:18:44

*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response: '{"code":400,"message":"Unable to process JSON"}' The JSON I'm pushing:

```
{
  "eventTime": "2021-09-16T04:00:28.343702",
  "eventType": "START",
  "inputs": {},
  "job": {
    "facets": {},
    "name": "prefect.core.parameter.p",
    "namespace": "default"
  },
  "producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.0.0/integration/prefect>",
  "run": {
    "facets": {},
    "runId": "3bce33cb-9495-4c58-b326-6aac71634ace"
  }
}
```
Does anything look obviously wrong here?

marko (marko.kristian.helin@gmail.com)
2021-09-16 02:41:11

*Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well. Aren’t inputs and facets arrays?

👍 Maciej Obuchowski
Brad (bradley.mcelroy@live.com)
2021-09-16 03:14:54

*Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)

Brad (bradley.mcelroy@live.com)
2021-09-16 03:46:01

*Thread Reply:* okay it was my timestamp 🥲
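For anyone else hitting the same 400, the likely fix (my assumption: Marquez wants a timezone-aware ISO-8601 eventTime) is something like:
```python
from datetime import datetime, timezone

# produces e.g. '2021-09-16T04:00:28.343702+00:00' - note the explicit offset
event_time = datetime.now(timezone.utc).isoformat()
```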

Brad (bradley.mcelroy@live.com)
2021-09-16 19:07:16

*Thread Reply:* Okay - I've got a simply working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py

Brad (bradley.mcelroy@live.com)
2021-09-16 19:07:37

*Thread Reply:* I might move this into a proper PR @Julien Le Dem

Brad (bradley.mcelroy@live.com)
2021-09-16 19:08:12

*Thread Reply:* Successfully got a basic prefect flow working

Brad (bradley.mcelroy@live.com)
2021-09-16 02:11:53

A question about DatasetType - is there a representation for a file-like type? For files stored in S3/FTP/NFS etc (assuming a fully resolvable url)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 09:53:24

*Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 10:04:09

*Thread Reply:* I've taken a look at your repo. Looks great so far!

One thing I've noticed: I don't think you need to use any stuff from Marquez to emit events. Its lineage ingestion API is deprecated - you can just use the openlineage-python client. If there's something you think is missing from it, feel free to write that here or open an issue.

Brad (bradley.mcelroy@live.com)
2021-09-16 17:12:31

*Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?

Brad (bradley.mcelroy@live.com)
2021-09-16 17:13:26

*Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 17:34:46

*Thread Reply:* Yes 🙌

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 06:04:30

When trying to fix failing checks, i see integration-test-integration-airflow fail with:
```
#!/bin/bash -eo pipefail
if [[ GCLOUD_SERVICE_KEY,GOOGLE_PROJECT_ID == "" ]]; then
  echo "No required environment variables to check; moving on"
else
  IFS="," read -ra PARAMS <<< "GCLOUD_SERVICE_KEY,GOOGLE_PROJECT_ID"

  for i in "${PARAMS[@]}"; do
    if [[ -z "${!i}" ]]; then
      echo "ERROR: Missing environment variable {i}" >&2

      if [[ -n "" ]]; then
        echo "" >&2
      fi

      exit 1
    else
      echo "Yes, ${i} is defined!"
    fi
  done
fi

ERROR: Missing environment variable {i}

Exited with code exit status 1
CircleCI received exit code 1
```
However i haven't touched airflow at all.. can somebody help please?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 06:59:34

*Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - everyone could create malicious PR and dump secrets

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:00:29

*Thread Reply:* So, they will fail and there's nothing to do from your side 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:00:55

*Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 07:08:03

*Thread Reply:* ah okie, good to know. And in build-integration-spark: Could not resolve all artifacts. Is that also a known issue? Or something on my side that I could fix?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:11:12

*Thread Reply:* Looks like gradle server problem?
> Could not get resource 'https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module'.
> Could not GET 'https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module'. Received status code 500 from server: Internal Server Error

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:34:44

*Thread Reply:* After retry, there's a spotless error:
```
+········.orElse(Collections.emptyList()).stream()
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:35:15

*Thread Reply:* I think this is due to mismatch between behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 07:40:01

*Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-16 07:44:31

*Thread Reply:* For spotless, you can just fix this one line 🙂 Though I don't guarantee that tests that run later will pass, so you might need Java 8 for later testing

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 08:04:36

*Thread Reply:* yup looks better now

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 08:04:41

*Thread Reply:* thanks

Tomas Satka (satka.tomas@gmail.com)
2021-09-16 14:27:02

*Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂

Brad (bradley.mcelroy@live.com)
2021-09-16 20:36:28

Hey team - I've opened https://github.com/OpenLineage/OpenLineage/pull/293 for a very WIP prefect integration

🙌 Maciej Obuchowski
Brad (bradley.mcelroy@live.com)
2021-09-16 20:37:27

*Thread Reply:* @Thomas Fredriksen would love any feedback

Thomas Fredriksen (thomafred90@gmail.com)
2021-09-17 04:21:13

*Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-17 07:28:33

*Thread Reply:* Hey @Brad, it looks great!

I've seen you're using task_qualified_name to name datasets and I don't think it's the right way. I'd take a look at naming conventions here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

Getting that right is key to making sure that lineage is properly tracked between systems - for example, if you use Prefect to schedule dbt runs or pyspark jobs, the unified naming makes sure that all those integrations properly refer to the same dataset.
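For a concrete sense of those conventions, a couple of examples distilled from Naming.md:
```
postgres:  namespace = postgres://{host}:{port}    name = {database}.{schema}.{table}
snowflake: namespace = snowflake://{account name}  name = {database}.{schema}.{table}
```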

Brad (bradley.mcelroy@live.com)
2021-09-17 08:12:50

*Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations.. I think things like dbt or pyspark are straight forward (we could add special handling for tasks like that) but what about regular transformation type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-17 08:28:04

*Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.

Second, in Airflow we have a concept of Extractor where you can write specialized code to expose datasets. For example, for BigQuery we extract datasets from query plan. Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused. Also, this way allows to emit additional facets, some of which are really useful - like query statistics for BigQuery, and data quality tests for dbt.

Third, if we're talking about generalized tasks like FunctionTask or ShellTask, then I think the right way is to expose functionality to user to expose lineage themselves. I'm not sure how exactly that would look in Prefect.

Brad (bradley.mcelroy@live.com)
2021-09-19 23:03:14

*Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (The Result object is the way in which data is serialized and persisted).

> Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused This definitely applies to prefect and the similar tasks exist in prefect and we should definitely leverage the common library in this case.

> Third, if we're talking about generalized tasks like FunctionTask or ShellTask, then I think the right way is to expose functionality to user to expose lineage themselves. I'm not sure how exactly that would look in Prefect.
Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.

> First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets. My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.

🙌 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-20 06:30:51

*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.
A first version of an integration doesn't have to be perfect. In particular, not handling this use case would be okay, since it does not lock us into some particular way of doing it later.

> My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.
I'd think of two options first, before modeling it as a dataset: Won't the existence of an event be enough? After all, we'll still have it despite it not having any input and output datasets. If not, then wouldn't a custom run or job facet be a better fit?

Brad (bradley.mcelroy@live.com)
2021-09-23 17:27:49

*Thread Reply:* > Won’t the existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets.
Duh, yep you’re right @Maciej Obuchowski, I’m overthinking this. I’m going to clean this up based on your comments

Thomas Fredriksen (thomafred90@gmail.com)
2021-10-06 03:39:28

*Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?

Brad (bradley.mcelroy@live.com)
2021-10-06 03:40:44

*Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running

Thomas Fredriksen (thomafred90@gmail.com)
2021-10-06 03:48:14

*Thread Reply:* nice!

Brad (bradley.mcelroy@live.com)
2021-10-06 03:48:54

*Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week

Brad (bradley.mcelroy@live.com)
2021-10-06 03:49:09

*Thread Reply:* but I do plan to pick this up next week

Brad (bradley.mcelroy@live.com)
2021-10-06 03:49:23

*Thread Reply:* (the additional changes I mention above)

Thomas Fredriksen (thomafred90@gmail.com)
2021-10-06 03:50:08

*Thread Reply:* looking forward to it 🙂 let me know if you need any help!

Brad (bradley.mcelroy@live.com)
2021-10-06 03:50:34

*Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out

🙌 Thomas Fredriksen, Maciej Obuchowski
Adam Pocock (adam.pocock@oracle.com)
2021-09-20 17:38:51

Is there a preferred academic citation for OpenLineage? I’m writing a paper on the provenance system in our machine learning library, and I’d like to cite OpenLineage as an example of future work on data lineage to integrate with.

🙌 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-09-20 19:18:53

*Thread Reply:* I think you can refer to https://openlineage.io/

Julien Le Dem (julien@apache.org)
2021-09-20 19:31:30

We’re starting to see the beginning of larger contributions (Spark streaming, prefect, …) and I think we need to define a way to accept those contributions incrementally. If we take the example of Streaming (Spark streaming, Flink or Beam) support (but really this applies in general, sorry to pick on you Tomas, this is great!): The first Spark streaming PR ( https://github.com/OpenLineage/OpenLineage/pull/279 ) lays the groundwork for testing spark streaming but there’s more work to have a full feature. I’m in favor of merging Spark streaming support into main once it’s working end to end (possibly with partial input/output coverage). So I see 2 options:

  1. start a branch for spark streaming support. Have PRs like this one go into it until it’s completed (smaller reviews). Then merge the whole thing as a PR in main when it’s finished
  2. Keep working on that PR until it’s fully implemented, but it will get big, and make reviews difficult.

I have seen the model 1) work well. It’s easier to do multiple smaller reviews for larger projects.
👍 Ross Turk, Maciej Obuchowski, Faouzi
Yannick Endrion (yannick.endrion@gmail.com)
2021-09-24 05:10:04

Thank you @Ross Turk for this really useful article: https://openlineage.io/blog/dbt-with-marquez/?s=03 Is anyone aware of additional environments being supported by the dbt<->OpenLineage<->Marquez integration? I think only Snowflake and BigQuery are supported now. I am really interested in SQLServer or even Dremio (which could be great because it is capable of reading from multiple DBs).

Thank you

🎉 Minkyu Park, Ross Turk
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-24 05:15:31

*Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467

The first step should be to add SQLServer or Dremio to the dataset naming schema here https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
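To make the shape of the change concrete, a rough sketch (assumptions: the real function in openlineage/common/provider/dbt.py differs in details, and the mssql:// namespace format for SQLServer is a guess until it's added to Naming.md):
```python
def extract_namespace(self, profile: dict) -> str:
    # Sketch: mirrors the existing snowflake/bigquery branches,
    # with a hypothetical sqlserver branch added.
    if profile['type'] == 'snowflake':
        return f"snowflake://{profile['account']}"
    elif profile['type'] == 'bigquery':
        return "bigquery"
    elif profile['type'] == 'sqlserver':
        return f"mssql://{profile['server']}:{profile.get('port', 1433)}"
    else:
        raise NotImplementedError(
            f"Adapter {profile['type']} is not supported yet."
        )
```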

Yannick Endrion (yannick.endrion@gmail.com)
2021-10-04 16:22:59

*Thread Reply:* Thank you @Maciej Obuchowski, I gave it a try but without success yet. Not sure where I am supposed to add the sqlserver naming schema... If you have any documentation that I could read I would be glad =) Many thanks

Julien Le Dem (julien@apache.org)
2021-10-07 15:13:43

*Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake

Julien Le Dem (julien@apache.org)
2021-10-07 15:14:30

*Thread Reply:* Snowflake
See: Object Identifiers — Snowflake Documentation
Datasource hierarchy:
• account name
Naming hierarchy:
• Database: {database name} => unique across the account
• Schema: {schema name} => unique within the database
• Table: {table name} => unique within the schema
Identifier:
• Namespace: snowflake://{account name}
  ◦ Scheme = snowflake
  ◦ Authority = {account name}
• Name: {database}.{schema}.{table}
  ◦ URI = snowflake://{account name}/{database}.{schema}.{table}

Marty Pitt (martypitt@vyne.co)
2021-09-24 06:53:05

Hi all. I'm the Founder / CTO of a data discovery & transformation platform that captures very rich lineage information. We're interested in exposing / making our lineage data consumable via open standards, which is what lead me to this project. A couple of questions:

A) Am I right in considering that's the goal of this project? B) Are you also considering provenance as well as lineage? C) What's a good starting point to understand the models we should be exposing our data in, to make it consumable?

Marty Pitt (martypitt@vyne.co)
2021-09-24 07:06:20

*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)

Our platform performs automated enrichment and processing of data. In doing so, we often make calls to functions or out to other data services (such as APIs, or SELECTs against databases). We capture the inputs that pass to these, along with the outputs. (And, if the input is derived from other outputs, we capture the full chain, right back to the root).

That's the kinda stuff our customers are really interested in, and we feel like there's value in making is consumable.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-24 08:47:35

*Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-24 08:51:16

*Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.

We could of course model this situation, but that would capture for example schema of those parameters. Not their values.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-24 08:52:16

*Thread Reply:* I think this might be better suited for https://opentelemetry.io/

Marty Pitt (martypitt@vyne.co)
2021-09-24 10:55:54

*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.

Our customers use Vyne to fetch data from lots of different sources. Eg:

> "Whenever a trade is booked, calculate it's compliance against these regulations, to report to the regulators". or

> "Whenever a customer buys a $thing, capture the transaction data, client data, and account data, and store it in this table." Providing answers to those questions involves fetching and transforming data, before storing it, or outputting it. We capture all that data, on a per-attribute basis, so we can answer the question "how did we get this value?" That's the lineage information we want to publish.

Michael Collado (collado.mike@gmail.com)
2021-09-30 15:10:51

*Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well

Drew Bittenbender (drew@salt.io)
2021-09-29 11:10:29

Hello. I am trying the openlineage-airflow integration with Marquez as the backend and have 3 questions.

  1. Does it only work for PostgresOperators?
  2. Which is the recommended integration: marquez-airflow or openlineage-airflow
  3. How do you enable more detailed logging? I tried OPENLINEAGE_LOG_LEVEL and MARQUEZ_LOG_LEVEL and neither seemed to affect logging. I assume this is logged to the airflow worker
Faouzi (faouzi@dataroots.io)
2021-09-29 13:46:59

*Thread Reply:* Hello @Drew Bittenbender!

For your two first questions:

• Yes right now only the PostgresOperator is integrated. I learnt it the hard way ^_^. Spent hours trying with MySQL. There were attempts to integrate with MySQL actually. If engineers do not integrate it I will allocate myself some time to try to implement other airflow db operators. • Use the openlineage one. It is the recommended approach now.

Drew Bittenbender (drew@salt.io)
2021-09-29 13:49:41

*Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot

Faouzi (faouzi@dataroots.io)
2021-09-29 13:52:16

*Thread Reply:* To the best of my knowledge there is no documentation to guide through the design of your own extractor. So yes we need to read the code. Here a link where you can see how they did for postgre extractor and others. https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
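Roughly, the pattern in that directory is: subclass the base extractor and return the task's metadata (input/output datasets). A skeleton, with the caveat that this is a sketch and the exact base-class interface in extractors/base.py may differ:
```python
from openlineage.airflow.extractors.base import BaseExtractor


class MyOperatorExtractor(BaseExtractor):
    operator_class = MyOperator  # hypothetical: the operator this extractor covers

    def extract(self):
        # Inspect self.operator to build input/output dataset metadata,
        # mirroring what postgres_extractor.py does for PostgresOperator.
        ...
```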

👍 Drew Bittenbender
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-09-30 05:08:53

*Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id macro and use openlineage-python library inside, instead of implementing extractor.

Michael Collado (collado.mike@gmail.com)
2021-09-30 15:14:47

*Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly
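A rough sketch of that wiring (the lineage_run_id macro invocation here is my assumption of the syntax - check the integration docs):
```python
from airflow.operators.python_operator import PythonOperator


def process_data(parent_run_id: str):
    # Inside the script, emit OpenLineage events with openlineage-python,
    # attaching a ParentRunFacet built from parent_run_id
    # (see the Prefect thread above for an emit example).
    ...


task = PythonOperator(
    task_id="process_data",
    python_callable=process_data,
    op_kwargs={"parent_run_id": "{{ lineage_run_id(run_id, task) }}"},
)
```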

Julien Le Dem (julien@apache.org)
2021-10-07 15:09:55

*Thread Reply:* To clarify on what airflow operators are supported out of the box:
• postgres
• bigquery
• snowflake
• Great expectations (with extra config)
See: https://github.com/OpenLineage/OpenLineage/blob/3a1ccbd854bbf202bbe6437bf81786cb01[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py
Mysql is not at the moment. We should track it as an issue

Yuki Tannai (tannai-yuki@dmm.com)
2021-09-30 09:21:35

Hi there! I’m trying to enhance the lineage functionality of a data infrastructure I’m working on. All of the tools I found only visualize the relationships between tables before and after the transformation, but the DataHub RFC discusses Field Level Lineage, which I thought was close to the functionality I was looking for. Does OpenLineage support the same functionality? https://datahubproject.io/docs/rfc/active/1841-lineage/field_level_lineage/

Julien Le Dem (julien@apache.org)
2021-10-07 15:03:40

*Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148

👀 Yuki Tannai, Ricardo Gaspar
Julien Le Dem (julien@apache.org)
2021-10-07 15:04:36

*Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future

👍 Yuki Tannai
павел клопотюк (klopotuk@gmail.com)
2021-10-04 14:27:24

Hello, everyone. I'm trying to work with OL and Airflow 2.1.4 and it doesn't work. I found that OL is supported for Airflow 1.10.12++. Does it support Airflow 2.X.Y?

Ross Turk (ross@datakin.com)
2021-10-04 15:38:47

*Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here: https://github.com/OpenLineage/OpenLineage/issues/205

павел клопотюк (klopotuk@gmail.com)
2021-10-05 03:01:54

*Thread Reply:* Thank you for your reply!

Julien Le Dem (julien@apache.org)
2021-10-07 15:02:23

*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305 We’re labelling it experimental because the config step might change as discussions in the Airflow GitHub evolve. In its current state it will track successful jobs.

SAM (skhettri@gmail.com)
2021-10-04 23:14:26

Hi All, I’m working on openlineage-dbt integration with Marquez as backend. I want to integrate OL with DBT cloud, would you please help to provide steps that I need to follow?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-05 04:18:42

*Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview

✅ SAM
Julien Le Dem (julien@apache.org)
2021-10-07 14:58:24

*Thread Reply:* @SAM Let us know of your progress.

👍 SAM
ale (alessandro.lollo@gmail.com)
2021-10-05 16:23:41

Hey folks 😊 I’m trying to run dbt-ol with a Redshift target, but I get the following error message:
```
Traceback (most recent call last):
  File "/usr/local/bin/dbt-ol", line 61, in <module>
    main()
  File "/usr/local/bin/dbt-ol", line 54, in main
    events = processor.parse().events()
  File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 97, in parse
    self.extract_dataset_namespace(profile)
  File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 368, in extract_dataset_namespace
    self.dataset_namespace = self.extract_namespace(profile)
  File "/usr/local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 382, in extract_namespace
    raise NotImplementedError(
NotImplementedError: Only 'snowflake' and 'bigquery' adapters are supported right now. Passed redshift
```
I know that Redshift is not the best cloud DWH we can use… 😅 But, still….do you have any plan to support it? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-05 16:41:30

*Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.

ale (alessandro.lollo@gmail.com)
2021-10-05 16:43:39

*Thread Reply:* Hey @Maciej Obuchowski 😊 Yep, will do now! Thanks!

ale (alessandro.lollo@gmail.com)
2021-10-05 16:46:26

*Thread Reply:* Well...will do tomorrow morning 😅

ale (alessandro.lollo@gmail.com)
2021-10-06 03:03:16

*Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318

Julien Le Dem (julien@apache.org)
2021-10-07 14:51:08

*Thread Reply:* Thanks a lot. I pulled it in the current project.

👍 ale
ale (alessandro.lollo@gmail.com)
2021-10-08 05:48:28

*Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 05:53:05

*Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

ale (alessandro.lollo@gmail.com)
2021-10-08 05:53:21

*Thread Reply:* Sure!

ale (alessandro.lollo@gmail.com)
2021-10-08 05:54:21

*Thread Reply:* will work on this today and I’ll try to submit a PR by EOD

ale (alessandro.lollo@gmail.com)
2021-10-08 06:36:12

*Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 06:39:35

*Thread Reply:* Host would be something like examplecluster.<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com right?

ale (alessandro.lollo@gmail.com)
2021-10-08 07:13:51

*Thread Reply:* Yep, let me update the PR

ale (alessandro.lollo@gmail.com)
2021-10-08 07:27:42

*Thread Reply:* Done

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 07:31:40

*Thread Reply:* 🙌

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 07:35:30

*Thread Reply:* If you want to look at dbt integration itself, there are two things:

We need to determine how Redshift adapter reports metrics https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L412

And how we can create namespace and job name based on the job naming schema that you created: https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L512

One way to get this info is to run dbt yourself and look at the resulting metadata files - in the target dir of the dbt directory

ale (alessandro.lollo@gmail.com)
2021-10-08 08:33:31

*Thread Reply:* I figured out how to generate the namespace. But I can’t understand which of the JSON files is inspected for metrics. Is it run_results.json ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 09:48:50

*Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too

ale (alessandro.lollo@gmail.com)
2021-10-08 11:02:32

*Thread Reply:* Ok thanks!

ale (alessandro.lollo@gmail.com)
2021-10-08 11:11:57

*Thread Reply:* Should be stats:rows:value

ale (alessandro.lollo@gmail.com)
2021-10-08 11:19:59

*Thread Reply:* Regarding namespace: if env_var is used in profiles.yml , how is this handled now?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 11:44:50

*Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?

ale (alessandro.lollo@gmail.com)
2021-10-08 11:53:52

*Thread Reply:* Exactly

ale (alessandro.lollo@gmail.com)
2021-10-11 07:10:38

*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:18:01

*Thread Reply:* Do you want to run jinja on the dbt profile?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:20:18

*Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml , but we only take target path and profile name from it.

ale (alessandro.lollo@gmail.com)
2021-10-11 07:20:32

*Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:23:59

*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included. The method is pretty simple:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        undefined_error(msg)
```
ale (alessandro.lollo@gmail.com)
2021-10-11 07:25:07

*Thread Reply:* Oh cool! I will definitely use this one!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:25:09

*Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free

ale (alessandro.lollo@gmail.com)
2021-10-11 07:26:34

*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:27:01

*Thread Reply:* yes

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:27:14

*Thread Reply:* dbt is on Apache license 🙂

ale (alessandro.lollo@gmail.com)
2021-10-11 07:28:06

*Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?

ale (alessandro.lollo@gmail.com)
2021-10-11 07:28:28

*Thread Reply:* I’m asking for guidance here 😊

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:34:44

*Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples

just with the env_var method passed to the render method 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 07:37:05

*Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L176
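In other words, something along these lines (untested sketch; env_var here is the standalone function from the snippet above):
```
import yaml
from jinja2 import Template

def load_yaml_with_jinja(path: str) -> dict:
    with open(path) as f:
        raw = f.read()
    # render the {{ env_var(...) }} calls before parsing the yaml
    rendered = Template(raw).render(env_var=env_var)
    return yaml.safe_load(rendered)
```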

ale (alessandro.lollo@gmail.com)
2021-10-11 07:38:53

*Thread Reply:* ok, got it. Will try to implement following your suggestions. Thanks @Maciej Obuchowski 🙌

🙌 Maciej Obuchowski
ale (alessandro.lollo@gmail.com)
2021-10-11 08:36:13

*Thread Reply:* We need to:

  1. load the template profile from the profiles.yml
  2. replace any env vars we find

For the first step, we can use jinja2.Template. However, to replace the env vars we find, we have to actually search for those env vars… 🤔
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 08:43:06

*Thread Reply:* The dbt method implements that:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        undefined_error(msg)
```
ale (alessandro.lollo@gmail.com)
2021-10-11 08:45:54

*Thread Reply:* Ok, but I need to pass var to the env_var method. And to pass the var value, I need to look into the loaded Template and search for env var names…

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 08:46:54

*Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 08:47:45

*Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples

ale (alessandro.lollo@gmail.com)
2021-10-11 08:51:19

*Thread Reply:* Ok, will try

ale (alessandro.lollo@gmail.com)
2021-10-11 09:37:49

*Thread Reply:* I'm trying to run pip install -e ".[dev]" so that I can test my changes, but I get
```
ERROR: Could not find a version that satisfies the requirement openlineage-integration-common[dbt]==0.2.3 (from openlineage-dbt[dev]) (from versions: 0.0.1rc7, 0.0.1rc8, 0.0.1, 0.1.0rc5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)
ERROR: No matching distribution found for openlineage-integration-common[dbt]==0.2.3
```
I don't understand what I'm doing wrong…

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 09:41:47

*Thread Reply:* can you try installing it manually?

pip install openlineage-integration-common[dbt]==0.2.3

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 09:42:13

*Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files

PyPI
ale (alessandro.lollo@gmail.com)
2021-10-11 09:44:57

*Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced. Installing from the public pypi resolved the issue

ale (alessandro.lollo@gmail.com)
2021-10-11 12:04:55

*Thread Reply:* Can't seem to make env_var work as the render method of a Template 😅

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 12:57:07

*Thread Reply:* try this:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 12:57:09

*Thread Reply:* ```
import os
from typing import Optional
from jinja2 import Template

def env_var(var: str, default: Optional[str] = None) -> str:
    """The env_var() function. Return the environment variable named 'var'.
    If there is no such environment variable set, return the default.

    If the default is None, raise an exception for an undefined variable.
    """
    if var in os.environ:
        return os.environ[var]
    elif default is not None:
        return default
    else:
        msg = f"Env var required but not provided: '{var}'"
        raise Exception(msg)

if __name__ == '__main__':
    t = Template("Hello {{ env_var('ENV_VAR') }}!")
    print(t.render(env_var=env_var))
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-11 12:57:42

*Thread Reply:* works for me:
```
mobuchowski@thinkpad [18:57:14] [~] -> % ENV_VAR=world python jinja_example.py
Hello world!
```

ale (alessandro.lollo@gmail.com)
2021-10-11 16:59:13

*Thread Reply:* Finally 😅 https://github.com/OpenLineage/OpenLineage/pull/328

There are minimal tests for Redshift and env vars. Feedback and suggestions are welcome!

ale (alessandro.lollo@gmail.com)
2021-10-12 03:10:45

*Thread Reply:* Hi @Maciej Obuchowski 😊 Regarding this comment https://github.com/OpenLineage/OpenLineage/pull/328#discussion_r726586564

How can we distinguish between snowflake, bigquery and redshift in this method?

A simple, but not very clean, solution would be to split this
```
bytes = get_from_multiple_chains(
    node.catalog_node,
    [
        ['stats', 'num_bytes', 'value'],  # bigquery
        ['stats', 'bytes', 'value'],      # snowflake
        ['stats', 'size', 'value']        # redshift (Note: size = count of 1MB blocks)
    ]
)
```
into two pieces, one checking for snowflake and bigquery and the other checking for redshift.

A better solution would be to have the profile type inside the method node_to_output_dataset, but I'm struggling to understand how to do that

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 05:35:00

*Thread Reply:* Well, why not do something like
```
bytes = get_from_multiple_chains(...rest of stuff)

if adapter == 'redshift':
    bytes *= 1024 * 1024
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 05:36:49

*Thread Reply:* we can store adapter type in the class

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 05:38:47

*Thread Reply:* well, I've looked at last commit and that's exactly what you did 👍

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 05:40:35

*Thread Reply:* Now, have you tested your branch on a real Redshift cluster? I don't think we 100% need automated tests for that now, but it would be nice to have confirmation that it works.

ale (alessandro.lollo@gmail.com)
2021-10-12 06:35:04

*Thread Reply:* Not yet, but I'll try to do that this afternoon. Need to figure out how to build the lib locally, then I can use it to test with Redshift

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 06:40:58

*Thread Reply:* I think pip install -e .[dbt] in common directory should be enough

ale (alessandro.lollo@gmail.com)
2021-10-12 09:29:13

*Thread Reply:* I was able to run my local branch with my Redshift cluster, and metadata is pushed to Marquez. However, I'm not sure about the namespace. I also see exceptions in the Marquez logs

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 09:33:26

*Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 09:44:17

*Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet

ale (alessandro.lollo@gmail.com)
2021-10-12 09:46:17

*Thread Reply:* Regarding the namespace, I will check it and figure it out 😊 Regarding the error: in the context of this PR, is it something I should worry about or not?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 09:54:17

*Thread Reply:* I think not in the context of the PR. It certainly deserves a separate issue in the Marquez repository.

ale (alessandro.lollo@gmail.com)
2021-10-12 10:24:38

*Thread Reply:* 👍

ale (alessandro.lollo@gmail.com)
2021-10-12 10:24:51

*Thread Reply:* Is there anything else I can do to improve the PR?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 10:27:44

*Thread Reply:* did you figure out the namespace stuff? I think it's ready to be merged outside of that

ale (alessandro.lollo@gmail.com)
2021-10-12 10:49:06

*Thread Reply:* Not yet

ale (alessandro.lollo@gmail.com)
2021-10-12 10:58:07

*Thread Reply:* Ok, I figured it out. When running dbt locally, we connect to Redshift using an SSH tunnel. dbt runs on Docker, hence it can access the tunnel using host.docker.internal

ale (alessandro.lollo@gmail.com)
2021-10-12 10:58:16

*Thread Reply:* So the namespace is correct

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 11:04:12

*Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.

ale (alessandro.lollo@gmail.com)
2021-10-12 11:04:37

*Thread Reply:* 👍

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-13 05:29:48

*Thread Reply:* merged your PR 🙌

ale (alessandro.lollo@gmail.com)
2021-10-13 10:54:09

*Thread Reply:* 🎉

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-13 12:01:20

*Thread Reply:* I think I'm going to change it up a bit. The problem is that we can try to render jinja everywhere, including comments. I tried to make it skip unknown methods and values here, but I think the right solution is to load the yaml, and then try to render jinja for values.
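Something like this (rough sketch, reusing the same env_var function as before):
```
import yaml
from jinja2 import Template

def render_values(node):
    # walk the parsed yaml and render jinja only on string values,
    # so comments and unknown constructs never reach the jinja parser
    if isinstance(node, dict):
        return {key: render_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [render_values(value) for value in node]
    if isinstance(node, str):
        return Template(node).render(env_var=env_var)
    return node

with open("profiles.yml") as f:
    profile = render_values(yaml.safe_load(f))
```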

ale (alessandro.lollo@gmail.com)
2021-10-13 14:27:37

*Thread Reply:* Ok sounds good to me!

SAM (skhettri@gmail.com)
2021-10-06 10:50:43

Hey there, I'm not sure why I'm getting the below error after I ran OPENLINEAGE_URL=<http://localhost:5000> dbt-ol run, although running dbt debug doesn't show any error. Pls help.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 10:54:32

*Thread Reply:* Does it work with simply dbt run?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 10:55:51

*Thread Reply:* also, do you have dbt-snowflake installed?

SAM (skhettri@gmail.com)
2021-10-06 11:00:42

*Thread Reply:* it works with dbt run

👀 Maciej Obuchowski
SAM (skhettri@gmail.com)
2021-10-06 11:01:22

*Thread Reply:* no i haven’t installed dbt-snowflake

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 12:04:19

*Thread Reply:* what the dbt output says - the snowflake profile with dev target - is that what you meant to run, or was it something else?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 12:04:46

*Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath

SAM (skhettri@gmail.com)
2021-10-06 12:19:27

*Thread Reply:* this is my profiles.yml file:
```
snowflake:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xxxxxxx

      # User/password auth
      user: xxxxxx
      password: xxxxx

      role: poc_db_temp_fullaccess
      database: POC_DB
      warehouse: poc_wh
      schema: temp
      threads: 2
      client_session_keep_alive: False
      query_tag: dbt_ol
```
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 12:26:39

*Thread Reply:* Yes, it looks like everything is okay on your side...

SAM (skhettri@gmail.com)
2021-10-06 12:28:19

*Thread Reply:* maybe I'll restart my machine and try again

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-06 12:30:25

*Thread Reply:* can you try OPENLINEAGE_URL=<http://localhost:5000> dbt-ol debug

SAM (skhettri@gmail.com)
2021-10-07 05:59:03

*Thread Reply:* Actually I had to use a venv, which fixed the above issue. However, I ran into another problem, which is no jobs / datasets found in Marquez:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-07 06:00:28

*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322

SAM (skhettri@gmail.com)
2021-10-07 06:00:46

*Thread Reply:* oh, thanks a lot

Julien Le Dem (julien@apache.org)
2021-10-07 14:50:01

*Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900

Willy Lulciuc (https://openlineage.slack.com/team/U01DCMDFHBK)
👍 SAM
SAM (skhettri@gmail.com)
2021-10-07 23:23:26

*Thread Reply:* Hi, openlineage-dbt==0.2.3 worked, thanks a lot for the quick fix.

Alex P (alexander.pelivan@scout24.com)
2021-10-07 07:46:16

Hi, I just started playing around with Marquez. When submitting some lineage data, after some experimenting, the visualisation becomes a bit cluttered with all the naive attempts at building a meaningful graph. Can I clear this up somehow? Or is there a tip on how to hide certain information?

Alex P (alexander.pelivan@scout24.com)
2021-10-07 07:46:59
Alex P (alexander.pelivan@scout24.com)
2021-10-07 09:51:40

*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything. ./docker/up.sh

👍 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-07 12:28:25

*Thread Reply:* I guess that it's the easiest way now. There should be API for that.

Willy Lulciuc (willy@datakin.com)
2021-10-07 14:09:50

*Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754

Labels
feature, api
Willy Lulciuc (willy@datakin.com)
2021-10-07 14:10:36

*Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.

Willy Lulciuc (willy@datakin.com)
2021-10-07 14:12:52

*Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release

Julien Le Dem (julien@apache.org)
2021-10-07 14:39:03

*Thread Reply:* Thanks Willy.

🙌 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2021-10-07 14:48:37

*Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323

Labels
proposal
Julien Le Dem (julien@apache.org)
2021-10-07 13:36:48

The next OpenLineage monthly meeting is on the 13th. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting please chime in here if you’d like a topic to be added to the agenda

🙌 Willy Lulciuc, Maciej Obuchowski, Peter Hicks
❤️ Willy Lulciuc, Maciej Obuchowski, Peter Hicks
Julien Le Dem (julien@apache.org)
2021-10-13 10:47:49

*Thread Reply:* Reminder that the meeting is today. See you soon

Julien Le Dem (julien@apache.org)
2021-10-13 19:49:21

*Thread Reply:* The recording and notes of the meeting are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Oct13th2021

Willy Lulciuc (willy@datakin.com)
2021-10-07 14:37:05

@channel: We've recently become aware that our integration with dbt no longer works with the latest dbt manifest version (v3), see original discussion. The manifest version change was introduced in dbt 0.21, see diff. That said, we do have a fix: PR #322 contributed by @Maciej Obuchowski! Here's our plan to roll out the openlineage-dbt hotfix for those using the latest version of dbt (NOTE: for those using an older dbt version, you will NOT be affected by this bug):

Releasing OpenLineage 0.2.3 with dbt v3 manifest support:

  1. Branch off the 0.2.2 tagged commit, and create an openlineage-0.2.x branch
  2. Cherry-pick the commit with the dbt manifest v3 fix
  3. Release the 0.2.3 patch release

We will be releasing 0.2.3 today. Please reach out to us with any questions!
Samjhana Khettri (https://openlineage.slack.com/team/U02EYPQNU58)
🙌 Mario Measic, Minkyu Park, Peter Hicks
Julien Le Dem (julien@apache.org)
2021-10-07 14:55:35

*Thread Reply:* For people following along: dbt changed the schema of its metadata, which broke the OpenLineage integration. However, we were a bit too stringent in validating the schema version (they increment it every time, even if it's backwards compatible, which it is in this case). We will fix that so that future compatible changes don't prevent the OL integration from working.

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-07 16:44:28

*Thread Reply:* As one of the main integrations, it would be good to connect more with the dbt community for the next releases, e.g. by testing the release candidates 👍

Thanks for the PR

💯 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-10-07 16:46:40

*Thread Reply:* Yeah, I totally agree with you. We should also be more proactive and more aware of what's coming in future dbt releases. Sorry if you were affected by this bug :ladybug:

Willy Lulciuc (willy@datakin.com)
2021-10-07 18:12:22

*Thread Reply:* We've released OpenLineage 0.2.3 with the hotfix adding dbt v3 manifest support, see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3

You can download and install openlineage-dbt 0.2.3 with the fix using:

$ pip3 install openlineage-dbt==0.2.3

Drew Bittenbender (drew@salt.io)
2021-10-07 19:02:37

Hello. I have a question about dbt-ol. I run dbt in a docker container and alias the dbt command to execute in that docker container. dbt-ol doesn't seem to use that alias. Do you know of a way to force it to use the alias?...or is there an alternative way to get the lineage into Marquez?

Julien Le Dem (julien@apache.org)
2021-10-07 21:10:36

*Thread Reply:* @Maciej Obuchowski might know

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 04:23:17

*Thread Reply:* @Drew Bittenbender dbt-ol always calls the dbt command directly now, without spawning a shell - so it does not have access to bash aliases.

Can you elaborate on your use case? Do you mean that dbt in your path does docker run or something like this? It still might be a problem if we don't have access to the artifacts dbt generates in the target directory.

Drew Bittenbender (drew@salt.io)
2021-10-08 10:59:32

*Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.

It seems that by installing openlineage-dbt in a virtual environment, it pulls down its own version of dbt which it calls inline, rather than shelling out and executing the dbt setup resident in the host system. I understand that opening a shell is a security risk, so that is understandable.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 11:05:00

*Thread Reply:* It does not pull down anything, it just assumes that dbt is in the system. It would fail if it isn't.

For now I think you could build your own image based on the official one, and install openlineage-dbt inside, something like:

```
FROM fishtownanalytics/dbt:0.21.0
RUN pip install openlineage-dbt
ENTRYPOINT ["dbt-ol"]
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 11:05:15

*Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run
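For example, something like this (untested; the image tag is made up, and the mount path depends on how the base image expects the project to be mounted):
```
docker run --rm \
  -e OPENLINEAGE_URL=http://host.docker.internal:5000 \
  -v $(pwd):/usr/app \
  my-dbt-ol-image run
```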

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 11:06:55

*Thread Reply:* Also, to make sure that using a shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible on the host, the only option is to have dbt-ol in the container.

SAM (skhettri@gmail.com)
2021-10-08 07:00:43

Hi, I found the below issues, not sure what the root cause is:

  1. The Marquez UI does not show any jobs/datasets, but if I search my table name, it only shows up in the search result section.
  2. After running dbt docs generate there is no schema information available in Marquez?
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 08:16:37

*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the run takes into account.

SAM (skhettri@gmail.com)
2021-10-08 08:24:57

*Thread Reply:* oh got it, since it's in default, I need to click on it and choose my dbt profile's account name. thnx

SAM (skhettri@gmail.com)
2021-10-08 11:25:22

*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-08 11:26:18

*Thread Reply:* Do they have it in dbt docs?

SAM (skhettri@gmail.com)
2021-10-08 11:33:59

*Thread Reply:* I prepared this yaml file, not sure if this is what you asked for

ale (alessandro.lollo@gmail.com)
2021-10-12 04:14:08

Hey folks 😊 DCO checks on this PR https://github.com/OpenLineage/OpenLineage/pull/328 seem to be stuck. Any suggestions on how to unblock it?

Thanks!

Comments
1
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-12 07:21:33

*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?

Mark Taylor (marktayl@microsoft.com)
2021-10-12 15:17:02

We are working on the hackathon and have a couple of questions about generating lineage information. @Willy Lulciuc would you have time to help answer a couple of questions?

• Is there a way to generate OpenLineage output that contains a mapping between input and output fields? • In Azure Databricks sources often map to ADB mount points. We are looking for a way to translate this into source metadata in the OL output. Is there some configuration that would make this possible, or any other suggestions?

👋 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2021-10-12 15:50:20

*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields? OpenLineage defines discrete classes for both OpenLineage.InputDataset and OpenLineage.OutputDataset datasets. But, for clarification, are you asking:

  1. If a job reads / writes to the same dataset, how can OpenLineage track which fields were used in job’s logic as input and which fields were used to write back to the resulting output?
  2. Or, if a job reads / writes from two different dataset, how can OpenLineage track which input fields were used in the job’s logic for the resulting output dataset? (i.e. column-level lineage)
Willy Lulciuc (willy@datakin.com)
2021-10-12 15:56:18

*Thread Reply:* > In Azure Databricks sources often map to ADB mount points.  We are looking for a way to translate this into source metadata in the OL output.  Is there some configuration that would make this possible, or any other suggestions?
I would look into our OutputDatasetVisitors class (as a starting point), which extracts metadata from the spark logical plan to construct a mapping between a logical plan and one or more OpenLineage.Dataset objects for the spark job. But, I think @Michael Collado will have a more detailed suggestion / approach to what you're asking

Michael Collado (collado.mike@gmail.com)
2021-10-12 15:59:41

*Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)

Mark Taylor (marktayl@microsoft.com)
2021-10-12 16:59:38

*Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.

docs.microsoft.com
Michael Collado (collado.mike@gmail.com)
2021-10-12 17:01:23

*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page? df = spark.read.text("dbfs:/mymount/my_file.txt")

Mark Taylor (marktayl@microsoft.com)
2021-10-12 17:04:52

*Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element, with one set represented for output and another for input. I would need the mapping of in and out columns to generate column-level lineage, so I'm wondering if it's possible to get or am I just missing it somewhere? Thanks for your help!

Willy Lulciuc (willy@datakin.com)
2021-10-12 17:26:35

*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here's a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!

Labels
proposal
Comments
3
Will Johnson (will@willj.co)
2021-10-12 17:41:41

*Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path. The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.

Julien Le Dem (julien@apache.org)
2021-10-12 17:49:37
Julien Le Dem (julien@apache.org)
2021-10-12 17:51:24

*Thread Reply:* This example adds facets to the run, but you can also add them to the job
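For anyone following along, a minimal sketch of what a custom facet looks like as a raw OpenLineage event posted over HTTP - the namespace, job name, facet name, and producer URL below are all made up:
```
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {
        "namespace": "my-namespace",
        "name": "my-job",
        # a custom facet is just a JSON object carrying _producer and _schemaURL
        "facets": {
            "myCustomFacet": {
                "_producer": "https://example.com/my-producer",
                "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                "myKey": "myValue",
            }
        },
    },
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/my-producer",
}

# Marquez exposes the OpenLineage endpoint at /api/v1/lineage
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```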

Michael Collado (collado.mike@gmail.com)
2021-10-12 17:52:46

*Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done

Michael Collado (collado.mike@gmail.com)
2021-10-12 17:54:07

*Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want

Will Johnson (will@willj.co)
2021-10-12 18:26:44

*Thread Reply:* Thank you guys!!

🙌 Willy Lulciuc
Will Johnson (will@willj.co)
2021-10-12 20:42:20

Question on the Spark integration and its SPARK_CONF_URL_KEY configuration variable.

https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rk/src/main/java/io/openlineage/spark/agent/ArgumentParser.java

It looks like I can pass in any url but I'm not sure if I can pass in query parameters along with that URL. For example, if I had https://localhost/myendpoint?secret_code=123 I THINK that is used for the endpoint and it does not append /lineage to the end of the url. Is that a fair assessment of what happens when the url is provided?

Thank you for any guidance!

Julien Le Dem (julien@apache.org)
2021-10-12 21:46:12

*Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

Julien Le Dem (julien@apache.org)
2021-10-12 21:47:36

*Thread Reply:*
```
SparkSession.builder()
    .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.+")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "https://localhost")
    .config("spark.openlineage.apiKey", "your api key")
    .config("spark.openlineage.namespace", "<NAMESPACE_NAME>") // Replace with the name of your Spark cluster.
    .getOrCreate()
```

Julien Le Dem (julien@apache.org)
2021-10-12 21:48:57

*Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java

Julien Le Dem (julien@apache.org)
2021-10-12 21:49:37

*Thread Reply:* the apiKey setting is sent in an “Authorization” header

Julien Le Dem (julien@apache.org)
2021-10-12 21:49:55

*Thread Reply:* “Bearer $KEY”

Will Johnson (will@willj.co)
2021-10-12 22:54:22

*Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.

I think we might need to add in a url parameter configuration for our hackathon. We're using a bit of serverless code to shuttle open lineage events to a queue so that another job and/or serverless application can read that queue at its leisure.

Using the apiKey that feeds into the Authorization header as a Bearer token is great and would suffice but for our services we use OAuth tokens that would expire after two hours AND most of our customers wouldn't want to generate an access token themselves and feed it to Spark. ☹️

Would you guys entertain a proposal to support a spark.openlineage.urlParams configuration variable that lets you add url parameters to the derived lineage url?

Thank you for the detailed replies and deep links!

Julien Le Dem (julien@apache.org)
2021-10-13 10:46:22

*Thread Reply:* Yes, please open an issue detailing the use case.

Will Johnson (will@willj.co)
2021-10-13 13:02:06

Quick question: is it expected, when using Spark SQL and the Spark integration for Spark 3, that we receive an INPUT but no OUTPUTS when doing a CREATE TABLE ... AS SELECT ...?

I'm reading from a Spark SQL table (underlying CSV) and then writing it to a DELTA lake table.

I get a COMPLETE event type with an INPUT but no OUTPUT and then I get an exception for the AsyncEvent Queue but I'm guessing it's unrelated 😅

21/10/13 15:38:15 INFO OpenLineageContext: Lineage completed successfully: ResponseMessage(responseCode=200, body=null, error=null) {"eventType":"COMPLETE","eventTime":"2021-10-13T15:38:15.878Z","run":{"runId":"2cfe52b3-e08f-4888-8813-ffcdd2b27c89","facets":{"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":{"@class":"org.apache.spark.sql.catalyst.plans.logical.Project","traceEnabled":false,"streaming":false,"cacheId":null,"canonicalizedPlan":false},"inputAttributes":[{"name":"id","type":"long","metadata":{}}],"outputAttributes":[{"name":"id","type":"long","metadata":{}},{"name":"action_date","type":"date","metadata":{}}]},"inputs":[{"description":{"@class":"org.apache.spark.sql.catalyst.plans.logical.Range","streaming":false,"traceEnabled":false,"cacheId":null,"canonicalizedPlan":false},"inputAttributes":[],"outputAttributes":[{"name":"id","type":"long","metadata":{}}]}]},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"id","dataType":"long","nullable":false,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":111,"jvmId":"4bdfd808-97d5-455f-ad6a-a3b29855e85b"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.Alias","num-children":1,"child":0,"name":"action_date","exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":113,"jvmId":"4bdfd808_97d5_455f_ad6a_a3b29855e85b"},"qualifier":[],"explicitMetadata":{},"nonInheritableMetadataKeys":"[__dataset_id, __col_position]"},{"class":"org.apache.spark.sql.catalyst.expressions.CurrentDate","num_children":0,"timeZoneId":"Etc/UTC"}]],"child":0},{"class":"org.apache.spark.sql.catalyst.plans.logical.Range","num-children":0,"start":0,"end":5,"step":1,"numSlices":8,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"id","dataType":"long","nullable":false,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":111,"jvmId":"4bdfd808-97d5-455f-ad6a-a3b29855e85b"},"qualifier":[]}]],"isStreaming":false}]}}},"job":{"namespace":"sparknamespace","name":"databricks_shell.project"},"inputs":[],"outputs":[],"producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.2.3-SNAPSHOT/integration/spark>","schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent>"} 21/10/13 15:38:16 INFO FileSizeAutoTuner: File size tuning result: {"tuningType":"autoTuned","tunedConfs":{"spark.databricks.delta.optimize.minFileSize":"268435456","spark.databricks.delta.optimize.maxFileSize":"268435456"}} 21/10/13 15:38:16 INFO FileFormatWriter: Write Job e062f36c-8b9d-4252-8db9-73b58bd67b15 committed. 21/10/13 15:38:16 INFO FileFormatWriter: Finished processing stats for write job e062f36c-8b9d-4252-8db9-73b58bd67b15. 
21/10/13 15:38:18 INFO CodeGenerator: Code generated in 253.294028 ms
21/10/13 15:38:18 INFO SparkContext: Starting job: collect at DataSkippingReader.scala:430
21/10/13 15:38:18 INFO DAGScheduler: Job 1 finished: collect at DataSkippingReader.scala:430, took 0.000333 s
21/10/13 15:38:18 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:167)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:39)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1547)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

Julien Le Dem (julien@apache.org)
2021-10-13 17:54:22

*Thread Reply:* This is because this specific action is not covered yet. You can see the "spark_unknown" facet is describing things that are not understood yet:
```
"run": {
  ...
  "facets": {
    "spark_unknown": {
      ...
      "output": {
        "description": {
          "@class": "org.apache.spark.sql.catalyst.plans.logical.Project",
          "traceEnabled": false,
          "streaming": false,
          "cacheId": null,
          "canonicalizedPlan": false
        },
```

Julien Le Dem (julien@apache.org)
2021-10-13 17:54:43

*Thread Reply:* I think this is part of the Spark 3 gap

Julien Le Dem (julien@apache.org)
2021-10-13 17:55:46

*Thread Reply:* an unknown output will cause missing output lineage

Julien Le Dem (julien@apache.org)
2021-10-13 18:05:57

*Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java

🙌 Will Johnson
Will Johnson (will@willj.co)
2021-10-13 22:49:08

*Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!

Julien Le Dem (julien@apache.org)
2021-10-13 20:09:17

Following up on the meeting this morning, I have created an issue to formalize a design doc review process: https://github.com/OpenLineage/OpenLineage/issues/336 If that sounds good I’ll create the first doc to describe this as a PR. (how meta!)

Labels
proposal
Julien Le Dem (julien@apache.org)
2021-10-13 20:13:02

*Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-18 10:24:25

We're discussing creating a Transport abstraction for OpenLineage clients, which would allow us to create a better experience for people who expect to be able to emit their events using something other than the http interface. Please tell us what you think of the proposed mechanism - encouraging emojis are helpful too 😉 https://github.com/OpenLineage/OpenLineage/pull/344

Julien Le Dem (julien@apache.org)
2021-10-18 20:57:04

OpenLineage release 0.3 is coming. Please chime in if there's any blocker that should go in the release: https://github.com/OpenLineage/OpenLineage/projects/4

❤️ Willy Lulciuc
Carlos Quintas (cdquintas@gmail.com)
2021-10-19 06:36:05

👋 Hi everyone!

👋 Ross Turk, Willy Lulciuc, Michael Collado
Carlos Quintas (cdquintas@gmail.com)
2021-10-22 05:38:14

openlineage with dbt and Trino - is there any forecast?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-22 05:44:17

*Thread Reply:* Maybe you want to contribute it? It's not that hard, mostly testing, and figuring out what would be the naming of openlineage namespace for Trino, and how some additional statistics work.

For example, recently we had added support for Redshift by community member @ale

https://github.com/OpenLineage/OpenLineage/pull/328

Comments
1
Carlos Quintas (cdquintas@gmail.com)
2021-10-22 05:42:52

Done.
```
PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Traceback (most recent call last):
  File "/home/labuser/.local/bin/dbt-ol", line 61, in <module>
    main()
  File "/home/labuser/.local/bin/dbt-ol", line 54, in main
    events = processor.parse().events()
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 98, in parse
    self.extract_dataset_namespace(profile)
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 377, in extract_dataset_namespace
    self.dataset_namespace = self.extract_namespace(profile)
  File "/home/labuser/.local/lib/python3.8/site-packages/openlineage/common/provider/dbt.py", line 391, in extract_namespace
    raise NotImplementedError(
NotImplementedError: Only 'snowflake' and 'bigquery' adapters are supported right now. Passed trino
```

Michael Collado (collado.mike@gmail.com)
2021-10-22 12:41:08

Hey folks, we've released OpenLineage 0.3.1. There are quite a few changes, including doc improvements, Redshift support in dbt, bugfixes, and a new server-side client code base, but the real highlights are:

  1. Official Spark 3 support - this is still a work in progress (the whole Spark integration is), but the big deal is we've split the source tree to support both Spark 2 and Spark 3 specific plan visitors. This will enable us to work with the Spark 3 API explicitly and to add support for those interfaces and classes that didn't exist in Spark 2. We're also running all integration tests against both Spark 2.4.7 and Spark 3.1.0.
  2. Airflow 2 support - also a work in progress, but we have a new LineageBackend implementation that allows us to begin tracking lineage for successful Airflow 2 DAGs. We're working to support failure notifications so we can also trace failed jobs. The LineageBackend can also be enabled in Airflow 1.10.X to improve the reporting of task completion times.

Check the READMEs for more details and to get started with the new features. Thanks to @Maciej Obuchowski, @Oleksandr Dvornik, @ale, and @Willy Lulciuc for their contributions. See the full changelog
🎉 Willy Lulciuc, Maciej Obuchowski, Minkyu Park, Ross Turk, Peter Hicks, RamanD, Ry Walker
🙌 Willy Lulciuc, Maciej Obuchowski, Minkyu Park, Will Johnson, Ross Turk, Peter Hicks, Ry Walker
🔥 Ry Walker
David Virgil (david.virgil.naranjo@googlemail.com)
2021-10-28 07:27:12

Hello community. I am starting to use Marquez. I tried to connect dbt with Marquez, but the Spark adapter is not yet available.

Are you planning to implement this dbt Spark adapter in the next OpenLineage versions?

```
NotImplementedError: Only 'snowflake', 'bigquery', and 'redshift' adapters are supported right now. Passed spark
```

In my company we are also starting to use the Athena dbt adapter. Are you planning to implement this integration? Thanks a lot community

Julien Le Dem (julien@apache.org)
2021-10-28 12:20:27

*Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?

David Virgil (david.virgil.naranjo@googlemail.com)
2021-10-28 17:37:53

*Thread Reply:* I would like to, Julien, but I'm not sure how to do it. Could you guide me on how to start? Or show me another integration?

Matthew Mullins (mmullins@aginity.com)
2021-10-31 07:57:55

*Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328

Comments
1
David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 12:01:41

*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt Spark integration

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-28 09:31:01

Hey folks, quick question, are we able to run dbt-ol without providing OPENLINEAGE_URL? I find it quite limiting that I need to have a service set up in order to emit/generate OL events/messages. Is there a way to just output them to the console?

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-28 10:05:09

*Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286

Did you think about this?

Comments
1
Julien Le Dem (julien@apache.org)
2021-10-28 12:19:27

*Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-28 13:56:42

*Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344

Comments
1
👀 Mario Measic
Mario Measic (mario.measic.gavran@gmail.com)
2021-10-28 15:29:50

*Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-28 15:46:45

*Thread Reply:* Also, dbt build is not working, which is kind of the biggest feature of version 0.21.0. I will try testing the code with modifications to https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141

I guess there's a reason for it that I didn't see, since you support v3 of the manifest.

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-29 03:45:27

*Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate has been run before dbt-ol run?

Mario Measic (mario.measic.gavran@gmail.com)
2021-10-29 04:26:22

*Thread Reply:* Tried with dbt versions 0.20.2 and 0.21.0, openlineage-dbt==0.3.1

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-10-29 10:39:10

*Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build might be a little larger task.

Julien Le Dem (julien@apache.org)
2021-11-01 19:12:01

*Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376

👀 Mario Measic
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 05:48:06

*Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383

Julien Le Dem (julien@apache.org)
2021-10-28 12:27:17

I’m looking for feedback on my proposal to improve the proposal process ! https://github.com/OpenLineage/OpenLineage/issues/336

Assignees
wslulciuc, mobuchowski, mandy-chessell, collado-mike
Labels
proposal
Brad (bradley.mcelroy@live.com)
2021-10-28 18:49:12

Hey guys - just an update on my prefect PR (https://github.com/OpenLineage/OpenLineage/pull/293) - there's a little spiel on the ticket, but I've closed that PR in favour of opening a new one. Prefect have just released a 2.0a technical preview, which they would like to make stable near the start of next year. I think it makes sense to target this release, and I've had one of the prefect team reach out; they're keen to get some sort of lineage implemented in prefect.

👍 Kevin Kho, Maciej Obuchowski, Willy Lulciuc, Michael Collado, Julien Le Dem, Thomas Fredriksen
Brad (bradley.mcelroy@live.com)
2021-10-28 18:51:10

*Thread Reply:* If anyone has any questions or comments - happy to discuss here

Brad (bradley.mcelroy@live.com)
2021-10-28 18:51:15

*Thread Reply:* @davzucky

Willy Lulciuc (willy@datakin.com)
2021-10-28 23:01:29

*Thread Reply:* Thanks for updating the community, Brad!

davzucky (davzucky@hotmail.com)
2021-10-28 23:47:02

*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2

Kevin Kho (kdykho@gmail.com)
2021-10-28 18:53:23

Hello, joining here from Prefect. Because of community requests from users like Brad above, we are looking to implement lineage for Prefect this quarter. Good to meet you all!

❤️ Minkyu Park, Faouzi, John Thomas, Maciej Obuchowski, Kevin Mellott, Thomas Fredriksen
👍 Minkyu Park, Faouzi, John Thomas
🙌 Michael Collado, Faouzi, John Thomas
Willy Lulciuc (willy@datakin.com)
2021-10-28 18:54:56

*Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀

👍 Kevin Kho, Maciej Obuchowski, Peter Hicks, Faouzi
David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 12:03:14

Hello,

I am integrating OpenLineage with Airflow 2.2.0.

Do you plan to support Airflow's manual inlets and outlets in the future?

From the documentation I can see that this is not currently possible: the OpenLineageBackend does not take into account manually configured inlets and outlets. Thanks

John Thomas (john@datakin.com)
2021-11-01 12:23:11

*Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!

If you can give me a little more detail on what your system infrastructure is like, it’ll help us set priority and design

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 13:57:34

*Thread Reply:* So, the basic architecture of a datalake. We are using airflow to trigger jobs. Every job is a pipeline that runs a spark job (in our case it spins up an EMR cluster). So the idea would be to define lineage in the dags with inlets and outlets, based on the airflow lineage:

https://airflow.apache.org/docs/apache-airflow/stable/lineage.html

I think you need to be able to include these inlets and outlets in the picture of openlineage

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-01 14:01:24

*Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 14:05:02

*Thread Reply:* because there are some other jobs that are not spark; some jobs run in dbt, other jobs run in redshift @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-01 14:08:58

*Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor from airflow integration should cover Redshift if you're using it from PostgresOperator 🙂

It's definitely interesting use case - you'd be using most of the existing integrations we have.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 15:04:44

*Thread Reply:* @Maciej Obuchowski Do I need to define any extractor in the airflow startup?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-05 23:48:21

*Thread Reply:* I am using Redshift with PostgresOperator and it is returning…

```
[2021-11-06 03:43:06,541] {__init__.py:92} ERROR - Failed to extract metadata 'NoneType' object has no attribute 'host' task_type=PostgresOperator airflow_dag_id=counter task_id=inc airflow_run_id=scheduled__2021-11-06T03:42:00+00:00
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/lineage_backend/__init__.py", line 83, in _extract_metadata
    task_metadata = self._extract(extractor, task_instance)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/lineage_backend/__init__.py", line 104, in _extract
    task_metadata = extractor.extract_on_complete(task_instance)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/base.py", line 61, in extract_on_complete
    return self.extract()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/postgres_extractor.py", line 65, in extract
    authority=self._get_authority(),
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/postgres_extractor.py", line 120, in _get_authority
    if self.conn.host and self.conn.port:
AttributeError: 'NoneType' object has no attribute 'host'
```

I can’t see this raised as an issue.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 13:57:54

Hello, I am trying to integrate Airflow with openlineage.

It is not working for me.

What I tried:

  1. Adding openlineage-airflow to requirements.txt
  2. Adding - AIRFLOW__LINEAGE__BACKEND=openlineage.airflow.backend.OpenLineageBackend to the environment

```
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 47, in command
    func = import_string(import_path)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/module_loading.py", line 32, in import_string
    module = import_module(module_path)
  File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/db_command.py", line 24, in <module>
    from airflow.utils import cli as cli_utils, db
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 26, in <module>
    from airflow.jobs.base_job import BaseJob  # noqa: F401
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/__init__.py", line 19, in <module>
    import airflow.jobs.backfill_job
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 29, in <module>
    from airflow import models
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/__init__.py", line 20, in <module>
    from airflow.models.baseoperator import BaseOperator, BaseOperatorLink
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 196, in <module>
    class BaseOperator(Operator, LoggingMixin, TaskMixin, metaclass=BaseOperatorMeta):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 941, in BaseOperator
    def post_execute(self, context: Any, result: Any = None):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/lineage/__init__.py", line 103, in apply_lineage
    _backend = get_backend()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/lineage/__init__.py", line 52, in get_backend
    clazz = conf.getimport("lineage", "backend", fallback=None)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py", line 469, in getimport
    raise AirflowConfigException(
airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "backend" key in "lineage" section. Current value: "openlineage.airflow.backend.OpenLineageBackend".
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-01 14:06:12

*Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND
2. Please tell us where you've seen openlineage.airflow.backend.OpenLineageBackend, so we can fix the documentation 🙂
Julien Le Dem (julien@apache.org)
2021-11-01 19:07:21

*Thread Reply:* https://pypi.org/project/openlineage-airflow/

PyPI
Julien Le Dem (julien@apache.org)
2021-11-01 19:08:03

*Thread Reply:* (I googled it and found that page that seems to have an outdated doc)

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 02:38:59

*Thread Reply:* @Maciej Obuchowski @Julien Le Dem that's the page I followed. Please revise the documentation, guys, as it is very important

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 04:34:14

*Thread Reply:* It should just copy the actual README

John Thomas (john@datakin.com)
2021-11-03 16:30:00

*Thread Reply:* PyPI is using the README at the time of release 0.3.1, rather than the current README, which is for 0.4.0. If we send the new release to PyPI, it should also update the README

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 15:09:54

Related to the Airflow integration: is it required to install openlineage-airflow and set up the environment variables in both the scheduler and the webserver, or just in the scheduler?

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 15:19:18

*Thread Reply:* I set it up in the scheduler and it starts to log data to marquez. But it fails with this error:

Traceback (most recent call last): File "/home/airflow/.local/lib/python3.8/site-packages/openlineage/client/client.py", line 49, in __init__ raise ValueError(f"Need valid url for OpenLineageClient, passed {url}") ValueError: Need valid url for OpenLineageClient, passed "<http://marquez-internal-eks.eu-west-1.dev.hbi.systems>"

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-01 15:19:26

*Thread Reply:* why is it not a valid URL?

John Thomas (john@datakin.com)
2021-11-01 18:39:58

*Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 05:14:30

*Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error

John Thomas (john@datakin.com)
2021-11-02 10:35:28

*Thread Reply:* aaaah, gotcha, good catch!

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 05:15:52

Hello, I am receiving this error today when I deployed openlineage in the development environment (not using docker-compose locally).

I am running with KubernetesExecutor

airflow.exceptions.AirflowConfigException: The object could not be loaded. Please check "backend" key in "lineage" section. Current value: "openlineage.lineage_backend.OpenLineageBackend".

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 05:18:18

*Thread Reply:* Are you sure that openlineage-airflow is present in the container?

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 05:23:09

So in this case in my template I am adding:

```
env:
  ADDITIONAL_PYTHON_DEPS: "openpyxl==3.0.3 smart_open==2.0.0 apache-airflow-providers-http apache-airflow-providers-cncf-kubernetes apache-airflow-providers-amazon openlineage-airflow"
  OPENLINEAGE_URL: https://marquez-internal-eks.eu-west-1.dev.hbi.systems
  OPENLINEAGE_NAMESPACE: dns_airflow
  AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_URL: https://marquez-internal-eks.eu-west-1.dev.hbi.systems
  AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_NAMESPACE: dns_airflow

configmap:
  mountPath: /var/airflow/config  # mount path of the configmap
  data:
    airflow.cfg: |
      [lineage]
      backend = openlineage.lineage_backend.OpenLineageBackend

pod_template_file.yaml: |
  containers:
    - args: []
      command: []
      env:
        - name: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_URL
          value: https://marquez-internal-eks.eu-west-1.dev.hbi.systems
        - name: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__OPENLINEAGE_NAMESPACE
          value: dns_airflow
        - name: AIRFLOW__LINEAGE__BACKEND
          value: openlineage.lineage_backend.OpenLineageBackend
```
David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 05:23:31

I am installing openlineage in the ADDITIONAL_PYTHON_DEPS

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 05:25:43

*Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend?

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 06:34:11

*Thread Reply:* I am checking this by accessing the Kubernetes pod

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 06:34:54

I have a question related to Airflow and OpenLineage:

I have a dag that contains 2 tasks:

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 06:35:34

I see that every task is displayed as a different job. I was expecting to see one job per dag.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:29:43

Is this the expected behaviour??

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 07:34:47

*Thread Reply:* Yes

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 07:35:53

*Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:46:02

*Thread Reply:* I do not see any benefit in just having some airflow task metadata. I do not see the relationship between tasks. Every task is a job. When I started working on my company's integration with openlineage, I thought that openlineage would give me relationships between tasks or datasets, and the only thing I see is some metadata about the history of airflow runs that is already provided by airflow.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:46:20

*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:46:25

*Thread Reply:* at this early stage

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 07:50:10

*Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:55:50

*Thread Reply:* We are not using any of those operators: bigquery, postgres or snowflake.

And what does the GreatExpectations extractor do?

It would be good if there were one extractor that relies on the inlets and outlets you can define in any Airflow task, so that could be the general way to make relationships between datasets

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 07:56:30

*Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 08:07:06

*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task I think this is good idea. Overall, OpenLineage strongly focuses on automatic metadata collection. However, using them would be a nice fallback for not-covered-yet cases.

> And that the same dag graph can be seen in marquez, and not one job per task. This currently depends on dataset hierarchy. If you're not using any of the covered extractors, then Marquez can't build dataset graph like in the demo: https://raw.githubusercontent.com/MarquezProject/marquez/main/web/docs/demo.gif

With the job hierarchy ticket, probably some graph could be generated using just the job data though.
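For illustration, such a fallback could look roughly like this on the DAG author's side, declaring inlets/outlets with Airflow's built-in lineage entities (a hedged sketch of the proposal under discussion, not an existing OpenLineage feature; my_transform, dag, and the S3 URLs are placeholders):

```python
from airflow.lineage.entities import File
from airflow.operators.python import PythonOperator

# Manually declared lineage on an arbitrary task; the proposed fallback would
# read these inlets/outlets when no extractor covers the operator.
transform = PythonOperator(
    task_id="transform",
    python_callable=my_transform,                     # placeholder user function
    inlets=[File(url="s3://my-bucket/raw/")],         # placeholder input
    outlets=[File(url="s3://my-bucket/processed/")],  # placeholder output
    dag=dag,                                          # assumes a surrounding DAG
)
```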

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 08:09:55

*Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 08:28:29

*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use airflow and spark, but I see that it does not yet have the features we would like.

At the moment the same info we have in marquez related to the tasks is available in the airflow UI or via the airflow API.

The game changer for us would be if it could give us features/metadata that we cannot query directly from airflow. That's why, if the airflow inlets/outlets could be used, it really would make much more sense for us to adopt it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 09:33:31

*Thread Reply:* > how many people are working full time in this library? On Airflow integration or on OpenLineage overall? 🙂

> The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow. I think there are three options there:

  1. Contribute relevant extractors for Airflow operators that you use
  2. Use those extractors as custom extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#custom-extractors
  3. Create that manual fallback mechanism with Airflow inlets/outlets: https://github.com/OpenLineage/OpenLineage/issues/384
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-02 09:35:10

*Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-02 09:19:14

@Maciej Obuchowski I can even contribute or help with my ideas (about what I consider lineage should be from a client's perspective)

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-03 07:58:56

@Maciej Obuchowski I was able to get Airflow in Kubernetes working, pointing to Marquez using the openlineage library. I found a few problems that would be good to comment on.

I see a warning [2021-11-03 11:47:04,309] {great_expectations_extractor.py:27} WARNING - Did not find great_expectations_provider library or failed to import it but I couldn't find any information about GreatExpectationsExtractor. Could you tell me what this extractor is about?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-03 08:00:34

*Thread Reply:* It should only affect you if you're using https://greatexpectations.io/

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-03 15:57:02

*Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:

WARNING:/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/great_expectations_extractor.py:Did not find great_expectations_provider library or failed to import it

I am not using great expectations in the DAG.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-03 08:00:52

I see a few priorities for the Airflow integration:

  1. A direct 1-1 relationship between DAG and job. At the moment every task is a different job in Marquez, which I consider wrong.
  2. Airflow inlets/outlets integration with Marquez.

When do you think you can have this? If you need any help I can happily contribute, but I would need some help
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-03 08:08:21

*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737 which will affect the visual layer.

Regarding 2), I think we need to get along with Airflow maintainers regarding the long term mechanism on which OL will work: https://github.com/apache/airflow/issues/17984

I think using inlets/outlets as a fallback mechanism when we're not doing automatic metadata extraction is a good idea, but we don't know if a hypothetical future mechanism will have access to these. It's hard to commit to a mechanism which might disappear soon.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-03 08:13:28

Another option is that I build my own extractor. Do you have any example of how to create a custom extractor? How can I apply that custom extractor to specific operators? Is there a way to link an extractor with an operator, so that at runtime airflow knows which extractor to run?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-03 08:19:00

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#custom-extractors

I think you can base your code on any existing extractor, like PostgresExtractor: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/postgres_extractor.py#L53

Custom extractors work just like built-in ones; you just need to add a bit of mapping between operator and extractor, like OPENLINEAGE_EXTRACTOR_PostgresOperator=openlineage.airflow.extractors.postgres_extractor.PostgresExtractor

👍 Francis McGregor-Macdonald
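For reference, the shape of a custom extractor is roughly the following (a hedged sketch modeled on the PostgresExtractor linked above; the exact base-class API differs between openlineage-airflow versions, so treat the module path and class names as assumptions):

```python
from typing import Optional

# Assumed module path; check the extractors package of your installed version.
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class MyOperatorExtractor(BaseExtractor):
    """Extractor for a hypothetical MyOperator."""

    def extract(self) -> Optional[TaskMetadata]:
        # Map whatever the operator already knows (connection, table, query)
        # to OpenLineage inputs/outputs here.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # fill with OpenLineage Dataset objects
            outputs=[],  # fill with OpenLineage Dataset objects
        )
```

It would then be registered with the env var pattern above, e.g. OPENLINEAGE_EXTRACTOR_MyOperator=mypackage.extractors.MyOperatorExtractor.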
David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-03 08:35:59

*Thread Reply:* Thank you very much @Maciej Obuchowski

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-03 08:36:52

Last question of the morning: running one task that failed, I could see that no information appeared in Marquez. Is this expected to happen? I would like to see in Marquez the whole history of runs, successful and unsuccessful.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-03 08:41:14

*Thread Reply:* It worked like that in Airflow 1.10.

This is an unfortunate limitation of LineageBackend API that we're using for Airflow 2. We're trying to work out solution for this with Airflow maintainers: https://github.com/apache/airflow/issues/17984

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:41:38

Hello openlineage community.

Yesterday I tried the integration with spark.

The result was not satisfactory. This is what I did:

  1. Add openlineage-spark dependency
  2. Add these lines:
```
.config("spark.jars.packages", "io.openlineage:openlineage-spark:0.3.1")
.config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
.config("spark.openlineage.url", "https://marquez-internal-eks.eu-west-1.dev.hbi.systems/api/v1/namespaces/spark_integration/")
```
This job was doing a spark.read from 2 different JSON locations and a spark write to 5 different parquet locations in S3. The job finished successfully and the result in Marquez is:
David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:43:40

It created 3 namespaces. One was the one that I pointed to in the spark config property. The other 2 are the bucket we are writing to () and the bucket we are reading from ()

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:44:00

If I enter the bucket namespaces I see nothing inside

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:48:35

If I enter one of the weird jobs generated, I can see this:

Julien Le Dem (julien@apache.org)
2021-11-04 18:47:41

*Thread Reply:* This job with no output is a symptom of the output not being understood. you should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-05 04:36:30

*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:52:24

And I am seeing this as well

If I check the logs of marquez-web and marquez I can't see any error there

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 03:54:38

When I try to open the job fulfilments.execute_insert_into_hadoop_fs_relation_command I see this window:

David Virgil (david.virgil.naranjo@googlemail.com)
2021-11-04 04:06:29

The page froze and no link from the menu works. Apart from that I see that there are no messages in the logs

Julien Le Dem (julien@apache.org)
2021-11-04 18:49:31

*Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)

Alessandro Rizzo (l.alessandrorizzo@gmail.com)
2021-11-04 17:22:29

Hi #general, I'm a data engineer for a UK-based insurtech (part of one of the biggest UK retail insurers). We run a series of tech meetups and we'd love to have someone from the OpenLineage project give us a demo of the tool. Would anyone be interested? (DM if so 🙂)

👍 Ross Turk
Taleb Zeghmi (talebz@zillowgroup.com)
2021-11-04 21:30:24

Hi! Is there an example of tracking lineage when using Pandas to read/write and transform data?

John Thomas (john@datakin.com)
2021-11-04 21:35:16

*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code: https://openlineage.io/docs/openapi/
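For example, a minimal sketch with the openlineage-python client, wrapping a Pandas read/transform/write in START and COMPLETE events (the Marquez URL, namespaces, and dataset names are placeholders):

```python
from datetime import datetime, timezone
from uuid import uuid4

import pandas as pd
from openlineage.client.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")   # placeholder Marquez URL
job = Job(namespace="pandas_jobs", name="clean_orders")   # placeholder names
run = Run(runId=str(uuid4()))
producer = "https://example.com/my-pandas-script"         # placeholder producer URI

def emit(state, inputs=(), outputs=()):
    # Wrap the boilerplate of building a runEvent for this job/run pair.
    client.emit(RunEvent(
        eventType=state,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run, job=job, producer=producer,
        inputs=list(inputs), outputs=list(outputs),
    ))

emit(RunState.START)
df = pd.read_csv("orders.csv")                    # the actual Pandas work
df[df["amount"] > 0].to_parquet("orders.parquet")
emit(RunState.COMPLETE,
     inputs=[Dataset(namespace="file", name="orders.csv")],
     outputs=[Dataset(namespace="file", name="orders.parquet")])
```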

Taleb Zeghmi (talebz@zillowgroup.com)
2021-11-04 21:38:25

*Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png

John Thomas (john@datakin.com)
2021-11-04 21:54:51

*Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet

John Thomas (john@datakin.com)
2021-11-04 21:56:26

*Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more

John Thomas (john@datakin.com)
2021-11-04 23:04:47

*Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).

We have this issue open in OpenLineage that you should go +1 to help with our planning 🙂

Taleb Zeghmi (talebz@zillowgroup.com)
2021-11-05 15:08:09

*Thread Reply:* interesting... what if it were instead on all the read_* / to_* functions?

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-05 12:00:57

Hi! I am working alongside David at integrating OpenLineage into our data pipelines. I have a question about Marquez's and OpenLineage's divergent APIs. That is to say, these 2 APIs differ: https://openlineage.io/docs/openapi/ https://marquezproject.github.io/marquez/openapi.html This makes sense since they are at different layers of abstraction, but Marquez requires a few things that are absent from OpenLineage's API, for example the type in a data source, and the distinctions between physicalName and sourceName in Datasets. Is that intentional? And can these be set using the OpenLineage API as some additional facets or keys? I noticed that the DatasourceDatasetFacet has a map of additionalProperties.

John Thomas (john@datakin.com)
2021-11-05 12:59:49

*Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.

If you POST an OpenLineage runEvent to the /lineage endpoint in Marquez, it’ll create any missing jobs or datasets that are relevant.

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-05 13:06:06

*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g. http://localhost:5000/api/v1/namespaces/testing_java/datasets/incremental_data as that currently returns the Marquez version of a dataset including default set fields for type and the above mentioned properties.

Michael Collado (collado.mike@gmail.com)
2021-11-05 17:01:55

*Thread Reply:* I believe the intention for type is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.

I don't actually know what happened to the datasource type field- maybe @Julien Le Dem can comment on whether that field was dropped intentionally or whether it was an oversight.

Julien Le Dem (julien@apache.org)
2021-11-05 18:18:06

*Thread Reply:* It looks like an oversight; currently Marquez hard codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438

Julien Le Dem (julien@apache.org)
2021-11-05 18:18:25

*Thread Reply:* https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438-L440

Julien Le Dem (julien@apache.org)
2021-11-05 18:20:25
Julien Le Dem (julien@apache.org)
2021-11-05 18:07:16

The next OpenLineage monthly meeting is this coming Wednesday at 9am PT. The tentative agenda is:
• OL Client use cases for Apache Iceberg [Ryan]
• OpenLineage and Azure Purview [Shrikanth]
• Proxy Backend and Egeria integration progress update (Issue #152) [Mandy]
• OpenLineage last release overview (0.3.1)
    ◦ Facet versioning
    ◦ Airflow 2 / Spark 3 support, dbt improvements
• OpenLineage 0.4 scope review
    ◦ Proxy Backend (Issue #152)
    ◦ Spark, Airflow, dbt improvements (documentation, coverage, ...)
    ◦ improvements to the OpenLineage model
• Open discussion

🙌 Maciej Obuchowski, Peter Hicks
Julien Le Dem (julien@apache.org)
2021-11-05 18:07:57

*Thread Reply:* If you want to add something please chime in this thread

Julien Le Dem (julien@apache.org)
2021-11-09 19:47:26

*Thread Reply:* The monthly meeting is happening tomorrow. The Purview team will present at the December meeting instead. See the full agenda here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting You are welcome to contribute

Julien Le Dem (julien@apache.org)
2021-11-10 11:10:17
Julien Le Dem (julien@apache.org)
2021-11-10 12:02:23

*Thread Reply:* It’s happening now ^

Julien Le Dem (julien@apache.org)
2021-11-16 19:57:23

*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) I have a few TODOs to follow up on tickets

Julien Le Dem (julien@apache.org)
2021-11-05 18:09:10

The next release of OpenLineage is being scoped: https://github.com/OpenLineage/OpenLineage/projects/6 Please chime in if you want to raise the priority of something or are planning to contribute

Anthony Ivanov (anthvt@gmail.com)
2021-11-09 08:18:11

Hi, I have been looking at OpenLineage for some time, and I really like it. It is a very simple specification that covers a lot of use-cases. You can create any provider or consumer in a very simple way, so that's pretty powerful. I have some questions about things that are not clear to me. I am not sure if this is the best place to ask; please refer me to another place if this is not appropriate.

Anthony Ivanov (anthvt@gmail.com)
2021-11-09 08:18:58

*Thread Reply:* How do you model a continuous process (not a batch process)? For example, a Flume or Spark job that does some real-time processing on data.

Maybe it's simply a "Job", but then what is a run?

Anthony Ivanov (anthvt@gmail.com)
2021-11-09 08:19:44

*Thread Reply:* How do you model consumers at the end? They can be reports, data applications, ML model deployments, APIs, GUIs consumed by end users.

Have you considered having some examples of different use cases like those?

Anthony Ivanov (anthvt@gmail.com)
2021-11-09 08:21:43

*Thread Reply:* By definition, a Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? For example, an important use-case for lineage is troubleshooting or error notifications (e.g. mark a report or job as temporarily in a bad state if an upstream data integration is broken). In order to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g. insert into output_1 select * from x,y group by a,b). But what are the cases where you'd want to see multiple outputs? You can have a single process produce multiple tables (as in the above example) but they'd always be separate queries. The actual inputs for each output would be different.

But having multiple outputs creates ambiguity: if x or y is broken but there are multiple outputs, I do not know which is really impacted.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-09 08:34:01

*Thread Reply:* > How do you model a continuous process (not a batch process)? For example, a Flume or Spark job that does some real-time processing on data.
> 
> Maybe it's simply a "Job", but then what is a run?
Every continuous process eventually has an end - for example, you can deploy a new version of your Flink pipeline. The new version would be the next Run for the same Job.

Moreover, OTHER event type is useful to update metadata like amount of processed records. In this Flink example, it could be emitted per checkpoint.

I think more attention for streaming use cases will be given soon.
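A hedged sketch of that idea with the openlineage-python client: one long-lived run per deployment, refreshed by OTHER events (the runId, URL, and names below are placeholders):

```python
from datetime import datetime, timezone

from openlineage.client.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")        # placeholder URL

# One long-lived run per pipeline deployment; deploying a new version of the
# Flink job would start the next run (a new runId) for the same Job.
run = Run(runId="5fbf1f96-29b5-4b8d-9b7e-0f4c3dce2d1e")        # placeholder runId
job = Job(namespace="streaming", name="clickstream_enricher")  # placeholder names

# Emitted per checkpoint; run facets (e.g. a record-count facet) could be
# attached to the Run to update metadata while the job keeps running.
client.emit(RunEvent(
    eventType=RunState.OTHER,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer="https://example.com/my-flink-reporter",          # placeholder
))
```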

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-09 08:43:09

*Thread Reply:* > How do you model consumers at the end? They can be reports, data applications, ML model deployments, APIs, GUIs consumed by end users.
Our reference implementation is a web application: https://marquezproject.github.io/marquez/

We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-09 08:45:47

*Thread Reply:* > By definition, a Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive?
I think this is too SQL-centric a view 🙂

Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.

If your application does not have multiple outputs, then I don't see how a specification allowing them would impact you.

Anthony Ivanov (anthvt@gmail.com)
2021-11-17 12:11:37

*Thread Reply:* > We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.
Yes, I think it would be great if we expand on potential usages, e.g. if the OpenLineage documentation (perhaps) had all kinds of examples for different use-cases or case studies: a financial or healthcare industry case study and how someone would do an integration with OpenLineage. It would be easier to understand the concepts and make sure things are modeled consistently.

Anthony Ivanov (anthvt@gmail.com)
2021-11-17 14:19:19

*Thread Reply:* > I think this is too SQL-centric a view 🙂
> 
> Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.
Thanks for answering @Maciej Obuchowski

Even in SQL you can have multiple outputs if you look at things at the transaction level. I was simply using it as an example.

Maybe another example will make clearer what I mean. Let's say we have these phases:

  1. Ingest from sources
  2. Process/transform
  3. export to somewhere (image/diagram) https://mermaid.ink/img/eyJjb2RlIjoiXG5ncmFwaCBMUlxuICAgIHN1YmdyYXBoIFNvdXJjZXNcbi[…]yIjpmYWxzZSwiYXV0b1N5bmMiOnRydWUsInVwZGF0ZURpYWdyYW0iOmZhbHNlfQ

Let’s look at those two cases:

  1. Within a single flink job and even task: Inventory & UI are both written to both S3, DB
  2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB

In 1. open lineage run event could look like {inputs: [ui, inventory], outputs: [s3, db] }

In 2. the user can either do the same as 1. (because data changes or copy-paste), which would be an error since both do not go to both. The likely accurate one would be {inputs: [ui], outputs: [s3] } {inputs: [ui], outputs: [db] }

If the specification standard required single output then

  1. would be modelled like run event {inputs: [ui, inventory], outputs: [s3] } ; {inputs: [ui, inventory], outputs: [db] } which is still correct if more verbose.
  2. could only be modelled this way: {inputs: [ui], outputs: [s3] }; {inputs: [ui], outputs: [db] }

The more restrictive specification seems to lower the chance for an error doesn’t it?

Also, if tools know the spec guarantees a single output, they'd be able to write tracing capabilities which are more precise, because the structure would allow for less ambiguity. Storage backends that implement the spec could perhaps also be written in more optimal ways. I have not looked into the accuracy of those hypotheses, though.

Those were the thoughts I was thinking when asking about that. I'd be curious if there's a document on the research of pros/cons and alternatives for the design of the current specifications

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-23 05:38:11

*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column-level lineage though - when we have it. An OL consumer could look at particular columns and derive which table contained the particular error.

> 2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB
Does that actually happen? I understand this in case of a job, but having a single operator write to two different systems seems like bad design. Wouldn't that leave the possibility of breaking exactly-once unless you're going full into two-phase commit?

Anthony Ivanov (anthvt@gmail.com)
2021-11-23 17:02:36

*Thread Reply:* > Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design
In a Spark or Flink job it is less likely, now that you mention it. But in a batch job (an airflow python or kubernetes operator, for example) users could do anything, and then they'd need lineage to figure out what is wrong, even if what they did is suboptimal 🙂

> I see what you're trying to model.
I am not trying to model something specific. I am trying to understand how OpenLineage would be used in different organisations/companies and use-cases.

> I think this could be solved by column level lineage though
Is there something specific planned? I could not find a ticket on GitHub. I thought you could use Dataset Facets; the Schema facet, for example, could be a subset of columns for a table …

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-24 04:55:41

*Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-10 13:21:23

How do you delete jobs/runs from Marquez/OpenLineage?

Willy Lulciuc (willy@datakin.com)
2021-11-10 16:17:10

*Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736

Willy Lulciuc (willy@datakin.com)
2021-11-10 16:17:37

*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal)

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 05:03:50

*Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?

Willy Lulciuc (willy@datakin.com)
2021-12-10 14:07:57

*Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 05:13:31

Am I understanding namespaces correctly? A job namespace is different from a dataset namespace. Job namespaces define a job environment, like Airflow, Spark, or some other system that executes jobs, while dataset namespaces define data locations, like an S3 bucket, a local file system, or a schema in a database?

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 05:14:39

*Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 05:46:06

*Thread Reply:* Yes!

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 06:17:01

*Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. As an additional question: When viewing a Dataset in Marquez will it cross the job namespace bounds? As in, will I see jobs from different job namespaces?

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 09:20:14

*Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: sql-runner-dev is the job namespace. I cannot see a graph of my job now. Is this something to do with the namespace names?

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 09:21:46

*Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 09:22:25

*Thread Reply:* Wait, it does work? Marquez was being temperamental

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 09:24:01

*Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 09:32:19

*Thread Reply:* Here's what I mean:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 09:59:24

*Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 10:00:29

*Thread Reply:* or, maybe not? It was released already.

Can you create an issue on GitHub with those helpful gifs? @Lyndon Armitage

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 10:58:25

*Thread Reply:* I think you are right Maciej

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 10:58:52

*Thread Reply:* Was that patched in 0.19.1?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 11:06:06

*Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1

Haven't tested this myself unfortunately.

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 11:07:07

*Thread Reply:* Perhaps not. It is url-encoding them: http://localhost:3000/lineage/dataset/jdbc%3Ah2%3Amem%3Asql_tests_like/HBMOFA.ORDDETP But the error seems to be in Marquez getting them.

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 11:09:23

*Thread Reply:* This is an example Lineage event JSON I am sending.

👀 Maciej Obuchowski
Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 11:11:29

*Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 11:22:00
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 11:36:01

*Thread Reply:* @Lyndon Armitage can you create an issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues

Lyndon Armitage (lyndon.armitage@gmail.com)
2021-11-11 11:52:36

*Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-11 11:54:41

*Thread Reply:* Yup, thanks!

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-15 13:00:39

I am looking at an AWS Glue Crawler lineage event. The glue crawler creates or updates a table schema, and I have a few questions on aligning to best practice.

  1. Is this a dataset create/update or…
  2. … a job with no dataset inputs and only dataset outputs or
  3. … is the path in S3 the input and the Glue table the output?
  4. Is there an example of the lineage event here I can clone or work from? Thanks.
🚀 Willy Lulciuc
John Thomas (john@datakin.com)
2021-11-15 13:04:19

*Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-15 13:35:00

*Thread Reply:* The table does not exist in the Glue catalog until …

A Glue crawler connects to one or more data stores (in this case S3), determines the data structures, and writes tables into the Data Catalog.

The data/objects are in S3; the Glue catalog is a metadata representation (HIVE) of it as a table.

John Thomas (john@datakin.com)
2021-11-15 13:41:14

*Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?

In that case I’d say that the glue Crawler is a job that outputs a dataset.

Michael Collado (collado.mike@gmail.com)
2021-11-15 15:03:36

*Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-15 15:23:23

*Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.

If the Glue table isn’t a distinct dataset from the S3 data, how does this compare to a view in a database on top of a table. Are they 2 datasets or just one?

Glue can discover data in remote databases too, in those cases does it make sense to have only the source dataset?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-15 15:24:39

*Thread Reply:* @John Thomas yes, its the metadata flow.

Michael Collado (collado.mike@gmail.com)
2021-11-15 15:24:52

*Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier

John Thomas (john@datakin.com)
2021-11-15 15:29:22

*Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-15 15:50:37

*Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.

Aligning with the Spark integration sounds like it might make sense then. Is there an example I could build from?

Michael Collado (collado.mike@gmail.com)
2021-11-15 17:56:17

*Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/

John Thomas (john@datakin.com)
2021-11-15 17:59:14

*Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!

✅ Diogo
Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-19 03:30:03

*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a requests.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can identify where the issue is? Is there a specific place to look in the regular logs?

Michael Collado (collado.mike@gmail.com)
2021-11-19 13:39:01

*Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
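From PySpark this can also be done through the JVM gateway, assuming log4j 1.x is the active logging backend (a hedged sketch; `spark` here is an existing SparkSession):

```python
# Raise the OpenLineage listener package to DEBUG via the log4j 1.x API.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("io.openlineage.spark.agent").setLevel(log4j.Level.DEBUG)
```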

✅ Francis McGregor-Macdonald
Michael Collado (collado.mike@gmail.com)
2021-11-19 19:44:06

Woo hoo! Initial Spark <-> Kafka support has been merged 🙂 https://github.com/OpenLineage/OpenLineage/pull/387

🎉 Willy Lulciuc, John Thomas, Peter Hicks, Maciej Obuchowski
🙌 Willy Lulciuc, John Thomas, Francis McGregor-Macdonald, Peter Hicks, Maciej Obuchowski
🚀 Willy Lulciuc, John Thomas, Peter Hicks, Francis McGregor-Macdonald, Maciej Obuchowski
Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 13:32:57

I am “successfully” exporting lineage to openlineage from AWS Glue using the listener. Only the source load is showing, not the transforms, or the sink

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 13:34:15

*Thread Reply:* Output event:

```
2021-11-22 08:12:15,513 INFO [spark-listener-group-shared] agent.OpenLineageContext (OpenLineageContext.java:emit(50)): Lineage completed successfully: ResponseMessage(responseCode=201, body=, error=null)
{
  "eventType": "COMPLETE",
  "eventTime": "2021-11-22T08:12:15.478Z",
  "run": {
    "runId": "03bfc770-2151-499e-9265-8457a38ceec3",
    "facets": {
      "spark_version": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet",
        "spark-version": "3.1.1-amzn-0",
        "openlineage-spark-version": "0.3.1"
      }
    }
  },
  "job": {
    "namespace": "spark_integration",
    "name": "nyc_taxi_raw_stage.map_partitions_union_map_partitions_new_hadoop"
  },
  "inputs": [
    {
      "namespace": "s3.cdkdl-dev-foundationstoragef3787fa8-raw1d6fb60a-171gwxf2sixt9",
      "name": ""
    }
  ],
  "outputs": [],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark",
  "schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"
}
```

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 13:34:59

*Thread Reply:* This sink record is missing details …

2021-11-22 08:12:15,481 INFO [Thread-7] sinks.HadoopDataSink (HadoopDataSink.scala:$anonfun$writeDynamicFrame$1(275)): nameSpace: , table:

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 13:40:30

*Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.

John Thomas (john@datakin.com)
2021-11-22 14:31:06

*Thread Reply:* Are you using the existing spark integration for the spark lineage?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 14:46:47

*Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow In the Glue context I was not clear on the correct settings for "spark.openlineage.parentJobName" and "spark.openlineage.parentRunId", so I put in static values (which may be incorrect). I injected these via: "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 14:47:54

*Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.

John Thomas (john@datakin.com)
2021-11-22 15:03:31

*Thread Reply:* yeah, We haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 15:12:15

*Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). It is emitting the above for each transform in the Glue job, but does not seem to capture the output …

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 15:13:54

*Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 15:25:30

*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README. Mine from AWS Glue:
```
21/11/22 18:48:48 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
21/11/22 18:48:49 INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=http://ec2-….compute-1.amazonaws.com:5000, version=v1, namespace=spark_integration, jobName=default, parentRunId=null, apiKey=Optional.empty) URI: http://ec2-….compute-1.amazonaws.com:5000/api/v1/lineage
21/11/22 18:48:49 INFO AsyncEventQueue: Process of event SparkListenerApplicationStart(nyc-taxi-raw-stage,Some(spark-application-1637606927106),1637606926281,spark,None,None,None) by listener OpenLineageSparkListener took 1.092252643s.
```

John Thomas (john@datakin.com)
2021-11-22 16:12:40

*Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-22 16:43:23

*Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?

John Thomas (john@datakin.com)
2021-11-22 16:49:50

*Thread Reply:* yeah, this thread will work great 🙂

Ilya Davidov (idavidov@marpaihealth.com)
2022-07-18 11:37:02

*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?

Francis McGregor-Macdonald (francis@mc-mac.com)
2022-07-18 15:14:47

*Thread Reply:* Just DM'd you the code I used a while back (app.py + CDK code). I haven't used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames were not working yet with lineage. Let me know how you go. I haven't had the space to look at it in a while, but happy to support if you are looking at it.

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-11-23 08:48:51

How do you use OpenLineage with Amundsen?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-23 09:01:11

*Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444

John Thomas (john@datakin.com)
2021-11-23 09:38:44

*Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-11-23 08:49:16

Do we need to use Marquez?

Willy Lulciuc (willy@datakin.com)
2021-11-23 12:45:34

*Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas

Willy Lulciuc (willy@datakin.com)
2021-11-23 12:47:01

*Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-11-23 13:51:00

*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?

Willy Lulciuc (willy@datakin.com)
2021-11-23 13:57:13

*Thread Reply:* Maybe, but that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so I haven't had a chance to look deeply into the implementation. But, briefly looking over the config for OpenLineageTableLineageExtractor, you can only send metadata to Atlas

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-11-24 00:36:56

*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow, imported into Amundsen

Vinith Krishnan US (vinithk@nvidia.com)
2022-03-11 22:33:39

*Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?

bitsofinfo (bitsofinfo.g@gmail.com)
2021-11-24 11:41:31

Hi all - I just watched the presentation on this and Marquez from the Airflow Summit. I was pretty impressed with this. My question is: what other open source players are in this space, or are people pretty much consolidating around this (which would be great)? I was looking at the available datasource extractors for the airflow side and would hope to see more there; looking at the code it doesn't seem like too huge of a deal. Is there a roadmap available?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-24 11:49:14

*Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects

Martin Fiser (fisa@keboola.com)
2021-11-24 19:24:48

Hi all, I was wondering what the status is of native support of OpenLineage for DataHub or Amundsen. re https://openlineage.slack.com/archives/C01CK9T7HKR/p1633633476151000?thread_ts=1633008095.115900&cid=C01CK9T7HKR Many thanks!

Martin Fiser (fisa@keboola.com)
2021-12-01 16:35:17

*Thread Reply:* Anyone? Thanks!

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-11-25 01:42:26

Our Amundsen setup has neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen?

Will Johnson (will@willj.co)
2021-11-29 23:30:12

Hello, OpenLineage folks - I'm curious if anyone here has ran into an issue like we're running into as we look to extend OpenLineage's Spark integration into Databricks.

Has anyone run into an issue where a Scala class should exist (based on a decompiled jar, I see that it's a public class) but you keep getting an error like object SqlDWRelation in package sqldw cannot be accessed in package com.databricks.spark.sqldw?

Databricks has a Synapse SQL DW connector: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html

I want to extract the database URL, table, and schema from the logical plan, but I'm stuck.

I execute something like the command below, which runs a SELECT * on the given tableName ("borrower" in this case) in the Azure Synapse database.

```
val df = spark.read.format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrl)
  .option("tempDir", tempDir)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", tableName)
  .load()
val logicalPlan = df.queryExecution.logical
val logicalRelation = logicalPlan.asInstanceOf[LogicalRelation]
val sqlBaseRelation = logicalRelation.relation
```
I end up with something like this, all good so far:
```
logicalPlan: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Relation[memberId#97,residentialState#98,yearsEmployment#99,homeOwnership#100,annualIncome#101,incomeVerified#102,dtiRatio#103,lengthCreditHistory#104,numTotalCreditLines#105,numOpenCreditLines#106,numOpenCreditLines1Year#107,revolvingBalance#108,revolvingUtilizationRate#109,numDerogatoryRec#110,numDelinquency2Years#111,numChargeoff1year#112,numInquiries6Mon#113] SqlDWRelation("borrower")

logicalRelation: org.apache.spark.sql.execution.datasources.LogicalRelation =
Relation[memberId#97,residentialState#98,yearsEmployment#99,homeOwnership#100,annualIncome#101,incomeVerified#102,dtiRatio#103,lengthCreditHistory#104,numTotalCreditLines#105,numOpenCreditLines#106,numOpenCreditLines1Year#107,revolvingBalance#108,revolvingUtilizationRate#109,numDerogatoryRec#110,numDelinquency2Years#111,numChargeoff1year#112,numInquiries6Mon#113] SqlDWRelation("borrower")

sqlBaseRelation: org.apache.spark.sql.sources.BaseRelation = SqlDWRelation("borrower")
```
The schema I can easily get with `sqlBaseRelation.schema`, but I cannot figure out:

  1. How I can get the database name from the logical relation
  2. How I can get the table name from the logical relation ("borrower" is the table name, so I can always parse the string if necessary)

I know that Databricks has the SqlDWRelation class, which I think I need to cast the BaseRelation to, BUT it appears to be in a jar / package that is inaccessible during the execution of a notebook. Specifically, com.databricks.spark.sqldw.SqlDWRelation is the relation and it appears to have a few accessors that would help me answer some of these questions: params and JDBCWrapper

Of course this is undocumented on the Databricks side 😰

If I could cast the BaseRelation into this SqlDWRelation, I'd be able to get this info. However, whenever I attempt to use the imported SqlDWRelation, I get the error object SqlDWRelation in package sqldw cannot be accessed in package com.databricks.spark.sqldw. I'm hoping someone has run into something similar in the past on the Spark / Databricks / Scala side and might share some advice. Thank you for any guidance!

Will Johnson (will@willj.co)
2021-11-30 11:21:34

*Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!

🙌 Maciej Obuchowski
Will Johnson (will@willj.co)
2021-11-30 15:20:18

*Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...

Thank you so much for pointing to those utilities!

❤️ Julien Le Dem
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-11-30 15:48:25

*Thread Reply:* Glad I could help!

Francis McGregor-Macdonald (francis@mc-mac.com)
2021-11-30 19:43:03

A colleague pointed me at https://open-metadata.org/. Is there anywhere a view or comparison of this and OpenLineage?

Mario Measic (mario.measic.gavran@gmail.com)
2021-12-01 08:51:28

*Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of running jobs. It keeps track of all the metadata (schema, ...) of inputs and outputs at the time a transformation occurs, plus transformation metadata (code version, cost, etc.)

On OM I am not an expert, but it's a metadata model with clients and an API around it.

RamanD (romantanzar@gmail.com)
2021-12-01 12:33:51

Hey! OpenLineage is a beautiful initiative, to be honest! We are also trying to adopt it. One question (many apologies if it's already described somewhere): if we need to propagate the run id from Airflow to a child task (an AWS Batch job, for instance), what is the best way to do it in the current implementation (as we get the run id only at the post_execute phase)? We use the Airflow 2+ integration.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 12:40:53

*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 12:41:02

*Thread Reply:* ```
def lineage_parent_id(run_id, task):
    """
    Macro function which returns the generated job and run id for a given task.
    This can be used to forward the ids from a task to a child run so the job
    hierarchy is preserved. Child run can create ParentRunFacet from those ids.
    Invoke as a jinja template, e.g.

    PythonOperator(
        task_id='render_template',
        python_callable=my_task_function,
        op_args=['{{ lineage_parent_id(run_id, task) }}'],  # lineage_run_id macro invoked
        provide_context=False,
        dag=dag
    )

    :param run_id:
    :param task:
    :return:
    """
    with create_session() as session:
        job_name = openlineage_job_name(task.dag_id, task.task_id)
        ids = JobIdMapping.get(job_name, run_id, session)
        if ids is None:
            return ""
        elif isinstance(ids, list):
            run_id = "" if len(ids) == 0 else ids[0]
        else:
            run_id = str(ids)
        return f"{_DAG_NAMESPACE}/{job_name}/{run_id}"


def openlineage_job_name(dag_id: str, task_id: str) -> str:
    return f'{dag_id}.{task_id}'
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 12:41:13

*Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
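On the child side, the forwarded "namespace/job_name/run_id" string can be split back apart to build a ParentRunFacet (a hedged sketch; double-check the ParentRunFacet.create helper in the client version you use):

```python
from openlineage.client.facet import ParentRunFacet

def parent_run_facet(lineage_parent_id: str) -> ParentRunFacet:
    # The macro above returns "namespace/job_name/run_id".
    namespace, job_name, run_id = lineage_parent_id.split("/")
    return ParentRunFacet.create(runId=run_id, namespace=namespace, name=job_name)
```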

RamanD (romantanzar@gmail.com)
2021-12-01 12:53:27

*Thread Reply:* the quickest response ever! And that works like a charm 🙌

👍 Michael Collado
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 13:21:16

*Thread Reply:* Glad I could help!

Will Johnson (will@willj.co)
2021-12-01 14:14:23

@Maciej Obuchowski and @Michael Collado given your work on the Spark Integration, what's the right way to explore the Write operations' logical plans? When doing a read, it's easy! In scala df.queryExecution.logical gives you exactly what you need but how do you guys interactively explore what sort of commands are being used during a write? We are exploring some of the DataSourceV2 data sources and are hoping to learn from you guys a bit more, please 😃

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 14:18:00

*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in scala-shell is helpful:

spark.sql("EXPLAIN EXTENDED CREATE TABLE tbl USING delta LOCATION '/tmp/delta' AS SELECT ** FROM tmp").show(false) ```|== Parsed Logical Plan == 'CreateTableAsSelectStatement [tbl], delta, /tmp/delta, false, false +- 'Project [**] +- 'UnresolvedRelation [tmp], [], false

== Analyzed Logical Plan ==

CreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, [provider=delta, location=/tmp/delta], false +- Project [x#12, y#13] +- SubqueryAlias tmp +- LocalRelation [x#12, y#13]

== Optimized Logical Plan == CreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, [provider=delta, location=/tmp/delta], false +- LocalRelation [x#12, y#13]

== Physical Plan == AtomicCreateTableAsSelect org.apache.spark.sql.delta.catalog.DeltaCatalog@63c5b63a, default.tbl, LocalRelation [x#12, y#13], [provider=delta, location=/tmp/delta, owner=mobuchowski], [], false +- LocalTableScan [x#12, y#13] |```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 14:27:25

*Thread Reply:* For the dataframe API, I'm usually either logging the plan to console from the OpenLineage listener, or looking at the spark.logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.

🙌 Will Johnson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-01 14:27:40

*Thread Reply:* For example, for the query I've sent in the comment above, the spark.logicalPlan facet looks like this:

"spark.logicalPlan": { "_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.4.0-SNAPSHOT/integration/spark>", "_schemaURL": "<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>", "plan": [ { "allowExisting": false, "child": [ { "class": "org.apache.spark.sql.catalyst.plans.logical.LocalRelation", "data": null, "isStreaming": false, "num-children": 0, "output": [ [ { "class": "org.apache.spark.sql.catalyst.expressions.AttributeReference", "dataType": "integer", "exprId": { "id": 2, "jvmId": "e03e2860-a24b-41f5-addb-c35226173f7c", "product-class": "org.apache.spark.sql.catalyst.expressions.ExprId" }, "metadata": {}, "name": "x", "nullable": false, "num-children": 0, "qualifier": [] } ], [ { "class": "org.apache.spark.sql.catalyst.expressions.AttributeReference", "dataType": "integer", "exprId": { "id": 3, "jvmId": "e03e2860-a24b-41f5-addb-c35226173f7c", "product-class": "org.apache.spark.sql.catalyst.expressions.ExprId" }, "metadata": {}, "name": "y", "nullable": false, "num-children": 0, "qualifier": [] } ] ] } ], "class": "org.apache.spark.sql.execution.command.CreateViewCommand", "name": { "product-class": "org.apache.spark.sql.catalyst.TableIdentifier", "table": "tmp" }, "num-children": 0, "properties": null, "replace": true, "userSpecifiedColumns": [], "viewType": { "object": "org.apache.spark.sql.catalyst.analysis.LocalTempView$" } } ] },

Will Johnson (will@willj.co)
2021-12-01 14:38:55

*Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞

Will Johnson (will@willj.co)
2021-12-01 14:39:02

*Thread Reply:* Thank you as usual!!

Michael Collado (collado.mike@gmail.com)
2021-12-01 14:40:25

*Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones

Will Johnson (will@willj.co)
2021-12-01 14:47:20

*Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!

Ricardo Gaspar (ricardogaspar2@gmail.com)
2021-12-03 11:49:10

hi everyone! 👋 Very noob question here: I've been wanting to play with Marquez and OpenLineage for my company's projects. I use mostly Scala & Spark, but also Airflow. I've been reading and watching talks about OpenLineage and Marquez, but so far I haven't quite figured out whether Marquez or OpenLineage does field-level lineage (with Spark), like Spline tries to.

Any idea?

Other sources about this topic:
• https://medium.com/cdapio/data-integration-with-field-level-lineage-5d9986524316
• https://medium.com/cdapio/field-level-lineage-part-1-3cc5c9e1d8c6
• https://medium.com/cdapio/designing-field-level-lineage-part-2-b6c7e6af5bf4
• https://www.youtube.com/playlist?list=PL897MHVe_nHeEQC8UnCfXecmZdF0vka_T
• https://www.youtube.com/watch?v=gKYGKXIBcZ0
• https://www.youtube.com/watch?v=eBep6rRh7ic

🙌 Francis McGregor-Macdonald
John Thomas (john@datakin.com)
2021-12-03 11:55:17

*Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂

To date, we’ve been working on our integrations library, making it as easy to set up as possible.

Ricardo Gaspar (ricardogaspar2@gmail.com)
2021-12-03 12:01:25

*Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. I’ll keep an eye on it

Julien Le Dem (julien@apache.org)
2021-12-06 20:25:19

The next OpenLineage monthly meeting is this Wednesday at 9am PT (everybody is welcome to join). The slides are here: https://docs.google.com/presentation/d/1q2Be7WTKlIhjLPgvH-eXAnf5p4w7To9v/edit#slide=id.ge4b57c6942_0_75
Tentative agenda:
• SPDX headers [Mandy Chessel]
• Azure Purview + OpenLineage [Will Johnson, Mark Taylor]
• Logging backend (OpenTelemetry, ...) [Julien Le Dem]
• Open discussion
Please chime in in this thread if you’d like to add something.

Julien Le Dem (julien@apache.org)
2021-12-06 20:28:09

*Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Julien Le Dem (julien@apache.org)
2021-12-06 20:28:25

*Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-12-06 22:37:29

@John Thomas we at Condé Nast are currently exploring the features of OpenLineage to integrate with Databricks (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks), but the Spark configuration is not working.

Michael Collado (collado.mike@gmail.com)
2021-12-08 02:03:37

*Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-12-09 10:15:50

*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the Spark extra listener and placed the jars as well, but when I ran the Spark job, lineage did not get tracked into Marquez.

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-12-09 10:34:39

```
{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
OpenLineageHttpException(code=0, message=java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1], details=java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1])
    at io.openlineage.spark.agent.OpenLineageContext.emit(OpenLineageContext.java:48)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:122)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$3(OpenLineageSparkListener.java:159)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:148)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1585)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```

Dinakar Sundar (dinakar_sundar@condenast.com)
2021-12-09 13:29:42

*Thread Reply:* Issue solved: I had mentioned the version wrongly as 1 instead of v1.

🙌 Michael Collado
Jitendra Sharma (jitendra_sharma@condenast.com)
2021-12-07 02:07:06

👋 Hi everyone!

👋 Willy Lulciuc, Maciej Obuchowski
kavuri raghavendra (kavuri.raghavendra@gmail.com)
2021-12-08 05:37:44

Hello everyone, we are exploring OpenLineage for capturing Spark lineage, but from the GitHub repo (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark) I see that the output is sent to an API (Marquez). How can I send it to a Kafka topic instead? Can somebody please guide me on this?

Minkyu Park (minkyu@datakin.com)
2021-12-08 12:15:38

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files

There’s an ongoing PR for a proxy backend, which opens an HTTP API and redirects events to Kafka.

John Thomas (john@datakin.com)
2021-12-08 12:17:38

*Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.

For now, you'll need to make something to capture the HTTP API events and send them to the Kafka topic. Changing the spark.openlineage.url parameter will send the runEvents wherever you like, but obviously you can't directly produce HTTP events to a topic.
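
If you want to experiment before the proxy backend lands, such a relay can be quite small. A rough sketch (untested; the port, topic name, and broker address below are placeholders, and it assumes the integration POSTs events to /api/v1/lineage under the configured URL):

```
import com.sun.net.httpserver.HttpServer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class OpenLineageKafkaRelay {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Listen where spark.openlineage.url points, e.g. http://this-host:8080
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/api/v1/lineage", exchange -> {
            try (InputStream body = exchange.getRequestBody()) {
                // Forward the raw runEvent JSON unchanged to the topic
                String event = new String(body.readAllBytes(), StandardCharsets.UTF_8);
                producer.send(new ProducerRecord<>("openlineage-events", event)); // placeholder topic
            }
            exchange.sendResponseHeaders(201, -1); // 201 Created, empty response body
            exchange.close();
        });
        server.start();
    }
}
```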

kavuri raghavendra (kavuri.raghavendra@gmail.com)
2021-12-08 22:13:09

*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.

Minkyu Park (minkyu@datakin.com)
2021-12-09 12:57:10

*Thread Reply:* Not sure about the release plan, but the HTTP endpoint is just a regular RESTful API, and you can write a super simple proxy for your own use case if you want.

🙌 Will Johnson
Will Johnson (will@willj.co)
2021-12-12 00:13:54

Hi, Open Lineage team - For the Spark Integration, I'm looking to extract information from a DataSourceV2 data source.

I'm working on the WRITE side of the data source and right now I'm touching the AppendData logical plan (I can't find the Java Doc): https://github.com/rdblue/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L446

I was able to extract out the table name (from the named relation) but I'm struggling getting out the schema next.

I noticed that AppendData offers inputSet, schema, and outputSet:
• inputSet gives me an AttributeSet which does contain the names of my columns (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala#L69)
• schema returns an empty StructType
• outputSet is an empty AttributeSet
I thought I read in the Spark Internals book that outputSet would only be populated if there was some sort of change to the DataFrame columns, but I cannot find that page, and searching for spark outputSet turns up few relevant results.

Has anyone else worked with the AppendData plan and gotten the schema out of it? Am I going down the wrong path with this snippet of code below? Thank you for any guidance!

```
if (logical instanceof AppendData) {
    AppendData appendOp = (AppendData) logical;
    NamedRelation namedRel = appendOp.table();
    log.info(namedRel.name());                  // Works great!
    log.info(appendOp.inputSet().toString());   // This will get you a rough schema
    StructType schema = appendOp.schema();      // This is an empty StructType
    log.info(schema.json());                    // Nothing useful here
}
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-12 07:34:13

*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂

This one should be good: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L72

About the schema: looking at AppendData's query schema should work if there's no change to columns, because to pass analysis the data being inserted has to match the table's schema. I would test that though 🙂
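
For example, something along these lines (an untested sketch, assuming the Spark 3 catalyst classes):

```
import org.apache.spark.sql.catalyst.plans.logical.AppendData;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import org.apache.spark.sql.types.StructType;

// Returns the schema of the data being written by an AppendData node, or null.
static StructType writeSchemaOf(LogicalPlan logical) {
    if (logical instanceof AppendData) {
        AppendData appendOp = (AppendData) logical;
        // The child query carries the data being written; after analysis its
        // schema has to line up with the target table's schema.
        return appendOp.query().schema();
    }
    return null;
}
```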

On the other hand, current AppendDataVisitor just looks at AppendData's table and tries to extract dataset from it using list of common output visitors:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/co[…]o/openlineage/spark/agent/lifecycle/plan/AppendDataVisitor.java

In this case, the DataSourceV2RelationVisitor would look at it, provided we're using Spark 3:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/sp[…]ge/spark3/agent/lifecycle/plan/DataSourceV2RelationVisitor.java

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-12 07:37:04

*Thread Reply:* In this case, we basically need more info about the nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in the main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6

Will Johnson (will@willj.co)
2021-12-13 14:54:13

*Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞

Is there any other method for determining the type of DataSourceV2Relation?

Will Johnson (will@willj.co)
2021-12-13 14:57:06

*Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:

I merely needed to use DataSourceV2Relation rather than NamedRelation!

```
DataSourceV2Relation relation = (DataSourceV2Relation) appendOp.table();
log.info(relation.schema().toString());
log.info(relation.name());
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-15 06:22:05

*Thread Reply:* I guess you can use object.getClass().getCanonicalName() to find out whether the passed class matches the one that the Cosmos provider uses.
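
Something like this, for example - note I don't know the exact class name the Cosmos provider uses, so treat the package prefix below as a placeholder to verify against the connector jar:

```
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation;

static boolean isCosmosRelation(DataSourceV2Relation relation) {
    String name = relation.table().getClass().getCanonicalName();
    // placeholder prefix - check the connector's actual table class name
    return name != null && name.startsWith("com.azure.cosmos.spark");
}
```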

Will Johnson (will@willj.co)
2021-12-15 09:53:24

*Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂

Will Johnson (will@willj.co)
2021-12-15 09:53:28

*Thread Reply:* Thank you so much!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-15 10:09:39

*Thread Reply:* Glad to help 😄

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-15 10:22:58

*Thread Reply:* @Will Johnson could you tell me which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working on?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-15 10:24:14

*Thread Reply:* If any, of course 🙂

Will Johnson (will@willj.co)
2021-12-15 10:49:31

*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.

👍 Maciej Obuchowski
Will Johnson (will@willj.co)
2021-12-22 22:43:34

*Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-23 05:04:36

*Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.

Michael Collado (collado.mike@gmail.com)
2021-12-14 19:45:59

Finally got this doc posted - https://github.com/OpenLineage/OpenLineage/pull/437 (see the readable version here ) Looking for feedback, @Willy Lulciuc @Maciej Obuchowski @Will Johnson

Will Johnson (will@willj.co)
2021-12-15 10:54:41

*Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor?

Right now, OpenLineage checks, based on the provider property, whether it's an Iceberg or Delta provider.

Ideally, we'd be able to extend the list of providers or have a custom "CosmosDbDataSourceV2Visitor" that knew how to work with a custom DataSourceV2.

Would that cause any conflicts if the base class is already accounted for in OpenLineage?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-15 11:13:20

*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we just return the result of the first function for which isDefinedAt is satisfied.

This means that we can end up depending on the order of the visitors...

Michael Collado (collado.mike@gmail.com)
2021-12-15 13:59:12

*Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.

Michael Collado (collado.mike@gmail.com)
2021-12-14 19:50:57

Oh, and I forgot to post yesterday: OpenLineage 0.4.0 was released 🥳

This was a big one:
• Split tests for Spark 2 and Spark 3
• Spark output metrics
• Databricks support with init scripts
• Initial Iceberg support for Spark
• Initial Kafka support for Spark
• dbt build support
• forward compatibility for dbt versions
• lots of bug fixes 🙂
Check the full changelog for details.

🙌 Maciej Obuchowski, Will Johnson, Peter Hicks, Manuel, Peter Hanssens
Dinakar Sundar (dinakar_sundar@condenast.com)
2021-12-14 21:42:40

Hi @Michael Collado, is there any documentation on using Great Expectations with OpenLineage?

Michael Collado (collado.mike@gmail.com)
2021-12-15 11:50:47

*Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started

Michael Collado (collado.mike@gmail.com)
2021-12-15 11:51:04

*Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo

👍 Madhu Maddikera, Dinakar Sundar
Carlos Meza (omar.m.8x@gmail.com)
2021-12-15 09:52:51

Hello! I am new to OpenLineage, awesome project!! Does anybody know about an integration with Deequ, or a way to capture dataset stats with OpenLineage? Thanks, I appreciate the help!

Michael Collado (collado.mike@gmail.com)
2021-12-15 19:01:50

*Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.

Check the great expectations integration to see how those facets are being used

Bruno González (brugms2@gmail.com)
2022-05-24 06:20:50

*Thread Reply:* This is great. Thanks @Michael Collado!

Anatoliy Zhyzhkevych (Anatoliy.Zhyzhkevych@franklintempleton.com)
2021-12-19 22:40:33

Hi,

I am testing OpenLineage/Marquez 0.4.0 with dbt 1.0.0 using dbt-ol build. It seems 12 events were generated, but the UI shows only a history of runs, with "Nothing to show here" in the detail section about dataset/test failures in the dbt namespace. The warehouse namespace shows lineage but no details about dataset/test failures.

Please advise.

```
02:57:54 Done. PASS=4 WARN=0 ERROR=3 SKIP=2 TOTAL=9
02:57:54 Error sending message, disabling tracking
Emitting OpenLineage events: 100%|██████████████████████████████████████████████████████| 12/12 [00:00<00:00, 12.50it/s]
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-20 04:15:51

*Thread Reply:* This is nothing to show here when you click on test node, right? What about run node?

Anatoliy Zhyzhkevych (Anatoliy.Zhyzhkevych@franklintempleton.com)
2021-12-20 12:28:21

*Thread Reply:* There is no details about failure.

```
dbt-ol build -t DEV --profile cdp --profiles-dir /c/Work/dbt/cdp100/profiles --project-dir /c/Work/dbt/cdp100 --select +riskrawmastersharedshareclass
Running OpenLineage dbt wrapper version 0.4.0
This wrapper will send OpenLineage events at the end of dbt execution.
02:57:21 Running with dbt=1.0.0
02:57:23 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources. There are 1 unused configuration paths:
  • models.cdp.risk.raw.liquidity.shared
02:57:23 Found 158 models, 181 tests, 0 snapshots, 0 analyses, 574 macros, 0 operations, 2 seed files, 56 sources, 1 exposure, 0 metrics
02:57:35 Concurrency: 10 threads (target='DEV')
02:57:35 1 of 9 START test dbtexpectationssourceexpectcompoundcolumnstobeuniquebsesharedpbshareclassEDMPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [RUN]
02:57:37 1 of 9 PASS dbtexpectationssourceexpectcompoundcolumnstobeuniquebsesharedpbshareclassEDMPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [PASS in 2.67s]
02:57:37 2 of 9 START view model REPL.SHARECLASSDIM.................................... [RUN]
02:57:39 2 of 9 OK created view model REPL.SHARECLASSDIM............................... [SUCCESS 1 in 2.12s]
02:57:39 3 of 9 START test dbtexpectationsexpectcompoundcolumnstobeuniquerawreplpbsharedshareclassRISKPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [RUN]
02:57:43 3 of 9 PASS dbtexpectationsexpectcompoundcolumnstobeuniquerawreplpbsharedshareclassRISKPORTFOLIOIDSHARECLASSCODEanyvalueismissingDELETEDFLAGFalse [PASS in 3.42s]
02:57:43 4 of 9 START view model RAWRISKDEV.STG.SHARECLASSDIM........................ [RUN]
02:57:46 4 of 9 OK created view model RAWRISKDEV.STG.SHARECLASSDIM................... [SUCCESS 1 in 3.44s]
02:57:46 5 of 9 START view model RAWRISKDEV.MASTER.SHARECLASSDIM..................... [RUN]
02:57:46 6 of 9 START test relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawstgsharedsecurity_ [RUN]
02:57:46 7 of 9 START test relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio_ [RUN]
02:57:51 5 of 9 ERROR creating view model RAWRISKDEV.MASTER.SHARECLASSDIM............ [ERROR in 4.31s]
02:57:51 8 of 9 SKIP test relationshipsriskrawmastersharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawmastersharedsecurity_ [SKIP]
02:57:51 9 of 9 SKIP test relationshipsriskrawmastersharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawmastersharedportfolio_ [SKIP]
02:57:52 7 of 9 FAIL 7282 relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio_ [FAIL 7282 in 5.41s]
02:57:54 6 of 9 FAIL 6520 relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTIDrefriskrawstgsharedsecurity_ [FAIL 6520 in 7.23s]
02:57:54 Finished running 6 tests, 3 view models in 30.71s.
02:57:54 Completed with 3 errors and 0 warnings:
02:57:54 Database Error in model riskrawmastersharedshareclass (models/risk/raw/master/shared/riskrawmastersharedshareclass.sql)
02:57:54   002003 (42S02): SQL compilation error:
02:57:54   Object 'RAWRISKDEV.AUDIT.STGSHARECLASSDIMRELATIONSHIPRISKINSTRUMENTID' does not exist or not authorized.
02:57:54   compiled SQL at target/run/cdp/models/risk/raw/master/shared/riskrawmastersharedshareclass.sql
02:57:54 Failure in test relationshipsriskrawstgsharedshareclassRISKPORTFOLIOIDRISKPORTFOLIOIDrefriskrawstgsharedportfolio (models/risk/raw/stg/shared/riskrawstgsharedschema.yml)
02:57:54   Got 7282 results, configured to fail if != 0
02:57:54   compiled SQL at target/compiled/cdp/models/risk/raw/stg/shared/riskrawstgsharedschema.yml/relationshipsriskrawstgsha19e10fb324f7d0cccf2aab512683f693.sql
02:57:54 Failure in test relationshipsriskrawstgsharedshareclassRISKINSTRUMENTIDRISKINSTRUMENTID_refriskrawstgsharedsecurity_ (models/risk/raw/stg/shared/riskrawstgsharedschema.yml)
02:57:54   Got 6520 results, configured to fail if != 0
02:57:54   compiled SQL at target/compiled/cdp/models/risk/raw/stg/shared/riskrawstgsharedschema.yml/relationshipsriskrawstgsha_e3148a1627817f17f7f5a9eb841ef16f.sql
02:57:54 See test failures:

select * from RAWRISKDEV.AUDIT.STGSHARECLASSDIMrelationship_RISKINSTRUMENT_ID

02:57:54 Done. PASS=4 WARN=0 ERROR=3 SKIP=2 TOTAL=9
02:57:54 Error sending message, disabling tracking
Emitting OpenLineage events: 100%|██████████████████████████████████████████████████████| 12/12 [00:00<00:00, 12.50it/s]
Emitted 14 openlineage events
(dbt) linux@dblnbk152371:/c/Work/dbt/cdp$
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-20 12:30:20

*Thread Reply:* I'm talking about clicking on a non-test node in the Marquez UI - the screenshots you shared show you clicked on the one ending in test.

Anatoliy Zhyzhkevych (Anatoliy.Zhyzhkevych@franklintempleton.com)
2021-12-20 16:46:11

*Thread Reply:* There are two types of failures: tests failed on the stage model (relationships), and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but no details of the failure. Master model tests were skipped because of the model failure, but the UI reports "Complete".

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-20 18:11:50

*Thread Reply:* If I understood correctly, for a model you would like OpenLineage to capture the error message, like this one:
```
22:52:07 Database Error in model customers (models/customers.sql)
22:52:07   Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "PLEASE_REMOVE" at [56:12]
22:52:07   compiled SQL at target/run/jaffle_shop/models/customers.sql
```
And for dbt test failures, to better visualize that an error is happening, for example like that:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-20 18:23:12

*Thread Reply:* We actually do the first one for Airflow and Spark; I missed it for dbt 😞

I created an issue to add it to the spec in a generic way: https://github.com/OpenLineage/OpenLineage/issues/446

Anatoliy Zhyzhkevych (Anatoliy.Zhyzhkevych@franklintempleton.com)
2021-12-20 22:49:54

*Thread Reply:* Sounds great. Failed/skipped tests and models could be color-coded as well. Thanks.

Jorge Reyes (Zenta Group) (jorge.reyes@zentagroup.com)
2021-12-22 12:37:00

hello everyone, I'm learning OpenLineage. I am trying to connect it with Airflow 2 - is that possible, or is that version not yet supported? This is what Airflow is currently throwing at me:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-22 12:38:26

*Thread Reply:* Hey. If you're using Airflow 2, you should use the LineageBackend method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental
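
In short: pip install openlineage-airflow, set AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend in the scheduler environment, and point OPENLINEAGE_URL at your OpenLineage-compatible backend, as described in that README.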

🙌 Jorge Reyes (Zenta Group)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2021-12-22 12:39:06

*Thread Reply:* You don't need to do anything with DAG import then.

Jorge Reyes (Zenta Group) (jorge.reyes@zentagroup.com)
2021-12-22 12:40:30

*Thread Reply:* Thanks!!!!! i'll try

Michael Collado (collado.mike@gmail.com)
2021-12-27 16:49:20

The PR at https://github.com/OpenLineage/OpenLineage/pull/451 should be everything needed to complete the implementation for https://github.com/OpenLineage/OpenLineage/pull/437 . The PR is in draft mode, as I still need ~1 day to update the integration test expectations to match the refactoring (there are some new events, but from my cursory look, the old events still match expected contents). But I think it's in a state that can be reviewed before the tests are updated.

There are two other PRs that this one is based on - broken up for easier reviewing • https://github.com/OpenLineage/OpenLineage/pull/447https://github.com/OpenLineage/OpenLineage/pull/448

Michael Collado (collado.mike@gmail.com)
2021-12-27 16:49:56

*Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆

🙌 Will Johnson, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-01-07 15:25:11

The next OpenLineage Technical Steering Committee meeting is Wednesday, January 12! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT.  Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome. Agenda: • OpenLineage 0.4 and 0.5 releases • Egeria version 3.4 support for OpenLineage • Airflow TaskListener to simplify OpenLineage integration [Maciej] • Open Discussion Notes: https://tinyurl.com/openlineagetsc

🙌 Maciej Obuchowski, Ross Turk, John Thomas, Minkyu Park, Joshua Wankowski, Dalin Kim
David Virgil (david.virgil.naranjo@googlemail.com)
2022-01-11 12:16:09

Hello community,

We are able to post this datasource in Marquez, but the information in the facet with the datasource is not displayed in the UI.

We want to display the S3 location (URI) that this datasource is pointing to:
```
{
  id: { namespace: "s3://hbi-dns-staging", name: "PCHG" },
  type: "DB_TABLE",
  name: "PCHG",
  physicalName: "PCHG",
  createdAt: "2022-01-11T16:15:54.887Z",
  updatedAt: "2022-01-11T16:56:04.093153Z",
  namespace: "s3://hbi-dns-staging",
  sourceName: "s3://hbi-dns-staging",
  fields: [],
  tags: [],
  lastModifiedAt: null,
  description: null,
  currentVersion: "c565864d-1a66-4cff-a5d9-2e43175cbf88",
  facets: {
    dataSource: {
      uri: "s3://hbi-dns-staging/sql-runner/2022-01-11/PCHG.avro",
      name: "s3://hbi-dns-staging",
      _producer: "ip-172-25-23-163.dir.prod.aws.hollandandbarrett.comeu-west-1.com/172.25.23.163",
      _schemaURL: "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet"
    }
  }
}
```

David Virgil (david.virgil.naranjo@googlemail.com)
2022-01-11 12:24:00

As you can see, there is not much info in the OpenLineage UI.

Michael Robinson (michael.robinson@astronomer.io)
2022-01-11 13:02:16

The OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1641587111000700

Julien Le Dem (julien@apache.org)
2022-01-12 11:59:44

*Thread Reply:* ^ It’s happening now!

David Virgil (david.virgil.naranjo@googlemail.com)
2022-01-14 06:46:44

Any ideas, guys, about the previous question?

Minkyu Park (minkyu@datakin.com)
2022-01-18 14:19:39

*Thread Reply:* Just to be clear, were you able to get the datasource information from the API and it's just not showing up in the UI? Or were you not able to get it from the API either?

SAM (skhettri@gmail.com)
2022-01-17 03:41:56

Hi everyone!! I am doing a POC of OpenLineage with Airflow version 2.1; before that, I would like to know if this version is supported by OpenLineage?

Conor Beverland (conorbev@gmail.com)
2022-01-18 11:40:00

*Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.

✅ SAM
❤️ Julien Le Dem
SAM (skhettri@gmail.com)
2022-01-18 20:35:52

*Thread Reply:* thank you. 🙂

SAM (skhettri@gmail.com)
2022-01-17 06:47:54

Hello there, I’m using Airflow version 2.1.0 in Docker; below were the steps I performed, but I encountered an error. Please help:

  1. Inside the requirements.txt file I added openlineage-airflow. Then ran pip install -r requirements.txt.
  2. Added environmental variable using this command export AIRFLOW__LINEAGE__BACKEND = openlineage.lineage_backend.OpenLineageBackend
  3. Then configured HTTP Backend environment variables inside “airflow” folder: export OPENLINEAGE_URL=<http://marquez:5000>
  4. Ran Marquez using ./docker/up.sh & open web frontend UI and saw below error msg:
Conor Beverland (conorbev@gmail.com)
2022-01-18 11:30:38

*Thread Reply:* hey, I'm aware of one small bug ( which will be fixed in the upcoming OpenLineage 0.5.0 ) which means you would also have to include google-cloud-bigquery in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438

✅ SAM
Conor Beverland (conorbev@gmail.com)
2022-01-18 11:31:51

*Thread Reply:* The other thing I think you should check is: did you define the AIRFLOW__LINEAGE__BACKEND variable correctly? What you pasted above looks a little odd around the = signs - a shell export can't have spaces there, i.e. it should be export AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend

Conor Beverland (conorbev@gmail.com)
2022-01-18 11:34:25

*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like: INFO - Constructing openlineage client to send events to

Conor Beverland (conorbev@gmail.com)
2022-01-18 11:34:47

*Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data

Conor Beverland (conorbev@gmail.com)
2022-01-18 11:34:52

*Thread Reply:* hope this helps!

SAM (skhettri@gmail.com)
2022-01-18 20:40:37

*Thread Reply:* Thank you, will try again.

Michael Collado (collado.mike@gmail.com)
2022-01-18 20:10:25

Just published OpenLineage 0.5.0. Big items here are:
• dbt-spark support
• New proxy message broker for forwarding OpenLineage messages to Kafka
• New extensibility API for the Spark integration
Accompanying tweet thread on the latter two items here: https://twitter.com/PeladoCollado/status/1483607050953232385

🙌 Maciej Obuchowski, Kevin Mellott
Michael Collado (collado.mike@gmail.com)
2022-01-19 12:39:30

*Thread Reply:* BTW, this was actually the 0.5.1 release. Because, pypi... 🤷‍♂️

Mario Measic (mario.measic.gavran@gmail.com)
2022-01-27 06:45:08

*Thread Reply:* nice on the dbt-spark support 👍

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:12:14

Hello everyone. I've been reading and watching talks about OpenLineage and Marquez. This solution is exactly what we've been looking for to add lineage to our ETLs - great work! Our ETLs are based on Postgres, Redshift, and Airflow.

I tried to implement the example, respecting all the steps required. Everything runs successfully (the two DAGs on Airflow) on host http://localhost:3000/, but nothing appeared in the Marquez UI. Am I missing something?

I'm also thinking about creating a simple ETL - pandas to pandas with some transformations - as a POC to show to my team. I really need some help.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-01-19 11:13:35

*Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?

We've just found yesterday that it somehow breaks our example...

✅ Mohamed El IBRAHIMI
Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:14:51

*Thread Reply:* yes i just installed docker on mac

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:15:02

*Thread Reply:* and docker compose version 1.29.2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-01-19 11:20:24

*Thread Reply:* What you can do is to uncheck this, do docker system prune -a and try again.

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:21:56

*Thread Reply:* done, but I get this: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-01-19 11:22:15

*Thread Reply:* Try to restart docker for mac

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-01-19 11:23:00

*Thread Reply:* It needs to show Docker Desktop is running :

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:24:01

*Thread Reply:* yeah, done. I will try to implement the example again and see. Thank you very much!

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:32:55

*Thread Reply:* I don't know why I'm getting this when I run docker-compose up:

```
WARNING: The TAG variable is not set. Defaulting to a blank string.
WARNING: The API_PORT variable is not set. Defaulting to a blank string.
WARNING: The API_ADMIN_PORT variable is not set. Defaulting to a blank string.
WARNING: The WEB_PORT variable is not set. Defaulting to a blank string.
ERROR: The Compose file './../docker-compose.yml' is invalid because:
services.api.ports contains an invalid type, it should be a number, or an object
services.api.ports contains an invalid type, it should be a number, or an object
services.web.ports contains an invalid type, it should be a number, or an object
services.api.ports value [':', ':'] has non-unique elements
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-01-19 11:46:12

*Thread Reply:* are you running it exactly like here, with respect to directories, etc?

https://github.com/MarquezProject/marquez/tree/main/examples/airflow

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:59:36

*Thread Reply:* yeah, my bad. Everything works fine now. I see the graph in the UI.

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 12:04:01

*Thread Reply:* one more question please: as I said, our ETLs are based on Postgres, Redshift, and Airflow. Any advice for us on integrating OL into our pipeline?

Mohamed El IBRAHIMI (mohamedelibrahimi700@gmail.com)
2022-01-19 11:12:17

thank you very much

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-01-19 17:29:51

I’m upgrading our OL Java client from an older version (0.2.3) and noticed that the ol.newCustomFacetBuilder() method to create custom facets no longer exists. I can see in this code diff that it might be replaced by simply adding to the additional properties of the standard element you are extending.

Can you please let me know if I’m understanding this change correctly? In other words, is the code in the diff functionally equivalent or is there a large change I should be understanding better?

https://github.com/OpenLineage/OpenLineage/compare/0.2.3...0.4.0#diff-f0381d7e68797d9ec60551c96897809072582350e1657d23425747358ec6e471L196

John Thomas (john@datakin.com)
2022-01-19 17:50:39

*Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-01-19 20:49:49

*Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I’ll refactor the code to the new format and should be good to go.

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-01-21 00:34:37

*Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.

👍 Minkyu Park, John Thomas, Julien Le Dem
SAM (skhettri@gmail.com)
2022-01-21 02:13:51

I’m not sure what went wrong. With the Airflow Docker setup, version 2.1.0, below were the steps I performed, but the Marquez UI is showing no jobs. Please help:

  1. In requirements.txt I added openlineage-airflow==0.5.1. Then ran pip install -r requirements.txt.
  2. Added environmental variable inside my airflow docker folder using this command: export AIRFLOW__LINEAGE__BACKEND = openlineage.lineage_backend.OpenLineageBackend
  3. Then configured HTTP Backend environment variables inside same airflow docker folder: export OPENLINEAGE_URL=<http://localhost:5000>
  4. Ran Marquez using ./docker/up.sh, which is in another folder. The front-end UI is not showing any jobs; it's empty.
  5. Attached is the airflow DAG log.
Ross Turk (ross@datakin.com)
2022-01-25 14:46:58

*Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I’d expect to see something about not having an extractor for the operator you are using.

Ross Turk (ross@datakin.com)
2022-01-25 14:47:53

*Thread Reply:* If you open a shell in your Airflow scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND, is it properly set? It's possible the env isn't making it all the way there.

Lena Kullab (Lena.Kullab@storable.com)
2022-01-21 13:38:37

Hi All,

I am working on a POC of OpenLineage-Airflow integration and was attempting to get it configured with Amundsen (also working on a POC). Reading through the tutorial here https://openlineage.io/integration/apache-airflow/, under the Prerequisites section it says: To use the OpenLineage Airflow integration, you'll need a running Airflow instance. You'll also need an OpenLineage compatible HTTP backend. The example uses Marquez, but I was trying to figure out how to get it to send metadata to the Amundsen graph db backend. Does the Airflow integration only support configuration with an HTTP compatible backend?

John Thomas (john@datakin.com)
2022-01-21 14:03:29

*Thread Reply:* Hi Lena! That’s correct, Openlineage is designed to send events to an HTTP backend. There’s a ticket on the future section of the roadmap to support pushing to Amundsen, but it’s not yet been worked on (Ref: Roadmap Issue #86)

Lena Kullab (Lena.Kullab@storable.com)
2022-01-21 14:08:35

*Thread Reply:* Thank you for the info!

naman shaundik (namanshaundik@gmail.com)
2022-01-30 11:01:42

hi, I am completely new to OpenLineage and Marquez. I have to integrate OpenLineage into my existing Java project, but I am completely confused about where to start. I have gone through the documentation, but I am not able to understand how to integrate OpenLineage using the Marquez HTTP backend in my existing project. Please, someone help me. I may sound naive here, but I am in dire need of help.

John Thomas (john@datakin.com)
2022-01-30 12:37:39

*Thread Reply:* what do you mean by “Integrate Openlineage”?

Can you give a little more information on what you’re trying to accomplish and what the existing project is?

naman shaundik (namanshaundik@gmail.com)
2022-01-31 03:49:22

*Thread Reply:* I work on a datalake team and we are trying to implement data lineage in our project using OpenLineage. Our project basically keeps track of datasets coming from different sources (Hive, Redshift, Elasticsearch, etc.) and jobs.

John Thomas (john@datakin.com)
2022-01-31 15:01:31

*Thread Reply:* Gotcha!

Broadly speaking, all an integration needs to do is to send runEvents to Marquez.

I'd start by understanding the OpenLineage data model, and then looking at your system to identify when / where runEvents should be sent from, and what information needs to be included.
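
As a concrete starting point, here's a minimal sketch of emitting a runEvent by hand over HTTP (the namespace, job, and dataset names are placeholders; in practice you'd probably use the openlineage-java client rather than raw HTTP):

```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.UUID;

public class EmitRunEvent {
    public static void main(String[] args) throws Exception {
        // Minimal START event; a later COMPLETE event with the same runId ends the run.
        String event = """
            {
              "eventType": "START",
              "eventTime": "%s",
              "run": { "runId": "%s" },
              "job": { "namespace": "my-namespace", "name": "my-job" },
              "inputs": [ { "namespace": "my-namespace", "name": "my-input-dataset" } ],
              "outputs": [ { "namespace": "my-namespace", "name": "my-output-dataset" } ],
              "producer": "https://example.com/my-datalake-service"
            }
            """.formatted(ZonedDateTime.now(ZoneOffset.UTC), UUID.randomUUID());

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:5000/api/v1/lineage")) // Marquez's OpenLineage endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(event))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // expect 201 from Marquez
    }
}
```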

TJ Tang (tj@tapdata.io)
2022-02-15 15:28:03

*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol for designing your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.

John Thomas (john@datakin.com)
2022-02-16 11:17:13

*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, Spark, etc.), but at its core OpenLineage is a standard and protocol.

Michael Robinson (michael.robinson@astronomer.io)
2022-02-02 19:55:13

The next OpenLineage Technical Steering Committee meeting is Wednesday, February 9. Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09
All are welcome. Agenda items are always welcome, as well; reply in thread with yours. Current agenda:
• OpenLineage 0.5.1 release
• Apache Flink effort
• Dagster integration
• Open Discussion
Notes: https://tinyurl.com/openlineagetsc

Jensen Yap (jensen@contxts.io)
2022-02-03 00:33:45

Hi everybody!

👋 Maciej Obuchowski, John Thomas
John Thomas (john@datakin.com)
2022-02-03 12:39:57

*Thread Reply:* Hello!

Albert Bikeev (albert.bikeev@gmail.com)
2022-02-04 09:36:46

Hi everybody! Very cool initiative, thank you! Is there any traction on an Apache Atlas integration? Is there some way to help you there?

John Thomas (john@datakin.com)
2022-02-04 15:07:07

*Thread Reply:* Hey Albert! There aren't any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!

I'm not super familiar with Atlas, were you thinking in terms of enabling Atlas to receive runEvents from OpenLineage connectors?

Albert Bikeev (albert.bikeev@gmail.com)
2022-02-07 05:49:16

*Thread Reply:* Hi John! Yes, exactly, it’d be nice to see Atlas as a receiver of OpenLineage events. Are there some guidelines on how to implement it? I guess we need an OpenLineage-compatible server implementation so we can receive events and send them to Atlas, right?

John Thomas (john@datakin.com)
2022-02-07 11:30:14

*Thread Reply:* exactly - This would be a change on the Atlas side. I’d start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events. Marquez is our reference implementation of OpenLineage, so I’d look around in that repo to see how it’s been implemented :)

Albert Bikeev (albert.bikeev@gmail.com)
2022-02-07 11:50:27

*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550. If it doesn’t get any traction, we at New Work might contribute as well.

John Thomas (john@datakin.com)
2022-02-07 11:56:09

*Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end

👍 Albert Bikeev
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-08 11:20:47

*Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097

Albert Bikeev (albert.bikeev@gmail.com)
2022-02-08 11:32:12

*Thread Reply:* Got it, thank you!

Juan Carlos Fernández Rodríguez (jcfernandez@keedio.com)
2022-05-04 11:12:23

*Thread Reply:* This is a quite old discussion, but isn't it possible to use the OpenLineage proxy to send the JSON to a Kafka topic and let Atlas read that JSON without any modification? A new model for Spark would need to be created, other than https://github.com/apache/atlas/blob/release-2.1.0-rc3/addons/models/1000-Hadoop/1100-spark_model.json, and uploaded to Atlas (which could be done with a call to the Atlas API). Does that make sense?

👍 Albert Bikeev
Will Johnson (will@willj.co)
2022-05-04 11:24:02

*Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!

Juan Carlos Fernández Rodríguez (jcfernandez@keedio.com)
2022-05-04 14:24:28

*Thread Reply:* Sorry for the ignorance, but what is the purpose of the bridge? The communication with Atlas should be done through Kafka, and those messages can be sent by the proxy. What am I missing?

John Thomas (john@datakin.com)
2022-05-04 16:37:33

*Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that

xiang chen (cdmikechen@hotmail.com)
2022-05-19 09:08:23

*Thread Reply:* If OpenLineage sends an event to Kafka, I think we can use Kafka Streams or Kafka Connect to rebuild the message into an Atlas event.
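
Roughly like this, as a sketch (the source topic name is a placeholder, Atlas usually consumes from its ATLAS_HOOK topic, and the actual OpenLineage-to-Atlas mapping still has to be written):

```
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OpenLineageToAtlas {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "openlineage-to-atlas");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("openlineage-events"); // placeholder topic
        events.mapValues(OpenLineageToAtlas::toAtlasEntityJson)
              .to("ATLAS_HOOK"); // the topic Atlas ingests notifications from

        new KafkaStreams(builder.build(), props).start();
    }

    private static String toAtlasEntityJson(String runEventJson) {
        // placeholder: this is where the OpenLineage run event would be mapped
        // onto Atlas's entity/notification model
        return runEventJson;
    }
}
```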

xiang chen (cdmikechen@hotmail.com)
2022-05-19 09:11:37

*Thread Reply:* @John Thomas Our company used to use Atlas as a metadata service. I have just come to know this project. After I learned how OpenLineage works, I think I can create an issue to describe my design first.

xiang chen (cdmikechen@hotmail.com)
2022-05-19 09:13:36

*Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and a design, can you create an issue directly so that we can discuss it in more detail?

Juan Carlos Fernández Rodríguez (jcfernandez@keedio.com)
2022-05-19 12:42:31

*Thread Reply:* Hi @xiang chen, we are discussing internally in my company whether to write to Atlas or another alternative. If we do this, we will share it and could involve you in some way.

Michael Robinson (michael.robinson@astronomer.io)
2022-02-04 15:02:29

Who here is working with OpenLineage at Dagster or Flink? We would love to hear about your work at the next meeting, on February 9 at 9 a.m. PT. Please reply here or message me to coordinate. @Ziyoiddin Yusupov

👍 Ziyoiddin Yusupov
Luca Soato (lucasoato@gmail.com)
2022-02-04 19:18:24

Hi everyone, OpenLineage is wonderful, we really needed something like this! Has anyone else used it with Databricks, Delta tables, or Spark? If someone is interested in these technologies, we can work together to get a POC and share some thoughts. Thanks and have a nice weekend! :)

Julius Rentergent (julius.rentergent@thetradedesk.com)
2022-02-25 13:06:16

*Thread Reply:* Hi Luca, I agree this looks really promising. I’m working on getting it to run on Databricks, but I’m only just starting out 🙂

Michael Robinson (michael.robinson@astronomer.io)
2022-02-08 12:00:02

Friendly reminder: this month’s OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1643849713216459

❤️ Kevin Mellott, John Thomas
Albert Bikeev (albert.bikeev@gmail.com)
2022-02-10 08:22:28

Hi people, one question regarding error reporting: what is the mechanism for that? E.g. if I send a duplicated job to OpenLineage, is there a way to notify me about that?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-10 09:05:39

*Thread Reply:* By duplicated, you mean with the same runId?

Albert Bikeev (albert.bikeev@gmail.com)
2022-02-10 11:40:55

*Thread Reply:* That’s only one example; it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that.

Will Johnson (will@willj.co)
2022-02-14 17:21:20

Reducing the Logging of Spark Integration

Hey, OpenLineage community! I'm curious whether there are any quick tricks/fixes to reduce the amount of logging happening in the OpenLineage Spark integration. Each job seems to print out the logical plan with INFO-level logging. The default behavior of Databricks is to print out INFO-level logs, so it gets pretty cluttered and noisy.

I'm hoping there's a feature flag that would help me shut off those kinds of logs in OpenLineage's Spark integration 🤞

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-15 05:15:12

*Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java

Will Johnson (will@willj.co)
2022-02-15 23:27:07

*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info to know that the event completed successfully, but that response and event is very verbose.

I was also thinking about here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L337-L340

and here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L405-L408

These spots are where it's printing out the full logical plan for some reason.

Can I just open up a PR and switch these to log.debug instead?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-16 04:59:17

*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.
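
(In the meantime, since the integration logs through slf4j, a log4j line like log4j.logger.io.openlineage=WARN in the cluster's log4j.properties should quiet it down - though I'd verify the logger name against the actual package.)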

Will Johnson (will@willj.co)
2022-02-16 13:35:15

[SPARK][INTEGRATION] Need Brainstorming Ideas - How to Persist / Access Spark Configs in JobEnd

Hey, OL community! I'm working on PR #490 and I finally have all tests passing, but now my desired behavior - displaying environment properties during COMPLETE / JobEnd events - is not happening 😭

The previous approach stored the Spark properties in the OpenLineageContext with a properties attribute, but I believe that was part of all of the test failures.

What are some other ways to store the jobStart's properties and make them accessible to the corresponding jobEnd? Hopefully it's okay to tag @Maciej Obuchowski, @Michael Collado, and @Paweł Leszczyński who have been extremely helpful in the past and brought great ideas to the table.

Michael Collado (collado.mike@gmail.com)
2022-02-16 13:44:30

*Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run

Michael Collado (collado.mike@gmail.com)
2022-02-16 13:47:18

*Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid. If the env facet is present on any event, it will be returned by the API

Will Johnson (will@willj.co)
2022-02-16 13:51:30

*Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.

So, we do need to maintain some state. That makes total sense, Mike.

How does Marquez handle failed jobs currently? Based on this issue (https://github.com/OpenLineage/OpenLineage/issues/436) I think Marquez would show a START but no COMPLETE event, right?

Michael Collado (collado.mike@gmail.com)
2022-02-16 14:00:03

*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀). Re: the failure events, I think job failures will currently result in one FAIL event and one COMPLETE event. The SparkListenerJobEnd event will trigger a FAIL event but the SparkListenerSQLExecutionEnd event will trigger the COMPLETE event.

Will Johnson (will@willj.co)
2022-02-16 15:16:27

*Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!

Will Johnson (will@willj.co)
2022-02-21 10:04:18

[SPARK] Connecting SparkListenerSQLExecutionStart to the various SparkListenerJobStarts

TL;DR: How can I connect the SparkListenerSQLExecutionStart to the SparkListenerJobStart events coming out of OpenLineage? The events appear to have two separate run ids and no link to indicate that the ExecutionStart event owns the subsequent JobStart events.

More Context:

Recently, I implemented a connector for Azure Synapse (a data warehouse on the Microsoft cloud) for the Spark integration, and now, with https://github.com/OpenLineage/OpenLineage/pull/490, I realize that the SparkListenerSQLExecutionStart event carries with it the necessary inputs and outputs to tell the "real" lineage. The way the Synapse connector in Databricks works is:

• SparkListenerSQLExecutionStart fires off an event with the end-to-end input and output (e.g. S3 as input and a SQL table as output).
• SparkListenerJobStart events fire off for the work that moves content from one S3 location to a "staging" location controlled by Azure Synapse. OpenLineage records this event with the S3 input, and the output is a WASB "tempfolder" (which is a temporary location and not really useful for lineage, since it will be destroyed at the end of the job).
• The final operation actually happens ALL in Synapse, and OpenLineage does not fire off an event, it seems. The Synapse database has a "COPY" command which moves the data from "tempfolder" into the database.
• Finally a SparkListenerSQLExecutionEnd event happens and the query is complete.
Ideally, I could connect the SQLExecutionStart or SQLExecutionEnd with the SparkListenerJobStart so that I can get the JobStart properties. I see that ExecutionStart has an execution id and JobStart should have the same execution id, BUT I think by the time I reach the ExecutionEnd, all the JobStart events would have been removed from the HashMap that contains all of the events in OpenLineage.

Any guidance on how to reach a JobStart's properties from an ExecutionStart or ExecutionEnd would be greatly appreciated!

Comments
7
🤔 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-22 09:02:48

*Thread Reply:* I think this scenario only happens when a Spark job spawns another "sub-job", right?

I think that maybe you can check sparkContext.getLocalProperty("spark.sql.execution.id")

> I see that ExecutionStart has an execution id and JobStart should have the same Execution Id BUT I think by the time I reach the ExecutionEND, all the JobStart events would have been removed from the HashMap that contains all of the events in OpenLineage.
But pairwise, those starts and ends should at least have the same runId, as they were created with the same OpenLineageContext, right?

Anyway, what @Michael Collado wrote on the issue is true: https://github.com/OpenLineage/OpenLineage/pull/490#issuecomment-1042011803 - you should not assume that we hold all the metadata somewhere in memory during the whole execution of the run. The backend should be able to take care of it.
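
For the JobStart side specifically, that property rides along on the event itself - a quick sketch:

```
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;

// Reads the SQL execution id a job start carries, to tie it back to the
// SparkListenerSQLExecutionStart with the same id.
public class ExecutionIdListener extends SparkListener {
    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        String executionId = jobStart.properties().getProperty("spark.sql.execution.id");
        if (executionId != null) {
            System.out.println("Job " + jobStart.jobId() + " belongs to SQL execution " + executionId);
        }
    }
}
```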

Will Johnson (will@willj.co)
2022-02-22 10:53:09

*Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞

But that is the expectation? A SparkSQLExecutionStart and JobStart SHOULD have the same execution ID, right?

I will take a look at sparkContext.getLocalProperty. Thank you so much for the reply Maciej!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-22 10:57:24

*Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as should JobStart and JobEnd events. Beyond those, it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta, emit multiple that aren't easily tied to the SQL event.

Will Johnson (will@willj.co)
2022-02-23 03:48:38

*Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:

  1. SparkSQLExecutionStart with execution id 8 happens (so it gets runid abc123). It contains the real inputs and outputs that we want.
  2. The Synapse connector starts executing JDBC commands. These commands prepare the Synapse database to connect with data that Spark will land in a staging area in the cloud. (I don't know how it's executing arbitrary commands before the official job start begins 😞)
  3. SparkJobStart with execution id 9 begins (so it gets runid jkl456). This contains the inputs and an output to a temp folder (NOT the real output we want but a staging location).
     a. There are four JobIds 0-3, all of which point back to execution id 9 with the same physical plan.
     b. After job 1, it runs more JDBC commands.
     c. I think at job 2, it runs the actual Spark code to query and join my raw input data and land it in a cloud storage account "tempfolder/".
     d. After job 3, it runs the final JDBC commands to actually move the data from "tempfolder/" to the Synapse DB.
  4. Finally, the SparkSQLExecutionEnd event occurs. I can see this in the Spark UI as well.

Because the Databricks Synapse connector somehow adds these additional JobStarts WITHOUT referencing the original SparkSQLExecutionStart execution id, we have to rely on heuristics to connect the "tempfolder/" to the real downstream table that was already provided in the ExecutionStart event 😞

I've attached the logs and a screenshot of what I'm seeing in the Spark UI. If you have a chance to take a look - it's a bit verbose, but I'd appreciate a second pair of eyes on my analysis. Hopefully I got something wrong 😅

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-02-23 07:19:01

*Thread Reply:* I think we've encountered the same stuff in Delta before 🙂

https://github.com/OpenLineage/OpenLineage/issues/388#issuecomment-964401860

Michael Collado (collado.mike@gmail.com)
2022-02-23 14:13:18

*Thread Reply:* @Will Johnson , am I reading your report correctly that the SparkListenerJobStart event is reported with a spark.sql.execution.id that differs from the execution id of the SparkSQLExecutionStart?

Michael Collado (collado.mike@gmail.com)
2022-02-23 14:18:04

*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9| 😂

Will Johnson (will@willj.co)
2022-02-23 21:56:48

*Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅

But you're exactly right, there's a SparkSQLExecutionStart event with execution id 8 and then a set of JobStart events all with execution id 9!

I don't know enough about Spark internals to say how you can just run arbitrary Scala code while making it look like a Spark job, but that's what it looks like. As if the SqlDwWriter somehow submits a new job without an ExecutionStart... maybe it's an RDD operation instead? This has given me another idea to add some more log.info statements to my jar 😅😬

Michael Robinson (michael.robinson@astronomer.io)
2022-02-28 14:00:23

One of our own will be talking OpenLineage, Airflow and Spark at the Subsurface Conference this week: @Michael Collado’s session on March 3rd at 11:45. You can register and learn more here: https://www.dremio.com/subsurface/live/winter2022/

🎉 Willy Lulciuc, Maciej Obuchowski
🙌 Will Johnson, Ziyoiddin Yusupov, Julien Le Dem
👍 Ziyoiddin Yusupov
Willy Lulciuc (willy@datakin.com)
2022-02-28 14:00:56

*Thread Reply:* You won’t want to miss this talk!

Martin Fiser (fisa@keboola.com)
2022-02-28 15:06:43

I have a question about DataHub integration through the OpenLineage standard. Is anyone working on it, or was it rather just an icon used in previous materials? We have built an OpenLineage API endpoint in our product, and we were hoping OL will gain enough traction that it becomes a native way to connect to a variety of data discovery/observability tools, such as DataHub, Amundsen, etc.

Many thanks!

John Thomas (john@datakin.com)
2022-02-28 15:29:58

*Thread Reply:* hi Martin - when you talk about a DataHub integration, do you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one to kick off the discussion around it.

If you mean sending information to DataHub, that should already be possible if users pass a datahub api endpoint to the OPENLINEAGE_ENDPOINT variable

Martin Fiser (fisa@keboola.com)
2022-02-28 16:29:54

*Thread Reply:* Hi, thanks for the reply! I meant emitting the OpenLineage JSON structure to DataHub.

Could you please be more specific, possibly link an article on how to find the endpoint on the DataHub side? Many thanks!

John Thomas (john@datakin.com)
2022-02-28 17:15:31

*Thread Reply:* ooooh, sorry I misread - I thought you meant that DataHub had built an endpoint. Your integration should emit OpenLineage events to an endpoint, but DataHub would likely have to build that support into their product? I'm not sure how to go about it

John Thomas (john@datakin.com)
2022-02-28 17:16:27

*Thread Reply:* I'd reach out to datahub, potentially?

Martin Fiser (fisa@keboola.com)
2022-02-28 17:21:51

*Thread Reply:* i see. ok, will do!

Julien Le Dem (julien@apache.org)
2022-03-02 18:15:21

*Thread Reply:* It has been discussed in the past but I don’t think there is something yet. The Kafka transport PR that is in flight should facilitate this

Martin Fiser (fisa@keboola.com)
2022-03-02 18:33:45

*Thread Reply:* Thanks for the response! Though dragging Kafka in just for the data delivery bit is too much. I think the clearest way would be to push DataHub to make an API endpoint and parser for the OL /lineage data structure.

I see this is more a political thing that would require a joint effort of the DataHub team and OpenLineage with a common goal.

Michael Robinson (michael.robinson@astronomer.io)
2022-02-28 17:22:47

Is there a topic you think the community should discuss at the next OpenLineage TSC meeting? Reply or DM with your item, and we’ll add it to the agenda. Mark your calendars: the next TSC meeting is Wednesday, March 9 at 9 am PT on zoom.

Michael Robinson (michael.robinson@astronomer.io)
2022-03-02 10:24:58

The next OpenLineage Technical Steering Committee meeting is Wednesday, March 9! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09
All are welcome.
Agenda:
• New committers
• Release overview (0.6.0)
• New process for blog posts
• Retrospective: Spark integration
Notes: https://tinyurl.com/openlineagetsc

Michael Collado (collado.mike@gmail.com)
2022-03-02 14:29:33

FYI, there's a talk on OpenLineage at Subsurface live tomorrow - https://www.dremio.com/subsurface/live/winter2022/session/cross-platform-data-lineage-with-openlineage/

🙌 Maciej Obuchowski, John Thomas, Paweł Leszczyński, Francis McGregor-Macdonald
👍 Ziyoiddin Yusupov, Michael Robinson, Jac.
Michael Robinson (michael.robinson@astronomer.io)
2022-03-04 15:25:20

@channel The latest release (0.6.0) of OpenLineage is now available, featuring a new Dagster integration, updates to the Airflow and Java integrations, a generic facet for env properties, bug fixes, and more. For more info, visit https://github.com/OpenLineage/OpenLineage/releases/tag/0.6.0

🙌 Conor Beverland, Dalin Kim, Ziyoiddin Yusupov, Luca Soato
👍 Julien Le Dem
👀 William Angel, Francis McGregor-Macdonald
Marco Diaz (mdiaz@roblox.com)
2022-03-07 14:06:19

Hello Guys,

Where do I find an example of building a custom extractor? We have several custom airflow operators that I need to integrate

John Thomas (john@datakin.com)
2022-03-07 14:56:58

*Thread Reply:* Hi Marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.

all the included extractors are here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

Marco Diaz (mdiaz@roblox.com)
2022-03-07 15:07:41

*Thread Reply:* Thanks. I can follow that to build my own. Also, I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. From this example it seems I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?
OPENLINEAGE_EXTRACTOR_<operator>=full.path.to.ExtractorClass
Also, do I need anything else other than Marquez and openlineage-airflow?

Ross Turk (ross@datakin.com)
2022-03-07 15:30:45

*Thread Reply:* Yes, as long as the extractors are in the python path.

Ross Turk (ross@datakin.com)
2022-03-07 15:31:59

*Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.

Marco Diaz (mdiaz@roblox.com)
2022-03-07 15:32:51

*Thread Reply:* That will be great help. Thanks

Ross Turk (ross@datakin.com)
2022-03-08 20:38:27

*Thread Reply:* This is the one I wrote:

Ross Turk (ross@datakin.com)
2022-03-08 20:39:30

*Thread Reply:* to make it work, I set this environment variable:

OPENLINEAGE_EXTRACTOR_HttpToBigQueryOperator=http_to_bigquery.HttpToBigQueryExtractor

Ross Turk (ross@datakin.com)
2022-03-08 20:40:57

*Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218
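Since the file itself isn't shown here, the skeleton looks roughly like this (a trimmed-down, hypothetical sketch - the class and operator names come from the env var above):
```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class HttpToBigQueryExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # must match the operator's class name exactly
        return ['HttpToBigQueryOperator']

    def extract(self) -> Optional[TaskMetadata]:
        # inspect self.operator here and build the input/output datasets
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],
            outputs=[],
        )
```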

Michael Robinson (michael.robinson@astronomer.io)
2022-03-07 15:16:37

@channel At the next OpenLineage TSC meeting, we’ll be reminiscing about the Spark integration. If you’ve had a hand in OL support for Spark, please join and share! The meeting will start at 9 am PT on Wednesday this week. @Maciej Obuchowski @Oleksandr Dvornik @Willy Lulciuc @Michael Collado https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

👍 Ross Turk, Maciej Obuchowski
Marco Diaz (mdiaz@roblox.com)
2022-03-07 18:44:26

Would Marquez create some lineage for operators that don't have a custom extractor built yet?

✅ Fuming Shih
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-08 12:05:25

*Thread Reply:* You would see that the job was run - but we couldn't extract dataset lineage from it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-08 12:05:49

*Thread Reply:* The good news is that we're working to solve this problem in general.

Marco Diaz (mdiaz@roblox.com)
2022-03-08 12:15:52

*Thread Reply:* I see, so I definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the Postgres extractor you have built.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-08 12:50:00

*Thread Reply:* That depends on how you deploy Airflow. Our tests set it via the environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34

Marco Diaz (mdiaz@roblox.com)
2022-03-08 13:19:37

*Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.

Michael Robinson (michael.robinson@astronomer.io)
2022-03-08 11:47:11

This month’s OpenLineage TSC community meeting is tomorrow at 9am PT! It’s not too late to add an item to the agenda. Reply here or msg me with yours. https://openlineage.slack.com/archives/C01CK9T7HKR/p1646234698326859

👍 Ross Turk
Marco Diaz (mdiaz@roblox.com)
2022-03-09 19:31:23

I am running the last command to install Marquez in AWS:
helm upgrade --install marquez . --set marquez.db.host <AWS-RDS-HOST> --set marquez.db.user <AWS-RDS-USERNAME> --set marquez.db.password <AWS-RDS-PASSWORD> --namespace marquez --atomic --wait
And I am receiving this error:
Error: query: failed to query with labels: secrets is forbidden: User "xxx@xxx.xx" cannot list resource "secrets" in API group "" in the namespace "default"

Julien Le Dem (julien@apache.org)
2022-03-10 12:46:18

*Thread Reply:* Do you need to specify a namespace that is not « default »?

Marco Diaz (mdiaz@roblox.com)
2022-03-09 19:31:48

Can anyone let me know what is happening? My DI guy said it is a chart issue

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-10 07:40:13

*Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂

👀 Kevin Mellott
Marco Diaz (mdiaz@roblox.com)
2022-03-10 14:09:26

*Thread Reply:* Ok so I had to update a chart dependency

Marco Diaz (mdiaz@roblox.com)
2022-03-10 14:10:39

*Thread Reply:* Now I installed the service in Amazon using this:
helm install marquez . --dependency-update --set marquez.db.host=myhost --set marquez.db.user=myuser --set marquez.db.password=mypassword --namespace marquez --atomic --wait

Marco Diaz (mdiaz@roblox.com)
2022-03-10 14:11:31

*Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually

Marco Diaz (mdiaz@roblox.com)
2022-03-10 14:12:27

*Thread Reply:* however I cannot fetch initial data when logging into the endpoint

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-10 14:52:06

*Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?

http://localhost:5000/api/v1/namespaces

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:13:16

*Thread Reply:* i got this:
{"namespaces":[{"name":"default","createdAt":"2022-03-10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:13:34

*Thread Reply:* i have to use the namespace marquez to redirect there:
kubectl port-forward svc/marquez 5000:80 -n marquez

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:13:48

*Thread Reply:* is there something i need to change in a config file?

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:14:39

*Thread Reply:* also how would i change the "localhost" address to something that is accessible in amazon without the need to redirect?

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:14:59

*Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-10 15:39:23

*Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I’ll create an issue for that one so that we can fix it.

As to the EKS installation (or any non-local install), this is where you would need to use what’s called an ingress controller to expose the services outside of the Kubernetes cluster. There are different flavors of these (NGINX is popular), and I believe that AWS EKS has some built-in capabilities that might help as well.

https://www.eksworkshop.com/beginner/130_exposing-service/ingress/

Marco Diaz (mdiaz@roblox.com)
2022-03-10 15:40:50

*Thread Reply:* So how do i fix this issue?

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-10 15:46:56

*Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It’s not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.

However, if you are just seeking to explore Marquez and try things out, then I would highly recommend the “Open in Gitpod” functionality at https://github.com/MarquezProject/marquez#try-it. That will perform a full deployment for you in a temporary environment very quickly.

Marco Diaz (mdiaz@roblox.com)
2022-03-10 16:02:05

*Thread Reply:* i need to use it in aws for a POC

Marco Diaz (mdiaz@roblox.com)
2022-03-10 19:15:08

*Thread Reply:* Is there a better guide on how to install and set up Marquez in AWS? This guide omits many steps: https://marquezproject.github.io/marquez/running-on-aws.html

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-10 12:35:37

We're trying to find the best way to track upstream releases of projects we have integrations for, to support newer versions faster and with fewer bugs. If you have any opinions on this topic, please chime in here:

https://github.com/OpenLineage/OpenLineage/issues/602

Marco Diaz (mdiaz@roblox.com)
2022-03-11 13:34:30

@Kevin Mellott Hello Kevin, I followed the tutorial you sent me and I have exposed my services. However, I am still seeing the same errors (this comes from the api/v1/namespaces call):
{"namespaces":[{"name":"default","createdAt":"2022-03-10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}

Marco Diaz (mdiaz@roblox.com)
2022-03-11 13:35:08

Is there something I need to change in the chart? I do not have access to the default namespace in Kubernetes, only the marquez namespace

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 13:56:27

@Marco Diaz that is actually a good response! This is the JSON returned back by the API to show some of the default Marquez data created by the install. Is there another error you are experiencing?

Marco Diaz (mdiaz@roblox.com)
2022-03-11 13:59:28
Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:00:09

*Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 14:00:23

*Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 14:00:52

*Thread Reply:* Or are you working through the local deploy right now?

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:01:57

*Thread Reply:* It shows the same using the exposed service

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:02:09

*Thread Reply:* i just didnt do another screenshot

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:02:27

*Thread Reply:* Could it be communication with the DB?

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 14:04:37

*Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:14:48

*Thread Reply:* i see this error:
Error occured while trying to proxy to: xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:16:00

*Thread Reply:* it seems to be trying to use the same address to access the api endpoint

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:16:26

*Thread Reply:* however the api service is in a different endpoint

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:18:24

*Thread Reply:* The API resides here: Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:19:13

*Thread Reply:* The web service resides here: xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:19:25

*Thread Reply:* do they both need to be under the same LB?

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:19:56

*Thread Reply:* How would I do that if they install as separate services?

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 14:27:15

*Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.

Here is an example from one of the AWS repos - in the ingress resource you can see the single rule setup to point traffic to a given service.

https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/examples/2048/2048_full.yaml

Marco Diaz (mdiaz@roblox.com)
2022-03-11 14:36:40

*Thread Reply:* Thanks for the help. Now I know what the issue is

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-11 14:51:34

*Thread Reply:* Great to hear!!

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-16 00:55:36

👋 Hi everyone! Our company is looking to adopt a data lineage tool, so I have a few queries on OpenLineage:

  1. Is this completely free?
  2. What are the databases it supports?
Ross Turk (ross@datakin.com)
2022-03-16 10:29:06

*Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.

Ross Turk (ross@datakin.com)
2022-03-16 10:29:15

*Thread Reply:* It supports the databases listed here: https://openlineage.io/integration

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-16 08:27:20

And when I run ./docker/up.sh --seed I got the result from the Java code (sample example). But how do I get the same thing in a Python example?

Ross Turk (ross@datakin.com)
2022-03-16 10:29:53

*Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-16 12:45:14

*Thread Reply:* yup

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-16 13:10:04

*Thread Reply:* how to run

Ross Turk (ross@datakin.com)
2022-03-16 23:08:31

*Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/

Ross Turk (ross@datakin.com)
2022-03-16 23:08:51

*Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs

Ross Turk (ross@datakin.com)
2022-03-16 23:09:45

*Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python
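A minimal example of emitting an event with it, assuming Marquez at localhost:5000 (the job and namespace names are made up):
```
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid.uuid4()))
job = Job(namespace="my-namespace", name="my-job")
now = datetime.now(timezone.utc).isoformat()

# a START followed by a COMPLETE makes a minimal run visible in Marquez
client.emit(RunEvent(RunState.START, now, run, job, producer="my-producer"))
client.emit(RunEvent(RunState.COMPLETE, now, run, job, producer="my-producer"))
```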

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-17 00:05:58

*Thread Reply:* Thank you

Ross Turk (ross@datakin.com)
2022-03-19 00:00:32

*Thread Reply:* You are welcome 🙂

Mirko Raca (racamirko@gmail.com)
2022-04-19 09:28:50

*Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.

Is there any work in that direction, is the current client code considered mature and just needs re-packaging, or is it just a thought sketch and some serious work is needed?

I'm trying to avoid re-inventing the wheel, so if there's already something in motion, I'd rather support that than start (badly) from scratch.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 09:32:17

*Thread Reply:* What do you mean without pip-package?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 09:32:18
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 09:35:08

*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka https://github.com/OpenLineage/OpenLineage/pull/530

Mirko Raca (racamirko@gmail.com)
2022-04-19 09:40:11

*Thread Reply:* My apologies Maciej! In my defense - searching for "open lineage" on PyPI doesn't show this in the first 20 results. Still, I should have checked setup.py. My bad, and thank you for the pointer!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 10:00:49

*Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉

Mirko Raca (racamirko@gmail.com)
2022-04-20 08:12:29

*Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂

Michael Robinson (michael.robinson@astronomer.io)
2022-03-16 12:00:01

@Julien Le Dem and @Willy Lulciuc will be at Data Council Austin next week talking OpenLineage and Airflow https://www.datacouncil.ai/talks/data-lineage-with-apache-airflow-using-openlineage?hsLang=en

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-16 12:50:20

For the sample lineage flow (etldelivery7_days) created when we ran the seed command, I couldn't figure out which file it's fetching data from

John Thomas (john@datakin.com)
2022-03-16 14:35:14

*Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-03-17 00:06:53

*Thread Reply:* Got it, but if I change the code in this Java file (let's say I add another job here satisfying the syntax), it's not appearing in the lineage flow

Marco Diaz (mdiaz@roblox.com)
2022-03-22 18:18:22

@Kevin Mellott Hello Kevin, sorry to bother you again. I was finally able to configure Marquez in AWS using an ALB. Now I am receiving this error when calling the API

Marco Diaz (mdiaz@roblox.com)
2022-03-22 18:18:32

Is this an issue accessing the database?

Marco Diaz (mdiaz@roblox.com)
2022-03-22 18:19:15

I created the database and host manually and passed the parameters using helm --set

Marco Diaz (mdiaz@roblox.com)
2022-03-22 18:19:33

Do the database services need to be exposed too through the ALB?

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-23 10:20:47

*Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.

I know that EKS needs to have connectivity established to the Postgres database, even in the case of RDS, so that could be the culprit.

Marco Diaz (mdiaz@roblox.com)
2022-03-23 16:09:09

*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs:
[HPM] Proxy created: /api/v1 -> http://localhost:5000/
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to http://localhost:5000/ (ECONNREFUSED) (https://nodejs.org/api/errors.html#errors_common_system_errors)

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-23 16:22:13

*Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.

marquez.hostname=marquez-interface-test.di.rbx.com

Kevin Mellott (kevin.r.mellott@gmail.com)
2022-03-23 16:22:54

*Thread Reply:* assuming that is the DNS used by the website

Marco Diaz (mdiaz@roblox.com)
2022-03-23 16:48:53

*Thread Reply:* thanks, that did it. I have a question regarding the database

Marco Diaz (mdiaz@roblox.com)
2022-03-23 16:50:01

*Thread Reply:* I made my own database manually. Should the Marquez tables be created automatically when installing Marquez?

Marco Diaz (mdiaz@roblox.com)
2022-03-23 16:56:10

*Thread Reply:* Also, could you put both the API and interface on the same port (3000)?

Marco Diaz (mdiaz@roblox.com)
2022-03-23 17:21:58

*Thread Reply:* Seems I am still having the forwarding issue:
[HPM] Proxy created: /api/v1 -> http://marquez-interface-test.di.rbx.com:5000/
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to http://marquez-interface-test.di.rbx.com:5000/ (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)

Will Johnson (will@willj.co)
2022-03-23 09:08:14

Guidance on How / When a Spark SQL Execution event Controls JobStart Events?

@Maciej Obuchowski and @Paweł Leszczyński and @Michael Collado I'd really appreciate your thoughts on how / when JobStart events are triggered for a given execution. I've run into two situations now where a SQLExecutionStart event fires with execution id X and then JobStart events fire with execution id Y.

• Spark 2 Delta SaveIntoDataSourceCommand on Databricks - I see it has a SparkSQLExecutionStart event, but only on Spark 3 does it have JobStart events with the SaveIntoDataSourceCommand and the same execution id.
• Databricks Synapse Connector - a SparkSQLExecutionStart event occurs, but then the job starts have different execution ids.
Is there any guidance / books / videos that dive deeper into how these events are triggered?

We need the JobStart event with the same execution id so that we can get some environment properties stored in the job start event.

Thank you so much for any guidance!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-23 09:25:18

*Thread Reply:* It's always Delta, isn't it?

When I originally worked on Delta support I tried to find answer on Delta slack and got an answer:

Hi Maciej, the main reason is that Delta will run queries on metadata to figure out what files should be read for a particular version of a Delta table and that's why you might see multiple jobs. In general Delta treats metadata as data and leverages Spark to handle them to make it scalable.

🤣 Will Johnson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-23 09:25:48

*Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.

Will Johnson (will@willj.co)
2022-03-23 09:46:14

*Thread Reply:* Argh!! It's always Databricks doing something 🙄

Thanks, Maciej!

Will Johnson (will@willj.co)
2022-03-23 09:51:59

*Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-23 09:58:08

*Thread Reply:* Before that, we were using just JobStart/JobEnd events, and I couldn't find events that correspond to a logical plan that has anything to do with what the job was actually doing. I just found out that SQLExecution events have what I want, so I started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how the filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-23 09:59:37

*Thread Reply:* Are you trying to get environment info from those events, or do you actually get a Job event with proper logical plans like SaveIntoDataSourceCommand?

Might be worth just posting here all the events + logical plans that are generated for the particular job, as I've done in that issue

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-23 09:59:40

*Thread Reply:*
```
scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 3
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 4
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerJobStart - executionId: 4
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:47 WARN SparkSQLExecutionContext: SparkListenerJobEnd - executionId: 4
21/11/09 19:01:47 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:47 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 4
21/11/09 19:01:47 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.LocalRelation
21/11/09 19:01:48 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 5
21/11/09 19:01:48 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:48 WARN SparkSQLExecutionContext: SparkListenerJobStart - executionId: 5
21/11/09 19:01:48 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerJobEnd - executionId: 5
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 5
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.Aggregate
21/11/09 19:01:49 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionEnd - executionId: 3
21/11/09 19:01:49 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
```

Will Johnson (will@willj.co)
2022-03-23 11:41:37

*Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.

As far as we know, the SQLExecutionStart event does not have any way to get these properties :(

https://github.com/OpenLineage/OpenLineage/blob/21b039b78bdcb5fb2e6c2489c4de840ebb[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

As a result, I do have to care about the subsequent JobStart events coming from a given ExecutionStart 😢

Will Johnson (will@willj.co)
2022-03-23 11:42:33

*Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.

https://github.com/OpenLineage/OpenLineage/issues/617

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-24 09:43:38

Hey. I'm working on replacing the current SQL parser - which we rely on for Postgres, Snowflake, and Great Expectations - and I'd appreciate your opinion.

https://github.com/OpenLineage/OpenLineage/pull/627/files

Marco Diaz (mdiaz@roblox.com)
2022-03-25 19:30:29

Am I supposed to see this when I open Marquez for the first time on an empty database?

John Thomas (john@datakin.com)
2022-03-25 20:33:02

*Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui

👍 Marco Diaz
Ross Turk (ross@datakin.com)
2022-03-25 21:44:54

*Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.

Marco Diaz (mdiaz@roblox.com)
2022-03-25 19:31:09

Would datasets be created when I send data from airflow?

Willy Lulciuc (willy@datakin.com)
2022-03-31 18:34:40

*Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run

Willy Lulciuc (willy@datakin.com)
2022-03-31 18:35:47

*Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929

Marco Diaz (mdiaz@roblox.com)
2022-03-28 14:31:32

How is Datakin used in conjunction with Openlineage and Marquez?

John Thomas (john@datakin.com)
2022-03-28 15:43:46

*Thread Reply:* Hi Marco,

Datakin is a reporting tool built on the Marquez API, and therefore designed to take in Lineage using the OpenLineage specification.

Did you have a more specific question?

Marco Diaz (mdiaz@roblox.com)
2022-03-28 15:47:53

*Thread Reply:* No, that is it. Got it. So, i can install Datakin and still use openlineage and marquez?

John Thomas (john@datakin.com)
2022-03-28 15:55:07

*Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez

Marco Diaz (mdiaz@roblox.com)
2022-03-28 16:10:25

*Thread Reply:* Will I still be able to use facets for backfills?

John Thomas (john@datakin.com)
2022-03-28 17:04:03

*Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API

Marco Diaz (mdiaz@roblox.com)
2022-03-28 16:52:41

Another question: I installed the openlineage library and now I am trying to configure Airflow 2 to use it. Do I follow these steps?

Marco Diaz (mdiaz@roblox.com)
2022-03-28 16:53:20

If I have Marquez access via an ALB ingress, which variable would I use: MARQUEZ_URL or OPENLINEAGE_URL?

Marco Diaz (mdiaz@roblox.com)
2022-03-28 16:54:53

So, I don't need to modify my DAGs in Airflow 2 to use the library? Would setting openlineage.lineage_backend.OpenLineageBackend just allow me to start collecting data?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-29 06:24:21

*Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+
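The setup is just environment configuration - a sketch with illustrative values (normally you'd set these in the scheduler's environment rather than in Python):
```
import os

os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
os.environ["OPENLINEAGE_URL"] = "http://my-marquez-host:5000"  # where events are sent
os.environ["OPENLINEAGE_NAMESPACE"] = "my_namespace"           # optional, defaults to "default"
```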

Marco Diaz (mdiaz@roblox.com)
2022-03-29 17:47:39

*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables. I have Marquez running in AWS with an ingress. Do I use OPENLINEAGE_URL or MARQUEZ_URL?

Marco Diaz (mdiaz@roblox.com)
2022-03-29 17:48:09

*Thread Reply:* Also would a new namespace be created if i add the variable?

data_fool (data.fool.me@gmail.com)
2022-03-29 02:12:30

Hello! Are there any plans for OpenLineage to support dbt on Trino?

John Thomas (john@datakin.com)
2022-03-30 14:59:13

*Thread Reply:* Hi Datafool - I'm not familiar with how Trino works, but the dbt-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file

These things don't necessarily preclude you from using OpenLineage on trino, so it may work already.
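Roughly, what dbt-ol run does after the dbt run finishes is read dbt's artifact files and turn each model's inputs/outputs into OpenLineage events - a simplified sketch (dbt's default artifact paths, with field access trimmed down):
```
import json

# dbt writes these artifacts into ./target by default
with open("target/run_results.json") as f:
    run_results = json.load(f)
with open("target/manifest.json") as f:
    manifest = json.load(f)

for result in run_results["results"]:
    node_id = result["unique_id"]          # e.g. "model.my_project.my_model"
    node = manifest["nodes"][node_id]
    upstream = manifest["parent_map"].get(node_id, [])
    print(node["name"], "<-", [u.split(".")[-1] for u in upstream])
```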

data_fool (data.fool.me@gmail.com)
2022-03-30 18:34:38

*Thread Reply:* hey @John Thomas yep, I tried to use the dbt-ol run command but it seems Trino is not supported, only BigQuery, Redshift and a few others.

John Thomas (john@datakin.com)
2022-03-30 18:36:41

*Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.

We don't currently have plans for this, but a great first step would be opening an issue in the OpenLineage repo.

If you're interested in implementing the support yourself I'm also happy to connect you to people that can help you get started.

data_fool (data.fool.me@gmail.com)
2022-03-30 20:23:46

*Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas

Francis McGregor-Macdonald (francis@mc-mac.com)
2022-03-30 16:08:39

I can see 2 articles using Spline, with BMW and Capital One. Could OpenLineage be doing the same job as Spline here? What would the differences be? Are there any similar references for OpenLineage? I can see Northwestern Mutual, but that article does not contain a lot of detail.

Marco Diaz (mdiaz@roblox.com)
2022-03-31 12:47:59

Could anyone help me with this custom extractor? I am not sure what I am doing wrong. I added the variable to Airflow 2, but I still see this in the logs:
[2022-03-31, 16:43:39 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=QueryOperator
Here is the code:

```
import logging
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import SqlJobFacet
from openlineage.common.sql import SqlMeta, SqlParser

logger = logging.getLogger(__name__)


class QueryOperatorExtractor(BaseExtractor):

    def __init__(self, operator):
        super().__init__(operator)

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['QueryOperator']

    def extract(self) -> Optional[TaskMetadata]:
        # (1) Parse sql statement to obtain input / output tables.
        sql_meta: SqlMeta = SqlParser.parse(self.operator.hql)
        inputs = sql_meta.in_tables
        outputs = sql_meta.out_tables
        task_name = f"{self.operator.dag_id}.{self.operator.task_id}"
        run_facets = {}
        job_facets = {
            'hql': SqlJobFacet(self.operator.hql)
        }

        return TaskMetadata(
            name=task_name,
            # in_tables/out_tables are lists, so map over them rather than
            # wrapping the whole list in a single-element list
            inputs=[t.to_openlineage_dataset() for t in inputs],
            outputs=[t.to_openlineage_dataset() for t in outputs],
            run_facets=run_facets,
            job_facets=job_facets,
        )
```
Orbit
2022-03-31 13:20:55

@Orbit has joined the channel

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:07:24

@Ross Turk Could you please take a look if you have a minute☝️? I know you have built one extractor before

Ross Turk (ross@datakin.com)
2022-03-31 14:11:35

*Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?

Ross Turk (ross@datakin.com)
2022-03-31 14:11:57

*Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:33:00

*Thread Reply:* It is in an EKS cluster. I checked and the variable is there:
OPENLINEAGE_EXTRACTOR_QUERYOPERATOR=shared.plugins.ol_custom_extractors.QueryOperatorExtractor

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:33:56

*Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well

Ross Turk (ross@datakin.com)
2022-03-31 14:40:17

*Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here: https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a9b874b/integration/airflow/openlineage/lineage_backend/__init__.py#L77

Ross Turk (ross@datakin.com)
2022-03-31 14:40:45

*Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:46:36

*Thread Reply:* Thanks

Ross Turk (ross@datakin.com)
2022-03-31 14:47:19

*Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.

Ross Turk (ross@datakin.com)
2022-03-31 14:48:20

*Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22 (i.e., everything after the 22-character OPENLINEAGE_EXTRACTOR_ prefix). if that happens, but for some reason it fails to register the extractor, we can at least know that it's importing

Ross Turk (ross@datakin.com)
2022-03-31 14:48:54

*Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:49:23

*Thread Reply:* will try that

Marco Diaz (mdiaz@roblox.com)
2022-03-31 14:49:29

*Thread Reply:* and let you know

Ross Turk (ross@datakin.com)
2022-03-31 14:49:39

*Thread Reply:* ok!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-31 15:04:05

*Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?

👍 Ross Turk
Marco Diaz (mdiaz@roblox.com)
2022-03-31 15:13:37

*Thread Reply:* Will try that too

Marco Diaz (mdiaz@roblox.com)
2022-03-31 15:13:44

*Thread Reply:* Thanks for helping

Marco Diaz (mdiaz@roblox.com)
2022-03-31 16:58:24

*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercase characters. Is the name of the variable used to register the extractor?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-31 17:15:57

*Thread Reply:* yes, it's case sensitive...

Marco Diaz (mdiaz@roblox.com)
2022-03-31 17:18:42

*Thread Reply:* i see

Marco Diaz (mdiaz@roblox.com)
2022-03-31 17:39:16

*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it's being registered

Marco Diaz (mdiaz@roblox.com)
2022-03-31 17:39:44

*Thread Reply:* Could there be a way to make this not case sensitive?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-03-31 18:31:27

*Thread Reply:* yes - could you create issue on OpenLineage repository?

Marco Diaz (mdiaz@roblox.com)
2022-04-01 10:46:59

*Thread Reply:* sure

Marco Diaz (mdiaz@roblox.com)
2022-04-01 10:48:28

I have another question. I have this query:
INSERT OVERWRITE TABLE schema.daily_play_sessions_v2
PARTITION (ds = '2022-03-30')
SELECT
    platform_id,
    universe_id,
    pii_userid,
    NULL as session_id,
    NULL as session_start_ts,
    COUNT(1) AS session_cnt,
    SUM(UNIX_TIMESTAMP(stopped) - UNIX_TIMESTAMP(joined)) AS time_spent_sec
FROM schema.fct_play_sessions_merged
WHERE ds = '2022-03-30'
  AND UNIX_TIMESTAMP(stopped) - UNIX_TIMESTAMP(joined) BETWEEN 0 AND 28800
GROUP BY
    platform_id,
    universe_id,
    pii_userid
And I am seeing the following inputs:
[DbTableName(None,'schema','fct_play_sessions_merged','schema.fct_play_sessions_merged')]
But the outputs are empty. Shouldn't schema.daily_play_sessions_v2 be an output table?

Ross Turk (ross@datakin.com)
2022-04-01 13:25:52

*Thread Reply:* Yes, it should. This line is the likely culprit: https://github.com/OpenLineage/OpenLineage/blob/431251d25f03302991905df2dc24357823d9c9c3/integration/common/openlineage/common/sql/parser.py#L30

Ross Turk (ross@datakin.com)
2022-04-01 13:26:25

*Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
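i.e., something like this toy version of the heuristic (a sketch, not the real parser code):
```
from typing import List, Optional

def find_output_table(tokens: List[str]) -> Optional[str]:
    markers = {"INTO", "OVERWRITE"}  # adding OVERWRITE fixes the miss
    for i, tok in enumerate(tokens):
        if tok.upper() in markers:
            rest = tokens[i + 1:]
            if rest and rest[0].upper() == "TABLE":  # skip optional TABLE keyword
                rest = rest[1:]
            return rest[0] if rest else None
    return None

print(find_output_table("INSERT OVERWRITE TABLE schema.daily_play_sessions_v2 PARTITION (ds = '2022-03-30')".split()))
# -> schema.daily_play_sessions_v2
```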

Ross Turk (ross@datakin.com)
2022-04-01 13:27:23

*Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-01 13:30:36

*Thread Reply:* we have a better solution

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-01 13:30:37

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644

Ross Turk (ross@datakin.com)
2022-04-01 13:31:27

*Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-01 13:31:30

*Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs

👍 Ross Turk, Paweł Leszczyński
Ross Turk (ross@datakin.com)
2022-04-01 13:31:33

*Thread Reply:* let me review this PR

Marco Diaz (mdiaz@roblox.com)
2022-04-01 13:36:32

*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library?

Marco Diaz (mdiaz@roblox.com)
2022-04-01 13:36:41

*Thread Reply:* If so which version?

Ross Turk (ross@datakin.com)
2022-04-01 13:37:22

*Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch

Marco Diaz (mdiaz@roblox.com)
2022-04-01 13:38:17

*Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?

Ross Turk (ross@datakin.com)
2022-04-01 13:39:50

*Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.

Marco Diaz (mdiaz@roblox.com)
2022-04-01 13:40:34

*Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work

Ross Turk (ross@datakin.com)
2022-04-01 13:40:36

*Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?

👍 Michael Robinson
Ross Turk (ross@datakin.com)
2022-04-01 13:40:41

*Thread Reply:* 👍

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 12:25:48

Hi everyone, I just started using OpenLineage to connect with dbt for my company; I work in data engineering. After the connection and a test of dbt-ol run, it gives me this error. I have looked online but couldn't find the answer anywhere. Can somebody please help me with it? The error tells me that the correct version is dbt schema.json version 2 instead of 3. I don't know where to change the schema.json version. Thank you everyone @channel

Ross Turk (ross@datakin.com)
2022-04-01 13:34:10

*Thread Reply:* Hm - what version of dbt are you using?

Ross Turk (ross@datakin.com)
2022-04-01 13:47:50

*Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0

Ross Turk (ross@datakin.com)
2022-04-01 13:48:27

*Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 13:52:46

*Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 14:20:00

*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt couldn't work. But version 0.20.0 works for this problem.

Ross Turk (ross@datakin.com)
2022-04-01 14:22:42

*Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397

Ross Turk (ross@datakin.com)
2022-04-01 14:25:17

*Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?

Ross Turk (ross@datakin.com)
2022-04-01 14:26:26

*Thread Reply:* I wonder if you have somehow ended up with an older version of the integration

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 14:33:43

*Thread Reply:* it is 0.1.0

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 14:34:23

*Thread Reply:* is it 0.1.0 the older version of openlineage ?

Ross Turk (ross@datakin.com)
2022-04-01 14:43:14

*Thread Reply:* ❯ pip3 list | grep openlineage-dbt
openlineage-dbt 0.6.2

Ross Turk (ross@datakin.com)
2022-04-01 14:43:26

*Thread Reply:* the latest is 0.6.2 - that might be your issue

Ross Turk (ross@datakin.com)
2022-04-01 14:43:59

*Thread Reply:* How are you going about installing it?

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 18:35:26

*Thread Reply:* @Ross Turk. I followed the instruction from OpenLineage: "pip3 install openlineage-dbt"

Ross Turk (ross@datakin.com)
2022-04-01 18:36:00

*Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 18:51:36

*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-01 18:53:07

*Thread Reply:* But thanks for the version number. I reinstalled 0.6.2 by specifying the version

👍 Ross Turk
Marco Diaz (mdiaz@roblox.com)
2022-04-02 17:37:59

@Ross Turk @Maciej Obuchowski FYI the SQL parser also seems not to return any inputs or outputs for queries that have subqueries. Examples:
INSERT OVERWRITE TABLE mytable PARTITION (ds = '2022-03-31') SELECT * FROM (SELECT * FROM table2) a
INSERT OVERWRITE TABLE mytable PARTITION (ds = '2022-03-31') SELECT * FROM (SELECT * FROM table2 UNION SELECT * FROM table3 UNION ALL SELECT * FROM table4) a

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-03 15:07:09

*Thread Reply:* they'll work with new parser - added test for those

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-03 15:07:39

*Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!

Marco Diaz (mdiaz@roblox.com)
2022-04-03 15:20:55

*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the Open lineage code as i build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-03 15:55:28

*Thread Reply:* it should be a week or two, unless anything comes up

Marco Diaz (mdiaz@roblox.com)
2022-04-03 17:10:02

*Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.

Marco Diaz (mdiaz@roblox.com)
2022-04-02 20:27:37

Also, what would happen if someone uses a CTE in the SQL? Is the parser taking those cases into consideration?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-03 15:02:13

*Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)

Michael Robinson (michael.robinson@astronomer.io)
2022-04-04 10:54:47

Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on Wednesday, April 13. Please reply here or ping me with your items!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-04 11:11:53

*Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser

🙌 Will Johnson, Ross Turk
Marco Diaz (mdiaz@roblox.com)
2022-04-04 13:25:17

*Thread Reply:* Will the parser be released after the 13?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-08 11:47:05

*Thread Reply:* @Michael Robinson added additional item to Agenda - client transports feature that we'll have in next release

🙌 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2022-04-08 12:56:44

*Thread Reply:* Thanks, Maciej

Sukanya Patra (Sukanya_Patra@mckinsey.com)
2022-04-05 02:39:59

Hi Everyone,

I have come across OpenLineage at Data Council Austin, 2022 and am curious to try it out. I have reviewed the Getting Started section (https://openlineage.io/getting-started/) of the OpenLineage docs but couldn't find clear reference documentation for using the API.
• Are there any Swagger API docs or equivalent dedicated to the OpenLineage API? There are some reference docs for the Marquez API: https://marquezproject.github.io/marquez/openapi.html#tag/Lineage
• Secondly, are there any means to use OpenLineage independent of Marquez?
Any pointers would be appreciated.

Patrick Mol (patrick.mol@prolin.com)
2022-04-05 10:28:08

*Thread Reply:* I had kind of the same question. I found https://marquezproject.github.io/marquez/openapi.html#tag/Lineage With some of the entries marked Deprecated, I am not sure how to proceed.

John Thomas (john@datakin.com)
2022-04-05 11:55:35

*Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?

John Thomas (john@datakin.com)
2022-04-05 15:33:23

*Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.

The GET methods for jobs/datasets/etc are still functional

Sarat Chandra (saratchandra9494@gmail.com)
2022-04-05 21:10:39

*Thread Reply:* Hey John,

Thanks for sharing the OpenAPI docs. I was wondering if there is any way to set up an OpenLineage API that will receive events without a consumer like Marquez, or is it essential to always pair with a consumer to receive the events?

John Thomas (john@datakin.com)
2022-04-05 21:47:13

*Thread Reply:* the OpenLineage integrations don't have any way to receive events, since they're designed to send events to other apps - what were you expecting OpenLineage to do?

Marquez is our reference implementation of an OpenLineage consumer, but Egeria also has a functional endpoint

Patrick Mol (patrick.mol@prolin.com)
2022-04-06 09:53:31

*Thread Reply:* Hi @John Thomas, would the creation of Sources and Datasets have an equivalent in the OpenLineage specification? So far I only see the Inputs and Outputs in the Run Event spec.

John Thomas (john@datakin.com)
2022-04-06 11:31:10

*Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent

Marco Diaz (mdiaz@roblox.com)
2022-04-05 14:24:50

Hey Guys,

The BaseExtractor is working fine with operators that are derived from the Airflow BaseOperator. However, for operators derived from LivyOperator the BaseExtractor does not seem to work. Is there a fix for this? We use LivyOperator to run Spark jobs

John Thomas (john@datakin.com)
2022-04-05 15:16:34

*Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?

@Maciej Obuchowski might be more help here

Marco Diaz (mdiaz@roblox.com)
2022-04-05 15:21:03

*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc

Marco Diaz (mdiaz@roblox.com)
2022-04-05 15:25:42

*Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem

John Thomas (john@datakin.com)
2022-04-05 15:32:13

*Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.

Ross Turk (ross@datakin.com)
2022-04-06 15:49:48

*Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?

Marco Diaz (mdiaz@roblox.com)
2022-04-06 15:51:57

*Thread Reply:* I have not tried that library yet. I need to see how it's implemented, because we have several custom Spark operators that use Livy

Marco Diaz (mdiaz@roblox.com)
2022-04-06 15:52:59

*Thread Reply:* Do you have any examples?

Ross Turk (ross@datakin.com)
2022-04-06 15:54:01

*Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/

Ross Turk (ross@datakin.com)
2022-04-06 15:54:37

*Thread Reply:* and the doc page here has a good overview: https://openlineage.io/integration/apache-spark/

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:38:15

*Thread Reply:* is this all we need to pass?
```
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage_spark:0.2.+" \
  --conf "spark.openlineage.host=http://<your_ol_endpoint>" \
  --conf "spark.openlineage.namespace=my_job_namespace" \
  --class com.mycompany.MySparkApp my_application.jar
```

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:38:49

*Thread Reply:* If so, yes: our operators have a way to pass configurations to Spark, and we may be able to implement it.

Michael Collado (collado.mike@gmail.com)
2022-04-06 16:41:27

*Thread Reply:* Looks right to me

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:42:03

*Thread Reply:* Will give it a try

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:42:50

*Thread Reply:* Do we have to install the library on the spark side or the airflow side?

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:42:58

*Thread Reply:* I assume is the spark side

Michael Collado (collado.mike@gmail.com)
2022-04-06 16:44:25

*Thread Reply:* The --packages argument tells Spark where to get the jar (you'll want to upgrade to 0.6.1)

Marco Diaz (mdiaz@roblox.com)
2022-04-06 16:44:54

*Thread Reply:* sounds good
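For readers setting this up from code rather than spark-submit, the same settings can be passed on the session builder. A rough PySpark sketch; the host and namespace are placeholders, and the package coordinate assumes the 0.6.1 release mentioned above:
```python
# Sketch only; assumes the cluster can resolve the package from Maven Central
# and that a lineage consumer is listening at the given host.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MySparkApp")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://my-marquez-host:5000")  # placeholder
    .config("spark.openlineage.namespace", "my_job_namespace")        # placeholder
    .getOrCreate()
)
```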

Varun Singh (varuntestaz@outlook.com)
2022-04-06 00:04:14

Hi, I saw there was some work done for integrating OpenLineage with Azure Purview

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-06 04:54:27

*Thread Reply:* @Will Johnson

Will Johnson (will@willj.co)
2022-04-07 12:43:27

*Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!

Michael Robinson (michael.robinson@astronomer.io)
2022-04-06 15:05:39

The next OpenLineage Technical Steering Committee meeting is Wednesday, April 13! Meetings are on the second Wednesday of each month from 9:00 to 10:00am PT. Join us on Zoom: https://astronomer.zoom.us/j/87156607114?pwd=a3B0K210dnRaQmdkaFdGMytBREZEQT09 All are welcome.
Agenda:
• OpenLineage 0.6.2 release overview
• Airflow integration update
• Dagster integration retrospective
• Open discussion
Notes: https://tinyurl.com/openlineagetsc

slackbot
2022-04-06 21:40:16

This message was deleted.

Marco Diaz (mdiaz@roblox.com)
2022-04-07 01:00:43

*Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?

Jorge Reyes (Zenta Group) (jorge.reyes@zentagroup.com)
2022-04-07 09:04:19

*Thread Reply:* yes Marco

Marco Diaz (mdiaz@roblox.com)
2022-04-07 15:00:18

*Thread Reply:* can you open marquez on <http://localhost:3000>

Marco Diaz (mdiaz@roblox.com)
2022-04-07 15:00:40

*Thread Reply:* and get a response from <http://localhost:5000/api/v1/namespaces>

Jorge Reyes (Zenta Group) (jorge.reyes@zentagroup.com)
2022-04-07 15:26:41

*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to Marquez correctly

Marco Diaz (mdiaz@roblox.com)
2022-04-07 22:17:34

*Thread Reply:* In theory you should receive events in jobs under the airflow namespace

Tien Nguyen (tiennguyenhotel97@gmail.com)
2022-04-07 14:18:05

Hi everyone, can someone please help me debug this error? Thank you very much, all

John Thomas (john@datakin.com)
2022-04-07 14:59:06

*Thread Reply:* It looks like you need to add a payment method to your DBT account

Tyler Farris (tyler@kickstand.work)
2022-04-11 12:46:41

Hello. Does Airflow's TaskFlow API work with OpenLineage?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-11 12:50:48

*Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.

👍 Howard Yoo
Tyler Farris (tyler@kickstand.work)
2022-04-11 12:58:28

*Thread Reply:* Thanks for the quick reply Maciej.

sandeep (sandeepgame07@gmail.com)
2022-04-12 09:56:44

Hi all, I watched a few of your demos with Airflow (Astronomer) recently and really liked them. Thanks for doing those!

Questions:

  1. Are there plans to have a Hive listener similar to the OpenLineage Spark integration?
  2. If not, will the SQL parser work with HiveQL?
  3. Maybe one for Presto too?
  4. Will the run version and dataset version come out of the box, or do we need to define some facets?
  5. I read the blog on facets; is there a tutorial on how to create a sample facet?
Background: We have Hive and Spark jobs and BigQuery tasks running from Airflow in GCP Dataproc
John Thomas (john@datakin.com)
2022-04-12 13:56:53

*Thread Reply:* Hi Sandeep,

1&3: We don't currently have Hive or Presto on the roadmap! The best way to start the conversation around them would be to create a proposal in the OpenLineage repo, outlining your thoughts on implementation and benefits.

2: I'm not familiar enough with HiveQL, but you can read about the new SQL parser we're implementing here

4: You can see the Standard Facets here - Dataset Version is included out of the box, but Run Version would have to be defined.

5: The best place to start looking into making facets is the Spec doc here. We don't have a dedicated tutorial, but if you have more specific questions please feel free to reach out again on Slack

👍 sandeep
sandeep (sandeepgame07@gmail.com)
2022-04-12 15:39:23

*Thread Reply:* Thank you, John. The standard facets link currently points to the GitHub issues page

sandeep (sandeepgame07@gmail.com)
2022-04-12 15:41:01

*Thread Reply:* Will check it out thank you
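On the custom-facet question above, the spec mainly requires that a facet carry _producer and _schemaURL. One possible sketch with the Python client's attrs-based facet base class; the facet name, field, and usage are invented for illustration:
```python
# Hypothetical custom run facet; "team" is an invented field, not a standard facet.
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class TeamRunFacet(BaseFacet):
    team: str = attr.ib()

# Attached under a custom key in a run's facets dict, e.g.:
# Run(runId=..., facets={"team": TeamRunFacet(team="data-platform")})
```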

Michael Robinson (michael.robinson@astronomer.io)
2022-04-12 10:37:58

Reminder: this month’s OpenLineage TSC meeting is tomorrow, 4/13, at 9 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1649271939878419

sandeep (sandeepgame07@gmail.com)
2022-04-12 15:43:29

I set up the OpenLineage Spark integration for Spark (Dataproc) tasks from Airflow. It's able to post data to the Marquez endpoint, and I see the job information in the Marquez UI.

I don't see any dataset information, though; I see just the jobs. Is there some setup I need to do or something else I need to configure?

John Thomas (john@datakin.com)
2022-04-12 16:08:30

*Thread Reply:* is there anything in your marquez-api logs that might indicate issues?

What guide did you follow to setup the spark integration?

sandeep (sandeepgame07@gmail.com)
2022-04-12 16:10:07

*Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach

sandeep (sandeepgame07@gmail.com)
2022-04-12 16:11:04

*Thread Reply:* The logs from the Dataproc side show no errors; let me check from the Marquez API side. To confirm: we should be able to see the datasets in the Marquez UI with the Spark integration, right?

John Thomas (john@datakin.com)
2022-04-12 16:11:50

*Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here

sandeep (sandeepgame07@gmail.com)
2022-04-12 16:14:44

*Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-12 16:40:38

*Thread Reply:* Are you looking at the same namespace?

sandeep (sandeepgame07@gmail.com)
2022-04-12 16:40:51

*Thread Reply:* Yes, the same one where I can see the job

sandeep (sandeepgame07@gmail.com)
2022-04-12 16:54:49

*Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here

sandeep (sandeepgame07@gmail.com)
2022-04-12 17:01:10

*Thread Reply:* Don't see any failures in the logs. Any suggestions on how to debug this?

John Thomas (john@datakin.com)
2022-04-12 17:08:24

*Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically

sandeep (sandeepgame07@gmail.com)
2022-04-12 17:14:43

*Thread Reply:* ok, that sounds good, will try that
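A smoke test along those lines can be very small; a sketch, assuming a notebook where spark already has the OpenLineage listener configured and the paths are throwaway placeholders:
```python
# Writing a Parquet dataset and reading it back into a second write should
# produce runEvents whose "inputs" and "outputs" are both populated.
df = spark.range(100).withColumnRenamed("id", "n")
df.write.mode("overwrite").parquet("/tmp/ol_smoke/source")

spark.read.parquet("/tmp/ol_smoke/source") \
    .selectExpr("n * 2 AS doubled") \
    .write.mode("overwrite").parquet("/tmp/ol_smoke/sink")
```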

sandeep (sandeepgame07@gmail.com)
2022-04-12 17:16:06

*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the API https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post We don't seem to add a Dataset in this; does Marquez internally create the "dataset" based on Output and its fields?

John Thomas (john@datakin.com)
2022-04-12 17:16:34

*Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from

John Thomas (john@datakin.com)
2022-04-12 17:17:00

*Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however

sandeep (sandeepgame07@gmail.com)
2022-04-12 17:19:16

*Thread Reply:* By runEvents, do you mean a Job object or a Lineage object? The integration seems to be only POSTing lineage objects

John Thomas (john@datakin.com)
2022-04-12 17:20:34

*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:

https://openlineage.io/docs/openapi/

👍 sandeep
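Put differently, nothing ever POSTs a bare dataset: the datasets ride along inside each runEvent. A hand-rolled sketch of such a POST; the endpoint, names, and paths are placeholders:
```python
# Sketch only; assumes Marquez (or another OpenLineage consumer) at localhost:5000.
from datetime import datetime, timezone
from uuid import uuid4

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "spark_integration", "name": "my_spark_job"},
    "inputs": [{"namespace": "gcs", "name": "bucket/raw/events"}],       # placeholder
    "outputs": [{"namespace": "gcs", "name": "bucket/curated/events"}],  # placeholder
    "producer": "https://example.com/hand-rolled",
}
requests.post("http://localhost:5000/api/v1/lineage", json=event).raise_for_status()
```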
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-12 17:41:01

*Thread Reply:* > Yes, the same one where I can see the job
I think you should look at the other namespaces, whose names depend on which systems you're actually using

sandeep (sandeepgame07@gmail.com)
2022-04-12 17:48:24

*Thread Reply:* Shouldn't the dataset be created in the same namespace we define in the Spark properties?

sandeep (sandeepgame07@gmail.com)
2022-04-15 10:19:06

*Thread Reply:* I found a few datasets in the table location. I ran it in a setup (Hive metastore, GCS, Spark SQL and Scala Spark jobs) similar to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519

sandeep (sandeepgame07@gmail.com)
2022-04-12 15:49:46

Is this the correct place for this question, or should I reach out on the Marquez Slack? I followed this post https://openlineage.io/integration/apache-spark/

Will Johnson (will@willj.co)
2022-04-14 16:16:45

Before I create an issue around it, maybe I'm just not seeing it in Databricks. In the Spark integration, does OpenLineage report Hive Metastore tables, or does it ONLY report the file path?

For example, if I have a Hive table called default.myTable stored at LOCATION /usr/hive/warehouse/default/mytable.

For a query that reads a CSV file and inserts into default.myTable, would I see an output of default.myTable or /usr/hive/warehouse/default/mytable?

We want to include a link between the physical path and the hive metastore table but it seems that OpenLineage (at least on Databricks) only reports the physical path with the table name showing up in the catalog but not as a facet.

sandeep (sandeepgame07@gmail.com)
2022-04-15 10:17:55

*Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. Looking forward to understanding the expected behavior

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-15 10:39:34

*Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435

👍 Howard Yoo
Will Johnson (will@willj.co)
2022-04-15 12:36:08

*Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-06-10 12:37:41

*Thread Reply:* Is there a timeline for when we can expect this fix?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-10 12:46:47

*Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-06-10 13:10:31

*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.

Michael Robinson (michael.robinson@astronomer.io)
2022-04-15 15:42:22

@channel A significant number of us have a conflict with the current TSC meeting day/time, so, unfortunately, we need to reschedule the meeting. When you have a moment, please share your availability here: https://doodle.com/meeting/participate/id/ejRnMlPe. Thanks in advance for your input!

Arturo (ggrmos@gmail.com)
2022-04-19 13:35:23

Hello everyone, I'm learning OpenLineage. I finally achieved the connection between Airflow 2+ and OpenLineage+Marquez. The issue is that I don't see anything in Marquez. Do I need to modify the current Airflow operators?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 13:40:54

*Thread Reply:* You probably need to change dataset from default

Arturo (ggrmos@gmail.com)
2022-04-19 13:47:04

*Thread Reply:* I clicked on everything 😕 I manually (joining the pod and sending curl to the local Marquez endpoint) created a namespace to check if there was a network issue; that was OK. I created a namespace called data-dev. Airflow is mounted over k8s using the helm chart (note: the env var names below had their underscores eaten by Slack formatting; restored here to the standard Airflow names).
```
config:
  AIRFLOW__WEBSERVER__BASE_URL: "http://airflow.dev.test.io"
  PYTHONPATH: "/opt/airflow/dags/repo/config"
  AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
  AIRFLOW__CORE__PLUGINS_FOLDER: "/opt/airflow/dags/repo/plugins"
  AIRFLOW__LINEAGE__BACKEND: "openlineage.lineage_backend.OpenLineageBackend"

. . . .

extraEnv:
  - name: OPENLINEAGE_URL
    value: http://marquez-dev.data-dev.svc.cluster.local
  - name: OPENLINEAGE_NAMESPACE
    value: data-dev
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-19 15:16:47

*Thread Reply:* I think the answer is somewhere in the Airflow logs 🙂 For some reason, OpenLineage events aren't sent to Marquez.

Arturo (ggrmos@gmail.com)
2022-04-20 11:08:09

*Thread Reply:* Thanks, in the end it was my error .. I created a dummy DAG to see if maybe it was an issue with the DAG, and now I can see something in Marquez
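For anyone else verifying a setup like this, a throwaway DAG is enough to light up Marquez. A sketch, assuming Airflow 2 with the lineage backend configured as in the config above; the dag_id and command are arbitrary:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Any trivial task will do: the OpenLineage lineage backend emits a runEvent
# per task instance, so one successful run should show up as a job in Marquez.
with DAG(
    dag_id="openlineage_smoke_test",
    start_date=datetime(2022, 4, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(task_id="noop", bash_command="echo hello")
```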

Mirko Raca (racamirko@gmail.com)
2022-04-20 08:15:32

One really novice question - there doesn't seem to be a way of deleting lineage elements (any of them)? While I can imagine that in production system we want to keep history, it's not practical while testing/developing. I'm using throw-away namespaces to step around the issue. Is there a better way, or alternatively - did I miss an API somewhere?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-20 08:20:35

*Thread Reply:* That's more of a Marquez question 🙂 We have a long-standing issue to add that API https://github.com/MarquezProject/marquez/issues/1736

Mirko Raca (racamirko@gmail.com)
2022-04-20 09:32:19

*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; for now I just need to showcase it and get interest in my org.

Dan Mahoney (dan.mahoney@sphericalanalytics.io)
2022-04-20 10:08:33

Good day all. I'm trying out the openlineage-dagster plugin. • I've got dagit, dagster-daemon and Marquez running locally • The openlineage_sensor is recognized in dagit and the daemon. But, when I run a job, I see the following message in the daemon's shell: Sensor openlineage_sensor skipped: Last cursor: {"last_storage_id": 9, "running_pipelines": {"97e2efdf-9499-4ffd-8528-d7fea5b9362c": {"running_steps": {}, "repository_name": "hello_cereal_repository"}}} I've attached my repos.py and serialjob.py. Any thoughts?

David (drobin1437@gmail.com)
2022-04-20 10:40:03

Hi All, I am walking through the curl examples on this page and have a question about the first curl example: https://openlineage.io/getting-started/ The curl command completes, and I can see the input file and job in the namespace, but the lineage graph does not show the input file connected as an input to the job; the connection only seems to appear after the job is marked complete.

Is there a way to have a running job show connections to its input files in the lineage? Thanks!

raghanag (raghanag@gmail.com)
2022-04-20 18:06:29

Hi Team, we are using Spark as a service, and we are planning to integrate the OpenLineage Spark listener. Looking at the params below that we need to pass: we don't know the name of the Spark cluster, so is the spark.openlineage.namespace conf param mandatory?
```
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage_spark:0.2.+" \
  --conf "spark.openlineage.host=http://<your_ol_endpoint>" \
  --conf "spark.openlineage.namespace=my_job_namespace" \
  --class com.mycompany.MySparkApp my_application.jar
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-20 18:11:19

*Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-20 18:11:42

*Thread Reply:* And I definitely recommend to use newer version than 0.2.+ 🙂

raghanag (raghanag@gmail.com)
2022-04-20 18:13:32

*Thread Reply:* oh, I see that someone mentioned it has to be replaced with the name of the Spark cluster

raghanag (raghanag@gmail.com)
2022-04-20 18:13:57

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR

raghanag (raghanag@gmail.com)
2022-04-20 18:19:19

*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" dependency as part of the Spark jar itself, that is, as part of the pom.xml?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 03:54:25

*Thread Reply:* I think it needs to run on the driver

Mirko Raca (racamirko@gmail.com)
2022-04-21 05:53:34

Hello, when looking through the Marquez API it seems that most individual-element creation APIs are marked as deprecated and are going to be removed by 0.25, with a pointer to switch to OpenLineage. That makes POST to /api/v1/lineage the only creation point for elements, but the OpenLineage API is very limited in the attributes that can be passed.

Is that intended to stay that way? One practical question/example: how do we create a job of type STREAMING, when the OL API only allows passing name, namespace and facets? Do we now move all properties into facets?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 07:16:44

*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. Can you specify where you think it's limited? The way to solve those problems would be to evolve OpenLineage.

> One practical question/example: how do we create a job of type STREAMING, So, here I think the question is more how streaming jobs differ from batch jobs. One obvious difference is that the output of the job is continuous (in practice, probably "microbatched" or committed on checkpoint). However, the deprecated Marquez API didn't give us tools to properly indicate that. On the contrary, OpenLineage with different event types allows us to properly do that. > Do we now move all properties into facets? Basically, yes. Marquez should handle specific facets. For example, https://github.com/MarquezProject/marquez/pull/1847

Mirko Raca (racamirko@gmail.com)
2022-04-21 07:23:11

*Thread Reply:* Hey Maciej

first off - thanks for being active on the channel!

> So, here I think the question is more how streaming jobs differ from batch jobs
Not really. I just gave an example of how you would express a specific job-type creation, which can be done with PUT /api/v1/namespaces/{namespace}/jobs/{job} (https://marquezproject.github.io/marquez/openapi.html#tag/Jobs/paths/~1namespaces~1{namespace}~1jobs~1{job}/put), by passing the type field, which is required. In the call to /api/v1/lineage the job field offers just (namespace, name), but no other attributes.

> However, deprecated Marquez API didn't give us tools to properly indicate that. On the contrary, OpenLineage with different event types allows us to properly do that. I have the feeling I'm still missing some key concepts on how OpenLineage is designed. I think I went over the API and documentation, but trying to use just OpenLineage failed to reproduce mildly complex chain-of-job scenarios, and when I took a look at how the Marquez seed demo does it, it was heavily based on the deprecated API. So, I'm kinda lost on how to use OpenLineage.

I'm looking forward to an open public meeting, as I don't think asking these long questions in chat really works. 😞 Any pointers are welcome!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 07:53:59

*Thread Reply:* > I just gave an example of how would you express a specific job type creation Yes, but you're trying to achieve something by passing this parameter or creating a job in a certain way. We're trying to cover everything in OpenLineage API. Even if we don't have everything, the spec from the beginning is focused to allow emitting custom data by custom facet mechanism.

> I have the feeling I'm still missing some key concepts on how OpenLineage is designed. This talk by @Julien Le Dem is a great place to start: https://www.youtube.com/watch?v=HEJFCQLwdtk

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 11:29:20

*Thread Reply:* > Any pointers are welcome! BTW: OpenLineage is an open standard. Everyone is welcome to contribute and discuss. Every feedback ultimately helps us build better systems.

Mirko Raca (racamirko@gmail.com)
2022-04-22 03:32:48

*Thread Reply:* I agree, but for now I'm more likely to be in the I didn't get it category, and not in the brilliant new idea category 🙂

My temporary goal is to go over the documentation and write up the gaps that confused me (and the solutions), and maybe publish that as an article for a wider audience. So far I have realized that:
• I didn't get the naming convention - it became clearer that it's important with the Naming examples, but more info is needed
• I misinterpreted the namespaces. I was placing datasources and jobs in the same namespace, which caused a lot of issues until I started using different ones. Not sure why... So now I'm interpreting namespace=source as suggested by the naming convention
• The JSON schema actually clarified things a lot, but it's not the most reader-friendly of resources, so surely there should be a better one
• I was questioning whether to move away from Marquez completely and go with DataHub, but for my scenario Marquez (with limitations outstanding) is still most suitable
• Marquez for some reason does not tolerate datetimes missing the 'T' delimiter in the ISO format, which caused a lot of trial-and-error because the message is just "JSON parsing failed"
• Marquez doesn't give you (at least by default) meaningful OpenLineage parsing errors, so running examples against it is a very slow learning process

Karatuğ Ozan BİRCAN (karatugo@gmail.com)
2022-04-21 10:20:55

Hi everyone,

I'm running the Spark Listener on Databricks. It works fine for the event emit part for a basic Databricks SQL Create Table query. Nevertheless, it throws a NullPointerException after sending lineage successfully.

I tried to debug a bit. Looks like it's thrown at the line: QueryExecution queryExecution = SQLExecution.getQueryExecution(executionId); So, does this mean that the listener can't get the query exec from Spark SQL execution?

Please see the logs in the thread. Thanks.

Karatuğ Ozan BİRCAN (karatugo@gmail.com)
2022-04-21 10:21:33

*Thread Reply:* Driver logs from Databricks:

```22/04/21 14:05:07 INFO EventEmitter: Lineage completed successfully: ResponseMessage(responseCode=200, body={}, error=null) {"eventType":"COMPLETE",[...], "schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}

22/04/21 14:05:07 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception java.lang.NullPointerException at io.openlineage.spark.agent.lifecycle.ContextFactory.createSparkSQLExecutionContext(ContextFactory.java:43) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$getSparkSQLExecutionContext$8(OpenLineageSparkListener.java:221) at java.util.HashMap.computeIfAbsent(HashMap.java:1127) at java.util.Collections$SynchronizedMap.computeIfAbsent(Collections.java:2674) at io.openlineage.spark.agent.OpenLineageSparkListener.getSparkSQLExecutionContext(OpenLineageSparkListener.java:220) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:143) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:135) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1588) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 11:32:37

*Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609

Karatuğ Ozan BİRCAN (karatugo@gmail.com)
2022-04-21 11:33:15

*Thread Reply:* Spark 3.1.2 with Scala 2.12

Karatuğ Ozan BİRCAN (karatugo@gmail.com)
2022-04-21 11:33:50

*Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.

Vinith Krishnan US (vinithk@nvidia.com)
2022-05-20 16:15:47

*Thread Reply:* Has this been resolved? I am facing the same issue with spark 3.2.

Ben (ben@meridian.sh)
2022-04-21 11:51:33

Does anyone have thoughts on the difference between the sourceCode and sql job facets - and whether we’d expect to ever see both on a particular job?

John Thomas (john@datakin.com)
2022-04-21 15:34:24

*Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a PythonOperator that's executing SQL queries, depending on how the extractor was written

Ben (ben@meridian.sh)
2022-04-21 15:34:45

*Thread Reply:* ah sure, that makes sense
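For what it's worth, expressed as raw JSON the two facets would sit side by side under the job's facets object. The sketch below is illustrative only: the query and source values are invented, and the exact field names and schema URLs are my assumptions and should be checked against the facet spec:
```python
# Illustrative job "facets" payload carrying both sql and sourceCode.
job_facets = {
    "sql": {
        "_producer": "https://example.com/my-producer",  # placeholder
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json",
        "query": "SELECT id, name FROM users",           # invented
    },
    "sourceCode": {
        "_producer": "https://example.com/my-producer",  # placeholder
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SourceCodeJobFacet.json",
        "language": "python",
        "sourceCode": "df = get_pandas_df('SELECT id, name FROM users')",  # invented
    },
}
```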

Xiaoyong Zhu (xiaoyzhu@outlook.com)
2022-04-21 15:14:03

Just got to know OpenLineage, and it's really a great project! One question about granularity with Spark + OpenLineage: is it possible to track column-level lineage (rather than the table-level lineage that's currently there)? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 16:17:59

*Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645

Xiaoyong Zhu (xiaoyzhu@outlook.com)
2022-04-21 16:24:16

*Thread Reply:* nice -thanks!

Xiaoyong Zhu (xiaoyzhu@outlook.com)
2022-04-21 16:25:19

*Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-04-21 17:44:46

*Thread Reply:* No, it should be automatic.

Will Johnson (will@willj.co)
2022-04-24 14:37:33

Hey, team - We are starting to get requests for other, non-Microsoft data sources (e.g. Teradata) for the Spark integration. We (I) don't have a lot of bandwidth to fill every request, but I DO want to help people new to OpenLineage get started.

Has anyone on the team written up a blog post about extending OpenLineage, or is this an area we could collaborate on for the OpenLineage blog? Alternatively, is it a bad idea to write this down, since the internals have changed a few times over the past six months?

Mirko Raca (racamirko@gmail.com)
2022-04-25 03:52:20

*Thread Reply:* Hey Will,

while I would not consider myself in the team, I'm dabbling in OL, hitting walls and learning as I go. If I don't have enough experience to contribute, I'd be happy to at least proof-read and point out things which are not clear from a novice perspective. Let me know!

👍 Will Johnson
Will Johnson (will@willj.co)
2022-04-25 13:49:48

*Thread Reply:* I'll hold you to that @Mirko Raca 😉

Ross Turk (ross@datakin.com)
2022-04-25 17:18:02

*Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.

Will Johnson (will@willj.co)
2022-04-25 17:56:44

*Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.

Ross Turk (ross@datakin.com)
2022-04-25 18:00:26

*Thread Reply:* the most recent one was an astronomer webinar

happy to share the slides with you if you want 👍 here’s a PDF:

🙌 Will Johnson
Ross Turk (ross@datakin.com)
2022-04-25 18:00:44

*Thread Reply:* the other ones have not been public, unfortunately 😕

Ross Turk (ross@datakin.com)
2022-04-25 18:02:24

*Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO

Will Johnson (will@willj.co)
2022-04-26 09:14:42

*Thread Reply:* Thank you so much, Ross! This is a great base to work from.

Michael Robinson (michael.robinson@astronomer.io)
2022-04-26 14:49:04

Your periodic reminder that GitHub stars are one of those trivial things that make a significant difference for an OSS project like ours. Have you starred us yet?

👍 Ross Turk
raghanag (raghanag@gmail.com)
2022-04-26 15:02:10

Hi All, I have a simple Spark job converting CSV to Parquet, and I am using https://openlineage.io/integration/apache-spark/ to generate lineage events and post them to Marquez, but I see that both events (START & COMPLETE) are identical except for eventType. I thought we should see the outputs array in the COMPLETE event, right?

Will Johnson (will@willj.co)
2022-04-27 00:36:05

*Thread Reply:* For a spark job like that, you'd have at least four events:

  1. START event - This represents the SparkSQLExecutionStart
  2. START event #2 - This represents a JobStart event
  3. COMPLETE event - This represents a JobEnd event
  4. COMPLETE event #2 - This represents a SparkSQLExecutionEnd event

For CSV to Parquet, you should be seeing inputs and outputs that match across each event. OpenLineage scans the logical plan and reports back the inputs / outputs / metadata across the different facets for each event BECAUSE each event might give you some different information.

For example, the JobStart event might give you access to properties that weren't there before. The JobEnd event might give you information about how many rows were written.

Marquez / OpenLineage expects that you collect all of the resulting events and then aggregate the results.
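A consumer-side sketch of that aggregation idea: collect every event sharing a runId and union the datasets, since each of the four events may carry a different slice of the metadata. The function and field handling here are illustrative, not Marquez's actual implementation:
```python
# Union inputs/outputs across all events of one run, keyed by (namespace, name).
def aggregate_run(events):
    inputs, outputs = {}, {}
    for event in events:
        for ds in event.get("inputs") or []:
            inputs[(ds["namespace"], ds["name"])] = ds
        for ds in event.get("outputs") or []:
            outputs[(ds["namespace"], ds["name"])] = ds
    return {"inputs": list(inputs.values()), "outputs": list(outputs.values())}
```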

raghanag (raghanag@gmail.com)
2022-04-27 21:51:07

*Thread Reply:* Hi @Will Johnson, good evening. We are seeing an issue while using the Spark integration: when we give the openlineage.host property a value like <http://lineage.com/common/marquez> where my Marquez API is running, the line below modifies the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/EventEmitter.java#L49 I see that it was added 5 months ago and released as part of 0.4.0. Is there any way we can fix the line to be like below?
```
this.lineageURI =
    new URI(
        hostURI.getScheme(),
        hostURI.getAuthority(),
        hostURI.getPath() + uriPath,
        queryParams,
        null);
```

Will Johnson (will@willj.co)
2022-04-28 14:31:42

*Thread Reply:* Can you open up a GitHub issue for this? I had this same issue, and so our implementation always has to feature the /api/v1/lineage path. The host config is literally just the host, but you're specifying a host plus a path. I'd be happy to see greater flexibility with the API endpoint, but the /v1/ is important for knowing which version of the OpenLineage specification you're communicating with.

Arturo (ggrmos@gmail.com)
2022-04-27 14:12:38

Hi all ... does anyone have an example of a custom extractor with a different source and destination? I'm trying to build an extractor for a custom operator like mysql_to_s3

Ross Turk (ross@datakin.com)
2022-04-27 15:10:24

*Thread Reply:* @Michael Collado made one for a recent webinar:

https://gist.github.com/collado-mike/d1854958b7b1672f5a494933f80b8b58

Ross Turk (ross@datakin.com)
2022-04-27 15:11:38

*Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets

Arturo (ggrmos@gmail.com)
2022-04-27 15:51:32

*Thread Reply:* Thanks! I'm going to take a look

Michael Robinson (michael.robinson@astronomer.io)
2022-04-27 23:04:18

A release has been requested by @Howard Yoo and @Ross Turk pending the merging of PR 644. Are there any +1s?

👍 Julien Le Dem, Maciej Obuchowski, Ross Turk, Conor Beverland
Michael Robinson (michael.robinson@astronomer.io)
2022-04-28 17:44:00

*Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!

raghanag (raghanag@gmail.com)
2022-04-28 14:29:13

Hi All, we are seeing the below exception when we integrate openlineage-spark into our Spark job; can anyone share pointers?
```
Exception uncaught: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.SerializationConfig.hasExplicitTimeZone()Z
  at openlineage.jackson.datatype.jsr310.ser.InstantSerializerBase.formatValue(InstantSerializerBase.java:144)
  at openlineage.jackson.datatype.jsr310.ser.InstantSerializerBase.serialize(InstantSerializerBase.java:103)
  at openlineage.jackson.datatype.jsr310.ser.ZonedDateTimeSerializer.serialize(ZonedDateTimeSerializer.java:79)
  at openlineage.jackson.datatype.jsr310.ser.ZonedDateTimeSerializer.serialize(ZonedDateTimeSerializer.java:13)
  at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:727)
  at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
  at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)
  at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
  at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
  at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3906)
  at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3220)
  at io.openlineage.spark.agent.client.OpenLineageClient.executeAsync(OpenLineageClient.java:123)
  at io.openlineage.spark.agent.client.OpenLineageClient.executeSync(OpenLineageClient.java:85)
  at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:80)
  at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:75)
  at io.openlineage.spark.agent.client.OpenLineageClient.post(OpenLineageClient.java:70)
  at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:67)
  at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:69)
  at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:90)
  at java.util.Optional.ifPresent(Optional.java:159)
  at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:90)
  at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:81)
  at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
  at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
```

John Thomas (john@datakin.com)
2022-04-28 14:41:10

*Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle

raghanag (raghanag@gmail.com)
2022-04-28 14:47:27

*Thread Reply:* nothing unusual in the Spark job, it's just a simple CSV-to-Parquet conversion

John Thomas (john@datakin.com)
2022-04-28 14:48:50

*Thread Reply:* ah, yeah, that's probably it - when the job finishes before the OpenLineage integration can poll it for information, this error is thrown. Since the job is very quick, it creates a race condition

:gratitude_thank_you: raghanag
raghanag (raghanag@gmail.com)
2022-05-03 17:16:39

*Thread Reply:* @John Thomas may I know how to solve this kind of issue?

John Thomas (john@datakin.com)
2022-05-03 17:20:11

*Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar

raghanag (raghanag@gmail.com)
2022-05-03 17:21:17

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 you mean that if we add a sleep in this method, it will solve this?

John Thomas (john@datakin.com)
2022-05-03 18:44:43

*Thread Reply:* oh no I meant making sure your jobs don't close too quickly

raghanag (raghanag@gmail.com)
2022-05-06 00:14:15

*Thread Reply:* Hi @John Thomas, we figured out the error: it was indeed caused by conflicting dependency versions. With shadowJar and shading, we are not seeing it anymore.

Michael Robinson (michael.robinson@astronomer.io)
2022-04-29 18:40:41

@channel The latest release (0.8.1) of OpenLineage is now available, featuring a new TaskInstance listener API for Airflow 2.3+, an HTTP client in the openlineage-java library for emitting run events, support for HiveTableRelation as an input source in the Spark integration, a new SQL parser used by multiple integrations, and bug fixes. For more info, visit https://github.com/OpenLineage/OpenLineage/releases/tag/0.8.1

🚀 Willy Lulciuc, John Thomas, Minkyu Park, Ross Turk, Marco Diaz, Conor Beverland, Kevin Mellott, Howard Yoo, Peter Hicks, Maciej Obuchowski, Mario Measic
🙌 Francis McGregor-Macdonald, Ross Turk, Marco Diaz, Peter Hicks
Willy Lulciuc (willy@datakin.com)
2022-04-29 18:41:37

*Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:

👍 Ross Turk, Howard Yoo, Peter Hicks
🙌 Ross Turk, Howard Yoo, Peter Hicks, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-04-30 07:54:48

The May meeting of the TSC will be postponed because most of the TSC will be attending the Astronomer Spring Summit the week of May 9th. Details to follow along with a new meeting day/time for the meeting going forward (thanks to all who responded to the poll!).

Hubert Dulay (hubert.dulay@gmail.com)
2022-05-01 09:25:23

Are there examples of using OpenLineage with streaming data pipelines? Thanks

Mirko Raca (racamirko@gmail.com)
2022-05-03 04:12:09

*Thread Reply:* Hi @Hubert Dulay,

while I'm not an expert, I can offer the following:
• Marquez has had the but what I got here - that API is not encouraged
• I personally don't find the run->job metaphor to work nicely with streaming transformation, but I'm using that in my current setup (until someone points me in a better direction 😉 )
• I register each change of the stream processing as a new "run", which ends immediately - so duration information is lost, but the current set of parameters is recorded. It's not pretty, I know.
Maybe stream processing is a scenario to be re-evaluated in OL meetings, or at least clarified?

Hubert Dulay (hubert.dulay@gmail.com)
2022-05-03 21:19:06

*Thread Reply:* Thanks for the details
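The workaround described above, where each change to the stream job becomes a run that starts and immediately completes, reduces to something like this with the Python client; client, job, and producer are assumed to be set up as in the earlier sketches:
```python
# Sketch of the "each change = one short-lived run" pattern for streaming jobs.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client.run import Run, RunEvent, RunState

def record_stream_update(client, job, producer):
    run = Run(runId=str(uuid4()))
    now = datetime.now(timezone.utc).isoformat()
    # START and COMPLETE back to back: duration is lost, parameters are kept.
    for state in (RunState.START, RunState.COMPLETE):
        client.emit(RunEvent(eventType=state, eventTime=now, run=run,
                             job=job, producer=producer))
```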

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 09:32:23

Hey OL! My company is in the process of migrating off of Palantir and into Databricks/Azure. There are a couple of business units not wanting to budge due to the built-in data lineage and code reference features Palantir has. I am tasked with researching an alternative data lineage solution, and I quickly came across OL. I love what I have read and seen in demos so far and want to do a POC of its capabilities for my org. I was able to set up the Marquez server on a VM and get it talking to Databricks. I also have the init script installed on the cluster, and I can see from the log4j logs that it's communicating fine (I think). However, I am embarrassed to admit I can't figure out how the instrumentation works for the Databricks notebooks. I ran a simple notebook that loads data, runs a simple transform, and saves the output somewhere, but I don't see any entries in the namespace I configured. I am sure I missed something very obvious somewhere, but are there examples of how to get a simple example into Marquez from Databricks? Thanks so much for any guidance you can give!

John Thomas (john@datakin.com)
2022-05-02 13:26:52

*Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 14:58:29

*Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close, but my cluster is unable to talk to the Marquez server. Looking at log4j I see the following rows:

22/05/02 18:43:39 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener 22/05/02 18:43:40 INFO EventEmitter: Init OpenLineageContext: Args: ArgumentParser(host=<http://135.170.226.91:8400>, version=v1, namespace=gus-namespace, jobName=default, parentRunId=null, apiKey=Optional.empty, urlParams=Optional[{}]) URI: <http://135.170.226.91:8400/api/v1/lineage>? 22/05/02 18:46:21 ERROR EventEmitter: Could not emit lineage [responseCode=0]: {"eventType":"START","eventTime":"2022-05-02T18:44:08.36Z","run":{"runId":"91fd4e13-52ac-4175-8956-c06d7dee97fc","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"eaa0543b_5e04_4f5b_844b_0e4598f019a7"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) ... 
OpenLineageHttpException(code=0, message=java.lang.RuntimeException: java.util.concurrent.ExecutionException: openlineage.hc.client5.http.ConnectTimeoutException: Connect to <http://135.170.226.91:8400> [/135.170.226.91] failed: Connection timed out, details=java.util.concurrent.CompletionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: openlineage.hc.client5.http.ConnectTimeoutException: Connect to <http://135.170.226.91:8400> [/135.170.226.91] failed: Connection timed out) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:69) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:90) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:90) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:81) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:119) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:103) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1612) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) the connection timeout is surprising because I can connect just fine using the example curl code from the same cluster:

%sh curl -X POST <http://135.170.226.91:8400/api/v1/lineage> \ -H 'Content-Type: application/json' \ -d '{ "eventType": "START", "eventTime": "2020-12-28T19:52:00.001+10:00", "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" }, "job": { "namespace": "gus2~-namespace", "name": "my-job" }, "inputs": [{ "namespace": "gus2-namespace", "name": "gus-input" }], "producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>" }' Spark config: spark.openlineage.host <http://135.170.226.91:8400> spark.openlineage.version v1 spark.openlineage.namespace gus-namespace Not sure what is going on, the EventEmitter init log looks like it's right but clearly something is off. Thanks so much for the help

John Thomas (john@datakin.com)
2022-05-02 15:03:40

*Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis

It might also be a firewall issue, but your cURL should preclude that

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:05:38

*Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain 'ol spark on my laptop and a localhost Marquez...

John Thomas (john@datakin.com)
2022-05-02 15:07:13

*Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:08:44

*Thread Reply:* yeah, I was going to try that, but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do it anyway just so I can see something working 🙂 (morale booster)

John Thomas (john@datakin.com)
2022-05-02 15:09:22

*Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:11:19

*Thread Reply:* sounds good, i will give it a go!

Will Johnson (will@willj.co)
2022-05-02 15:16:16

*Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.

In addition, is http://135.170.226.91:8400 accessible to Databricks? Could you try doing a %sh command inside of a databricks notebook and see if you can ping that IP address (https://linux.die.net/man/8/ping)?

For your Databricks cluster did you VNET inject it into an existing VNET? If it's in an existing VNET, you should confirm that the VM running marquez can access it. If it's in a non-VNET injected VNET, you probably need to redeploy to a VNET that has that VM or has connectivity to that VM.

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:19:22

*Thread Reply:* Yeah, I meant to ask about that. The docs say 1, as you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed based on this thread: https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:23:42

*Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.

Julius Rentergent (julius.rentergent@thetradedesk.com)
2022-05-02 15:37:00

*Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.

@Kostikey Mustakas Sorry for the distraction, but I’m curious how you have set up your networking to make the API requests work with Databricks. Good luck with your issue!

Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 15:47:17

*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map http traffic through and I have a Load Balancer Inbound NAT rule I setup that maps one our whitelisted port ranges (8400) to 5000.

:gratitude_thank_you: Julius Rentergent
Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 20:15:01

*Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated DNS to point to it. Now I get a 404 Not Found error instead of a timeout

🙌 Will Johnson
Kostikey Mustakas (kostikey.mustakas@gmail.com)
2022-05-02 20:16:41

*Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] .... OpenLineageHttpException(code=null, message={"code":404,"message":"HTTP 404 Not Found"}, details=null) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)

Julius Rentergent (julius.rentergent@thetradedesk.com)
2022-05-27 00:03:30

*Thread Reply:* Following up on this as I encounter the same issue with the OpenLineage Databricks integration. This issue seems quite nasty, as it crashes the Spark Context and requires a restart.

I have Marquez running on AWS EKS; I'm using OpenLineage 0.8.2 on Databricks 10.4 (Spark 3.2.1), and my Spark config looks like this:
```
spark.openlineage.host <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com>
spark.openlineage.namespace default
spark.openlineage.version v1   <- also tried "1"
```
I can run some simple read and write commands and successfully find the log4j events highlighted in the docs (INFO SparkContext; INFO OpenLineageContext; INFO AsyncEventQueue) each time I run the cell. After doing this a few times I get The spark context has stopped and the driver is restarting. Your notebook will be automatically reattached. stderr shows a bunch of things. log4j shows the same as for Kostikey: ERROR EventEmitter: [...] Unable to serialize logical plan due to: Infinite recursion (StackOverflowError)

I have one more piece of information which I can't make much sense of, but hopefully someone else can: if I include the port in the host, I can very reliably crash the Spark Context on the first attempt. So:
• <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com> <- crashes after a couple of attempts, sometimes it takes me a while to reproduce it while repeatedly reading/writing the same datasets
• <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com:80> <- crashes on first try
Any insights would be greatly appreciated! 🙂

Julius Rentergent (julius.rentergent@thetradedesk.com)
2022-05-27 00:22:27

*Thread Reply:* I tried two more things:
• curl works, ping fails, just like in the previous report
• Databricks allows providing Spark configs without quotes, whereas quotes are generally required for Spark. So I added the quotes to the host name, but now I'm getting: ERROR OpenLineageSparkListener: Unable to parse open lineage endpoint. Lineage events will not be collected

Martin Fiser (fisa@keboola.com)
2022-05-27 14:00:38

*Thread Reply:* @Kostikey Mustakas May I ask what is the reason for migration from Palantir? Sorry for this off-topic question!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-30 05:46:27

*Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795

Julius Rentergent (julius.rentergent@thetradedesk.com)
2022-06-01 11:15:26

*Thread Reply:* Thank you @Maciej Obuchowski. Just to clarify, the Spark Context crashes with and without port; it’s just that adding the port causes it to crash more quickly (on the 1st attempt).

I will run some more experiments when I have time, and add the results to the ticket.

Edit - added to issue:

I ran some more experiments, this time with a fake host and on OpenLineage 0.9.0, and was not able to reproduce the issue with regards to the port; instead, the new experiments show that Spark 3.2 looks to be involved.

On Spark 3.2.1 / Databricks 10.4 LTS: Using (fake) host http://ac7aca38330144df9.amazonaws.com:5000 crashes when the first notebook cell is evaluated with The spark context has stopped and the driver is restarting. The same occurs when the port is removed.

On Spark 3.1.2 / Databricks 9.1 LTS: Using (fake) host http://ac7aca38330144df9.amazonaws.com:5000 does not impede the cluster but, reasonably, produces for each lineage event ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: java.net.UnknownHostException The same occurs when the port is removed.

Michael Robinson (michael.robinson@astronomer.io)
2022-05-02 14:52:09

@channel The poll results are in, and the new day/time for the monthly TSC meeting is each second Thursday at 10 am PT. The next meeting will take place on Thursday, 5/19, at 10 am PT, due to a conflict with the Astronomer Spring Summit. Future meetings will take place on the second Thursday of each month. Calendar updates will be forthcoming. Thanks!

🙌 Willy Lulciuc, Mynor Choc
Will Johnson (will@willj.co)
2022-05-02 15:09:42

*Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?

Michael Robinson (michael.robinson@astronomer.io)
2022-05-02 15:14:11

*Thread Reply:* Yes, and I’ll update the msg for others. Thank you

Will Johnson (will@willj.co)
2022-05-02 15:16:25

*Thread Reply:* Thank you!

Sandeep Bhat (bhatsandeep424@gmail.com)
2022-05-02 21:45:39

Hi Team, I saw that Marquez builds lineage via Java code from the seed command. What should I do to connect to MySQL (our database) with credentials and build lineage for our own data?

Marco Diaz (mdiaz@roblox.com)
2022-05-03 12:40:55

@here How do we clear old jobs, datasets and namespaces from Marquez?

Mirko Raca (racamirko@gmail.com)
2022-05-04 07:04:48

*Thread Reply:* It seems we can't for now. This was the same question I had last week:

https://github.com/MarquezProject/marquez/issues/1736

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-04 10:56:35

*Thread Reply:* Seems that it's a really popular request 🙂

Tyler Farris (tyler@kickstand.work)
2022-05-03 13:43:56

Hello, I'm sending lineage events to astrocloud.datakin DB with the Marquez API. The event is sent, but the metadata for inputs and outputs isn't coming through. Below is an example of the event I'm sending. Not sure if this is the place for this question. Cross-posting to Marquez Slack. { "eventTime": "2022-05-03T17:20:04.151087+00:00", "run": { "runId": "2dfc6dcd4011d2a1c3dc1e5861127e5b" }, "job": { "namespace": "from-airflow", "name": "Postgres_1_to_Snowflake_2.extract" }, "producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>", "inputs": [ { "name": "Postgres_1_to_Snowflake_2.extract", "namespace": "from-airflow" } ] } Thanks.

Tyler Farris (tyler@kickstand.work)
2022-05-04 11:28:48

*Thread Reply:* @Mirko Raca pointed out that I was missing eventType.

Mirko Raca : "From a quick glance - you're missing "eventType": "START", attribute. It's also worth noting that metadata typically shows up after the second event (type COMPLETE)"

thanks again.

👍 Mirko Raca
Sandeep Bhat (bhatsandeep424@gmail.com)
2022-05-06 05:01:34

Hi Team, could anyone tell me: to view lineage in Marquez, do we have to write metadata as code, or does Marquez have a feature to scan SQL code and build lineage automatically? Please clarify this doubt for me.

Juan Carlos Fernández Rodríguez (jcfernandez@keedio.com)
2022-05-06 05:26:16

*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if one doesn't exist, you can write your own integration (and contribute it to the project)

Ross Turk (ross@datakin.com)
2022-05-06 12:59:06

*Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.

Ross Turk (ross@datakin.com)
2022-05-06 13:00:39

*Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)

Jorik (jorik@scivis.net)
2022-05-12 05:13:01

Hi all. We are looking at using OpenLineage for capturing some lineage in our custom processing system. I think we got the lineage events understood, but we have often datasets that get appended, or get overwritten by an operation. Is there anything in openlineage that would facilitate making this distinction? (ie. if a set gets overwritten we would be interested in the lineage events from the last overwrite, if it gets appended we would like to have all of these in the display)

Mirko Raca (racamirko@gmail.com)
2022-05-12 05:48:43

*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change its number of columns, it's the same thing.

The catch-all would be to create a custom Dataset facet which would record the distinction between append/overwrite per run. But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected).

Jorik (jorik@scivis.net)
2022-05-12 06:05:36

*Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.

Jorik (jorik@scivis.net)
2022-05-12 06:06:44

*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-12 06:12:09

*Thread Reply:* We have LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using Spark integration

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-12 06:13:25

*Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). It displays this information when it exists

🙌 Mirko Raca
Jorik (jorik@scivis.net)
2022-05-12 06:13:29

*Thread Reply:* Oh that looks perfect! I completely missed that, thanks!

Marco Diaz (mdiaz@roblox.com)
2022-05-12 15:46:04

Are there any examples on how to use this facet ColumnLineageDatasetFacet.json?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-13 05:19:47

*Thread Reply:* The work on the Spark side is not yet fully merged

raghanag (raghanag@gmail.com)
2022-05-12 17:49:23

Hi All, I am trying to see where we can provide owner details when using the openlineage-spark configuration; I see only namespace and other config parameters, but not the owner. Can we add an owner configuration to openlineage-spark, like spark.openlineage.owner? The owner could then be used to filter namespaces when showing the jobs or namespaces in the Marquez UI.

Michael Robinson (michael.robinson@astronomer.io)
2022-05-13 19:07:04

@channel The next OpenLineage Technical Steering Committee meeting is next Thursday, 5/19, at 10 am PT! Going forward, meetings will take place on the second Thursday of each month at 10 am PT. Join us on Zoom: https://astronomer.zoom.us/j/87156607114?pwd=a3B0K210dnRaQmdkaFdGMytBREZEQT09 All are welcome!
Agenda:
• releases 0.7.1 & 0.8.1
• column-level lineage
• open lineage
For notes and the agenda visit the wiki: https://tinyurl.com/openlineagetsc

🙌 Maciej Obuchowski, Ross Turk
Yannick Libert (yan@ioly.fr)
2022-05-16 11:02:23

Hi all, we are considering using OL to send lineage events from various jobs and places in our company. Since there will be multiple producers, we would like to use Kafka as our main hub for communication. One of our sources will be Airflow (more particularly MWAA, i.e. Airflow 2.2.2). Is there a way to configure the Airflow lineage backend to send events to Kafka instead of Marquez directly? So far, from what I've seen in the docs and in here, the only way would be to create a simple proxy to stream the HTTP events to Kafka. Is that still the case?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-16 11:31:17

*Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy

or configure OL client to send data to kafka: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

👍 Yannick Libert
Yannick Libert (yan@ioly.fr)
2022-05-16 12:15:59

*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that'd be great to know

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-16 12:37:10

*Thread Reply:* The second link 😉

Yannick Libert (yan@ioly.fr)
2022-05-17 03:46:03

*Thread Reply:* Sure, we already implemented the Python client for jobs outside Airflow and it works great 🙂 You are saying that there is a way to use this Python client in conjunction with the MWAA lineage backend to relay the job events that come with the Airflow integration (without including it in the DAGs)? Our strategy is to use both the Airflow backend to collect automatic lineage events without modifying any existing DAGs, and the in-code implementation to allow our data engineers to send their own events if they want to. The second option works perfectly, but the first one is where we struggle a bit, especially with MWAA.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-17 05:24:30

*Thread Reply:* If you can mount a file to MWAA, then yes - it should work with the config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file
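
For illustration, a mounted config file along these lines might do it - this is only a sketch, assuming the Kafka transport options described in the Python client README; the broker and topic names are placeholders:
```
# openlineage.yml - example config for the Kafka transport (placeholder values)
transport:
  type: kafka
  config:
    bootstrap.servers: broker1:9092,broker2:9092
  topic: openlineage.events
  flush: true
```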

Yannick Libert (yan@ioly.fr)
2022-05-17 05:40:45

*Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!

Michael Robinson (michael.robinson@astronomer.io)
2022-05-17 15:20:58

A release has been requested. Are there any +1s? Three from committers will authorize. Thanks.

➕ Maciej Obuchowski, Ross Turk, Willy Lulciuc, Michael Collado
Michael Robinson (michael.robinson@astronomer.io)
2022-05-18 10:33:03

The OpenLineage TSC meeting is tomorrow at 10am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1652483224119229

🙌 Willy Lulciuc
Tyler Farris (tyler@kickstand.work)
2022-05-18 16:23:56

Hey all, do custom extractors work with the TaskFlow API?

John Thomas (john@datakin.com)
2022-05-18 16:34:25

*Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.

If the things you're sending/receiving with TaskFlow are accessible in terms of metadata in the environment the DAG is running in, then you should be able to make one that would work!

This Webinar goes over creating custom extractors for reference.

Does that answer your question?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-18 16:41:16

*Thread Reply:* TaskFlow is internally just PythonOperator. If you write an extractor that assumes nothing more than it being a PythonOperator, then you could probably make it work 🙂

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:15:52

*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs: [2022-05-18, 20:52:34 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=Postgres_1_to_Snowflake_1_v3 task_id=Postgres_1 airflow_run_id=scheduled__2022-05-18T20:51:34.334045+00:00 The picture is my custom extractor, it's not doing anything currently as this is just a test.

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:16:05

*Thread Reply:* thanks again for the help yall

John Thomas (john@datakin.com)
2022-05-18 17:16:34

*Thread Reply:* did you set the environment variable with the path to your extractor?

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:16:46

*Thread Reply:*

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:17:13

*Thread Reply:* i believe thats correct @John Thomas

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:18:35

*Thread Reply:* and the versions im using: Astronomer Runtime 5.0.0 based on Airflow 2.3.0+astro.1

John Thomas (john@datakin.com)
2022-05-18 17:25:58

*Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:32:26

*Thread Reply:* ahh thanks John, as of right now extract_on_complete.

This is a similar setup as Michael had in the video.

John Thomas (john@datakin.com)
2022-05-18 17:33:31

*Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-18 17:39:44

*Thread Reply:* is there anything in logs regarding extractors?

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:40:36

*Thread Reply:* just this: [2022-05-18, 21:36:59 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=competitive_oss_projects_git_to_snowflake task_id=Transform_git_logs_to_S3 airflow_run_id=scheduled__2022-05-18T21:35:57.694690+00:00

Tyler Farris (tyler@kickstand.work)
2022-05-18 17:41:11

*Thread Reply:* @John Thomas Thanks, I appreciate your help.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-19 06:01:52

*Thread Reply:* No Failed to import messages?

Tyler Farris (tyler@kickstand.work)
2022-05-19 11:26:34

*Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log: ```* Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log. Please provide a bucket_name instead of "s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log" Falling back to local log * Reading local file: /usr/local/airflow/logs/dagid=Postgres1toSnowflake1v3/runid=scheduled2022-05-19T15:23:49.248097+00:00/taskid=Postgres1/attempt=1.log [2022-05-19, 15:24:50 UTC] {taskinstance.py:1158} INFO - Dependencies all met for <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled2022-05-19T15:23:49.248097+00:00 [queued]> [2022-05-19, 15:24:50 UTC] {taskinstance.py:1158} INFO - Dependencies all met for <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled_2022-05-19T15:23:49.248097+00:00 [queued]>

[2022-05-19, 15:24:50 UTC] {taskinstance.py:1355} INFO -

[2022-05-19, 15:24:50 UTC] {taskinstance.py:1356} INFO - Starting attempt 1 of 1

[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

[2022-05-19, 15:24:50 UTC] {taskinstance.py:1376} INFO - Executing <Task(PythonDecoratedOperator): Postgres1> on 2022-05-19 15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:52} INFO - Started process 3957 to run task [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'Postgres1toSnowflake1v3', 'Postgres1', 'scheduled2022-05-19T15:23:49.248097+00:00', '--job-id', '96473', '--raw', '--subdir', 'DAGSFOLDER/pgtosnow.py', '--cfg-path', '/tmp/tmp9n7u3i4t', '--error-file', '/tmp/tmp9a55v9b'] [2022-05-19, 15:24:50 UTC] {standardtaskrunner.py:80} INFO - Job 96473: Subtask Postgres1 [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/configuration.py:470 DeprecationWarning: The sqlalchemyconn option in [core] has been moved to the sqlalchemyconn option in [database] - the old setting has been used, but please update your config. [2022-05-19, 15:24:50 UTC] {taskcommand.py:369} INFO - Running <TaskInstance: Postgres1toSnowflake1v3.Postgres1 scheduled2022-05-19T15:23:49.248097+00:00 [running]> on host 056ca0b6c7f5 [2022-05-19, 15:24:50 UTC] {taskinstance.py:1568} INFO - Exporting the following env vars: AIRFLOWCTXDAGOWNER=airflow AIRFLOWCTXDAGID=Postgres1toSnowflake1v3 AIRFLOWCTXTASKID=Postgres1 AIRFLOWCTXEXECUTIONDATE=20220519T15:23:49.248097+00:00 AIRFLOWCTXTRYNUMBER=1 AIRFLOWCTXDAGRUNID=scheduled2022-05-19T15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'executiondate' from the template is deprecated and will be removed in a future version. Please use 'dataintervalstart' or 'logicaldate' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextds' from the template is deprecated and will be removed in a future version. Please use '{{ dataintervalend | ds }}' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextdsnodash' from the template is deprecated and will be removed in a future version. Please use '{{ dataintervalend | dsnodash }}' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'nextexecutiondate' from the template is deprecated and will be removed in a future version. Please use 'dataintervalend' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevdsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevexecutiondate' from the template is deprecated and will be removed in a future version. 
[2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'prevexecutiondatesuccess' from the template is deprecated and will be removed in a future version. Please use 'prevdataintervalstartsuccess' instead. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'tomorrowds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'tomorrowdsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'yesterdayds' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/utils/context.py:202 AirflowContextDeprecationWarning: Accessing 'yesterdaydsnodash' from the template is deprecated and will be removed in a future version. [2022-05-19, 15:24:50 UTC] {python.py:173} INFO - Done. Returned value was: extract [2022-05-19, 15:24:50 UTC] {loggingmixin.py:115} WARNING - /usr/local/lib/python3.9/site-packages/airflow/models/baseoperator.py:1369 DeprecationWarning: Passing 'executiondate' to 'TaskInstance.xcompush()' is deprecated. [2022-05-19, 15:24:50 UTC] {init.py:97} WARNING - Unable to find an extractor. tasktype=PythonDecoratedOperator airflowdagid=Postgres1toSnowflake1v3 taskid=Postgres1 airflowrunid=scheduled2022-05-19T15:23:49.248097+00:00 [2022-05-19, 15:24:50 UTC] {client.py:74} INFO - Constructing openlineage client to send events to https://api.astro-livemaps.datakin.com/ [2022-05-19, 15:24:50 UTC] {taskinstance.py:1394} INFO - Marking task as SUCCESS. dagid=Postgres1toSnowflake1v3, taskid=Postgres1, executiondate=20220519T152349, startdate=20220519T152450, enddate=20220519T152450 [2022-05-19, 15:24:50 UTC] {localtaskjob.py:156} INFO - Task exited with return code 0 [2022-05-19, 15:24:50 UTC] {localtask_job.py:273} INFO - 1 downstream tasks scheduled from follow-on schedule check```

Josh Owens (Josh@kickstand.work)
2022-05-19 16:57:38

*Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-20 10:26:01

*Thread Reply:* @Josh Owens one thing I can think of is that you might have older openlineage integration version, as OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694
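
(For anyone hitting the same thing: I believe the variable takes the full import path(s) of your extractor class - a sketch with a placeholder module path:
```
export OPENLINEAGE_EXTRACTORS=my_package.extractors.ManualLineageExtractor
```
Multiple extractors can be listed separated by semicolons.)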

Tyler Farris (tyler@kickstand.work)
2022-05-20 11:58:28

*Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-20 11:59:01

*Thread Reply:* 🙌

Michael Raymond (michael.raymond@cervest.earth)
2022-05-19 05:32:06

Hi 👋, I'm looking at OpenLineage as a solution for fine-grained data lineage tracking. Could I clarify a couple of points?

Where does one specify the version of an input dataset in the RunEvent? In the Marquez seed data I can see that it's recorded, but I'm not sure where it goes from looking at the OpenLineage schema. Or does it just assume the last version?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-19 05:59:59

*Thread Reply:* Currently, it assumes latest version. There's an effort with DatasetVersionDatasetFacet to be able to specify it manually - or extract this information from cases like Iceberg or Delta Lake tables.
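
For a sense of the shape, a dataset carrying that facet might look roughly like this - a sketch only; the facet key and datasetVersion field follow the DatasetVersionDatasetFacet spec as I understand it, and all values are placeholders:
```
{
  "namespace": "my_namespace",
  "name": "my_dataset",
  "facets": {
    "version": {
      "_producer": "https://example.com/my-producer",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json",
      "datasetVersion": "v2"
    }
  }
}
```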

Michael Raymond (michael.raymond@cervest.earth)
2022-05-19 06:14:59

*Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-19 06:18:20

*Thread Reply:* yes

✅ Michael Raymond
Michael Raymond (michael.raymond@cervest.earth)
2022-05-19 06:54:40

*Thread Reply:* Thanks, that's very helpful 👍

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:23:33

Hi all, I was testing https://github.com/MarquezProject/marquez/tree/main/examples/airflow#step-21-create-dag-counter, and the following error was observed in my airflow env:

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:23:52

Anybody know why this is happening? Any comments would be welcomed.

Tyler Farris (tyler@kickstand.work)
2022-05-19 15:27:35

*Thread Reply:* @Howard Yoo What version of airflow?

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:27:51

*Thread Reply:* it's 2.3

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:28:42

*Thread Reply:* (sorry, it's 2.4)

Tyler Farris (tyler@kickstand.work)
2022-05-19 15:29:28

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.

"Airflow 2.3+ Integration automatically registers itself for Airflow 2.3 if it's installed on Airflow worker's python. This means you don't have to do anything besides configuring it, which is described in Configuration section."

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:29:53

*Thread Reply:* Right, with the configuration I don't see any issues

Tyler Farris (tyler@kickstand.work)
2022-05-19 15:30:56

*Thread Reply:* so you don't need:

from openlineage.airflow import DAG

in your dag files

Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:31:41

*Thread Reply:* Okay... that makes sense then

Tyler Farris (tyler@kickstand.work)
2022-05-19 15:32:47

*Thread Reply:* so if you need to import DAG it would just be: from airflow import DAG

👍 Howard Yoo
Howard Yoo (howardyoo@gmail.com)
2022-05-19 15:56:19

*Thread Reply:* Thanks!

👍 Tyler Farris
Michael Robinson (michael.robinson@astronomer.io)
2022-05-19 17:13:02

@channel OpenLineage 0.8.2 is now available! The project now supports credentialing from the Airflow Secrets Backend and for the Azure Databricks Credential Passthrough, detection of datasets wrapped by ExternalRDDs, bug fixes, and more. For the details, see: https://github.com/OpenLineage/OpenLineage/releases/tag/0.8.2

🎉 Marco Diaz, Howard Yoo, Willy Lulciuc, Michael Collado, Ross Turk, Francis McGregor-Macdonald, Maciej Obuchowski
xiang chen (cdmikechen@hotmail.com)
2022-05-19 22:18:42

Hi everyone! Would it be possible for OpenLineage to support Camel pipelines?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-20 10:23:55

*Thread Reply:* What changes do you mean by letting OpenLineage support it? Or do you mean writing an Apache Camel integration?

xiang chen (cdmikechen@hotmail.com)
2022-05-22 19:54:17

*Thread Reply:* @Maciej Obuchowski Yes, to let OpenLineage work with Camel the same way it works with Airflow

xiang chen (cdmikechen@hotmail.com)
2022-05-22 19:56:47

*Thread Reply:* I think this would be a very valuable thing. I wish OpenLineage could support some commonly used pipeline tools, and abstract out some general interfaces so that users can extend it themselves

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-23 05:20:30

*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginnings of them) and a SQL parser

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-23 05:20:44

*Thread Reply:* As we support more systems, the general libraries will grow as well.

Conor Beverland (conorbev@gmail.com)
2022-05-20 13:50:53

I see a change in the metadata collected from Airflow jobs which I think was introduced with the combination of Airflow 2.3/OpenLineage 0.8.1. There's an airflow_version facet that contains an operator attribute.

Previously that attribute had values such as: airflow.providers.postgres.operators.postgres.PostgresOperator but I now see that for the very same task the operator is now tracked as: airflow.models.taskinstance.TaskInstance

( fwiw there's also a taskInfo attribute in there containing a json string which itself has a operator that is still set to PostgresOperator )

Is this an already known issue?

👀 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2022-05-20 20:23:15

*Thread Reply:* This looks like a bug. We are probably not looking at the right instance in the TaskInstanceListener.

Conor Beverland (conorbev@gmail.com)
2022-05-21 14:17:19

*Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this

Will Johnson (will@willj.co)
2022-05-20 21:42:46

Would anyone happen to have a link to the Technical Steering Committee meeting recordings?

I have quite a few people interested in seeing the overview of column lineage that Pawel provided during the Technical Steering Committee meeting on Thursday May 19th.

The wiki does not include a link to the recordings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

Are the recordings made public? Thank you for any links and guidance!

Julien Le Dem (julien@apache.org)
2022-05-20 21:55:09

That would be @Michael Robinson. Yes, the recordings are made public.

Michael Robinson (michael.robinson@astronomer.io)
2022-05-20 22:05:27
Will Johnson (will@willj.co)
2022-05-21 09:42:21

*Thread Reply:* Thank you so much, Michael!

Tyler Farris (tyler@kickstand.work)
2022-05-23 15:00:10

Is there documentation/examples around creating custom facets?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-24 06:41:11

*Thread Reply:* In Python or Java?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-24 06:44:32

*Thread Reply:* In Python, just inherit from BaseFacet and add a _get_schema static method that points to wherever you host the facet's JSON schema. For example, our DbtVersionRunFacet

In Java you can take a look at Spark's custom facets.
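
A minimal sketch of that in Python (the facet name, field, and schema URL are placeholders; the pattern mirrors the DbtVersionRunFacet linked above):
```
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class MyCustomFacet(BaseFacet):
    myField: str = attr.ib()

    @staticmethod
    def _get_schema() -> str:
        # URL of the facet's JSON schema, hosted somewhere reachable
        return "https://example.com/schemas/MyCustomFacet.json"
```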

Tyler Farris (tyler@kickstand.work)
2022-05-24 16:40:00

*Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.

I'm not sure what the disconnect is, but the facets aren't showing up in the inputs and outputs. The Lineage event is sent successfully to my astrocloud.

below is the facet and extractor, any help is appreciated. Thanks!

```
import logging
from typing import List, Optional

import attr
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import BaseFacet
from openlineage.client.run import InputDataset, OutputDataset

log = logging.getLogger(__name__)


@attr.s
class ManualLineageFacet(BaseFacet):
    database: Optional[str] = attr.ib(default=None)
    cluster: Optional[str] = attr.ib(default=None)
    connectionUrl: Optional[str] = attr.ib(default=None)
    target: Optional[str] = attr.ib(default=None)
    source: Optional[str] = attr.ib(default=None)
    _producer: str = attr.ib(init=False)
    _schemaURL: str = attr.ib(init=False)

    @staticmethod
    def _get_schema() -> str:
        return {
            "$schema": "<http://json-schema.org/schema#>",
            "$defs": {
                "ManualLineageFacet": {
                    "allOf": [
                        {
                            "type": "object",
                            "properties": {
                                "database": {
                                    "type": "string",
                                    "example": "Snowflake",
                                },
                                "cluster": {
                                    "type": "string",
                                    "example": "us-west-2",
                                },
                                "connectionUrl": {
                                    "type": "string",
                                    "example": "<http://snowflake>",
                                },
                                "target": {
                                    "type": "string",
                                    "example": "Postgres",
                                },
                                "source": {
                                    "type": "string",
                                    "example": "Stripe",
                                },
                                "description": {
                                    "type": "string",
                                    "example": "Description of inlet/outlet",
                                },
                                "_producer": {
                                    "type": "string",
                                },
                                "_schemaURL": {
                                    "type": "string",
                                },
                            },
                        },
                    ],
                    "type": "object",
                }
            },
        }


class ManualLineageExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ["PythonOperator", "_PythonDecoratedOperator"]

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        return TaskMetadata(
            f"{task_instance.dag_run.dag_id}.{task_instance.task_id}",
            inputs=[
                InputDataset(
                    namespace="default",
                    name=self.operator.get_inlet_defs()[0]["name"],
                    inputFacets=ManualLineageFacet(
                        database=self.operator.get_inlet_defs()[0]["database"],
                        cluster=self.operator.get_inlet_defs()[0]["cluster"],
                        connectionUrl=self.operator.get_inlet_defs()[0]["connectionUrl"],
                        target=self.operator.get_inlet_defs()[0]["target"],
                        source=self.operator.get_inlet_defs()[0]["source"],
                    ),
                )
                if self.operator.get_inlet_defs()
                else {},
            ],
            outputs=[
                OutputDataset(
                    namespace="default",
                    name=self.operator.get_outlet_defs()[0]["name"],
                    outputFacets=ManualLineageFacet(
                        database=self.operator.get_outlet_defs()[0]["database"],
                        cluster=self.operator.get_outlet_defs()[0]["cluster"],
                        connectionUrl=self.operator.get_outlet_defs()[0]["connectionUrl"],
                        target=self.operator.get_outlet_defs()[0]["target"],
                        source=self.operator.get_outlet_defs()[0]["source"],
                    ),
                )
                if self.operator.get_outlet_defs()
                else {},
            ],
            job_facets={},
            run_facets={},
        )

    def extract(self) -> Optional[TaskMetadata]:
        pass
```
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-25 09:21:02

*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object where the server expects a string field might cause some problems

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-25 09:21:59

*Thread Reply:* can you register ManualLineageFacet as facets not as inputFacets or outputFacets?

Tyler Farris (tyler@kickstand.work)
2022-05-25 13:15:30

*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working! Also great talk today at the airflow summit.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-25 13:25:17

*Thread Reply:* Thanks 🙇

Bruno González (brugms2@gmail.com)
2022-05-24 06:26:25

Hey guys! I'm pretty new with OL but would like to start using it for a combination of data lineage in Airflow + data quality metrics collection. I was wondering if that was possible, but Ross clarified that in the deeper dive webinar from some weeks ago (great one by the way!).

I'm referencing this comment from Julien to see if you have any updates or more examples apart from the one from great expectations. We have some custom operators and would like to push lineage and data quality metrics to Marquez using custom extractors. Any reference will be highly appreciated. Thanks in advance!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-24 06:35:05

*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L399

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-24 06:37:15

*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset
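
A rough sketch of constructing the metrics facet with the Python client (assuming the facet classes in openlineage.client.facet; the metric values are placeholders):
```
from openlineage.client.facet import (
    ColumnMetric,
    DataQualityMetricsInputDatasetFacet,
)

# Attach this to the tested dataset's input facets
metrics_facet = DataQualityMetricsInputDatasetFacet(
    rowCount=1500,
    bytes=32768,
    columnMetrics={
        "id": ColumnMetric(nullCount=0, distinctCount=1500),
    },
)
```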

Bruno González (brugms2@gmail.com)
2022-05-24 13:23:34

*Thread Reply:* Thanks @Maciej Obuchowski!!!

Howard Yoo (howardyoo@gmail.com)
2022-05-24 16:55:08

Hi all, https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development <-- does this still work? I did follow the instructions, but running pytest failed with error messages like ________________________________________________ ERROR collecting tests/extractors/test_bigquery_extractor.py ________________________________________________ ImportError while importing test module '/Users/howardyoo/git/OpenLineage/integration/airflow/tests/extractors/test_bigquery_extractor.py'. Hint: make sure your test modules/packages have valid Python names. Traceback: openlineage/airflow/utils.py:251: in import_from_string module = importlib.import_module(module_path) /opt/homebrew/Caskroom/miniconda/base/envs/airflow/lib/python3.9/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) &lt;frozen importlib._bootstrap&gt;:1030: in _gcd_import ??? &lt;frozen importlib._bootstrap&gt;:1007: in _find_and_load ??? &lt;frozen importlib._bootstrap&gt;:986: in _find_and_load_unlocked ??? &lt;frozen importlib._bootstrap&gt;:680: in _load_unlocked ??? &lt;frozen importlib._bootstrap_external&gt;:850: in exec_module ??? &lt;frozen importlib._bootstrap&gt;:228: in _call_with_frames_removed ??? ../../../airflow.master/airflow/providers/google/cloud/operators/bigquery.py:39: in &lt;module&gt; from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook, BigQueryJob ../../../airflow.master/airflow/providers/google/cloud/hooks/bigquery.py:46: in &lt;module&gt; from googleapiclient.discovery import Resource, build E ModuleNotFoundError: No module named 'googleapiclient'

Howard Yoo (howardyoo@gmail.com)
2022-05-24 16:55:09

...

Howard Yoo (howardyoo@gmail.com)
2022-05-24 16:55:54

looks like just running pytest isn't able to run all the tests, as some of these DAG tests seem to require connectivity to Google BigQuery, databases, etc.

Mardaunt (miostat@yandex.ru)
2022-05-25 16:32:08

👋 Hi everyone! I didn't find this in the documentation: can OpenLineage show me which source columns a final DataFrame column came from? (Spark)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-25 16:59:47

*Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side

🙌 Mardaunt
Mardaunt (miostat@yandex.ru)
2022-05-25 17:06:12

*Thread Reply:* Thanks! I will keep an eye on updates.

Martin Fiser (fisa@keboola.com)
2022-05-25 21:08:39

Hi all, showcase time:

We have implemented a native OpenLineage endpoint and metadata writer in our Keboola all-in-one data platform. The reason was that for more complex data pipeline scenarios it is beneficial to display the lineage in more detail. Additionally, we hope that OpenLineage as a standard will catch on and open up the ability to push lineage data into other data governance tools than Marquez. The implementation started as an internal POC of translating our metadata into the OpenLineage /lineage format and resulted in a native API endpoint, and later an app within the Keboola platform ecosystem, feeding platform job metadata at a regular cadence. We furthermore use a namespace for each Keboola project so users can observe the data through their whole data mesh setup (multi-project architecture). Please reach out to me if you have any questions!

🙌 Maciej Obuchowski, Michael Robinson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-26 06:05:33

*Thread Reply:* Looks great! Thanks for sharing!

Gopi Krishnan Rajbahadur (gopikrishnanrajbahadur@gmail.com)
2022-05-26 10:13:26

Hi OpenLineage team,

I am Gopi Krishnan Rajbahadur, one of the core members of OpenDatalogy project (a project that we are currently trying to sandbox as a part of LF-AI). Our OpenDatalogy project focuses on providing a process that allows users of publicly available datasets (e.g., CIFAR-10) to ensure license compliance. In addition, we also aim to provide a public repo that documents the final rights and obligations associated with common publicly available datasets, so that users of these datasets can use them compliantly in their AI models and software.

One of the key aspects of conducting dataset license compliance analysis involves tracking the lineage and provenance of the dataset (as we highlight in this paper here: https://arxiv.org/abs/2111.02374). We think that in this regard, our projects (i.e., OpenLineage and OpenDatalogy) could work together to use the existing OpenLineage standard and also collaborate to adopt/modify/enhance and use OpenLineage to track and document the lineage of a publicly available dataset. On that note, we are also working with the SPDX community to make the lineage and provenance of a dataset be tracked as a part of the SPDX BOM that is in the works for representing AI software (AI SBOM).

We think our projects could mutually benefit from collaborating with each other. Our project's Github could be found here: https://github.com/OpenDataology/OpenDataology. Any feedback that you have about our project would be greatly appreciated. Also, as we are trying to sandbox our project, if you could also show us your support we would greatly appreciate it!

Look forward to hearing back from you

Sincerely, Gopi

👀 Howard Yoo, Maciej Obuchowski
Ilqar Memmedov (ccmilgar@gmail.com)
2022-05-30 04:25:10

Hi guys, sorry for the basic question. I did a PoC of OpenLineage for gathering metrics on a Spark job, especially for table creation, alter, and drop. I found that DROP/ALTER TABLE statements do not trigger the listener to post lineage data. Is this normal behaviour?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-30 05:38:41

*Thread Reply:* Might be that case if you're using Spark 3.2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-30 05:38:54

*Thread Reply:* There were some changes to those operators

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-30 05:39:09

*Thread Reply:* If you're not using 3.2, please share more details 🙂

Ilqar Memmedov (ccmilgar@gmail.com)
2022-05-30 07:58:58

*Thread Reply:* Yep, I'm using Spark version 3.2.1

Ilqar Memmedov (ccmilgar@gmail.com)
2022-05-30 07:59:35

*Thread Reply:* Is it an open issue, or do I have some option to force them to be sent?

Ilqar Memmedov (ccmilgar@gmail.com)
2022-05-30 07:59:58

*Thread Reply:* btw thank you for quick response @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-05-30 08:00:34

*Thread Reply:* Yes, we have issue for AlterTable at least

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-06-01 02:52:14

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2. @Ilqar Memmedov Did you mean drop table or drop columns? I am not aware of any drop table issue.

Ilqar Memmedov (ccmilgar@gmail.com)
2022-06-01 06:03:38

*Thread Reply:* @Paweł Leszczyński drop table statement.

Ilqar Memmedov (ccmilgar@gmail.com)
2022-06-01 06:05:58

*Thread Reply:* To reproduce it, I just created a simple Spark job: create a table as select from another, select data from the table, and then drop the entire table.

Lineage data was posted only for the "create table as select" part

xiang chen (cdmikechen@hotmail.com)
2022-06-01 05:16:01

Hi all, I have a question about lineage. I am running Airflow 2.3.1 and have started the latest Marquez service via docker-compose. I found that with the Airflow example DAGs I can only see the job information, but not the lineage of the jobs. How can I configure it to see the lineage?

Ross Turk (ross@datakin.com)
2022-06-03 14:20:16

*Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.

xiang chen (cdmikechen@hotmail.com)
2022-06-01 05:23:54

Another problem is that if I declare a skipped task (e.g. DummyOperator) in the DAG, it never appears in the job list. I think this is a problem, because even if it does not run, it should still be visible as a metadata object.

Michael Robinson (michael.robinson@astronomer.io)
2022-06-01 10:19:33

@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, June 9 at 10 am PT. Join us on Zoom: https://us02web.zoom.us/j/81831865546?pwd=RTladlNpc0FTTDlFcWRkM2JyazM4Zz09 All are welcome!
Agenda:
1. a recent blog post about Snowflake
2. the Great Expectations integration
3. the dbt integration
4. open discussion
Notes: https://tinyurl.com/openlineagetsc
Is there a topic you think the community should discuss at this or a future meeting? DM me to add items to the agenda.
👀 Howard Yoo, Francis McGregor-Macdonald
Michael Robinson (michael.robinson@astronomer.io)
2022-06-04 09:45:41

@channel OpenLineage 0.9.0 is now available, featuring column-level lineage in the Spark integration, bug fixes and more! For the details, see: https://github.com/OpenLineage/OpenLineage/releases/tag/0.9.0 and https://github.com/OpenLineage/OpenLineage/compare/0.8.2...0.9.0. Thanks to all the contributors who made this release possible, including @Paweł Leszczyński for authoring the column-level lineage PRs and new contributor @JDarDagran!

👍 Howard Yoo, Jarek Potiuk, Maciej Obuchowski, Ross Turk, Minkyu Park, pankaj koti, Jorik, Li Ding, Faouzi, Howard Yoo, Mardaunt
🎉 pankaj koti, Faouzi, Howard Yoo, Sheeri Cabral (Collibra), Mardaunt
❤️ Faouzi, Howard Yoo, Mardaunt
Tyler Farris (tyler@kickstand.work)
2022-06-06 16:14:52

Hey, all. Working on a PR to OpenLineage. I'm curious about file naming conventions for facets. I'm noticing that there are two conventions being used:
• in OpenLineage.spec.facets, e.g. ExampleFacet.json
• in OpenLineage.integration.common.openlineage.common.schema, e.g. example-facet.json
Thanks

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-08 08:02:58

*Thread Reply:* I think internal naming is more important 🙂

I guess, for now, try to match what the local directory has.

Tyler Farris (tyler@kickstand.work)
2022-06-08 10:59:39

*Thread Reply:* Thanks @Maciej Obuchowski

raghanag (raghanag@gmail.com)
2022-06-07 03:24:03

Hi Team, we are seeing the dataset name set to the custom query when we run a Spark job that queries an Oracle DB via JDBC with a custom query. The custom query contains newlines, which causes the NodeId ID_PATTERN match to fail. How can we give a custom dataset name when we use custom queries?

Marquez API regex ref: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/models/NodeId.java#L44 ERROR [2022-06-07 06:11:49,592] io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 3648e87216d7815b ! java.lang.IllegalArgumentException: node ID (dataset:oracle:thin:_//&lt;host-name&gt;:1521:( ! SELECT ! RULE.RULE_ID, ! ASSG.ASSIGNED_OBJECT_ID, ASSG.ORG_ID, ASSG.SPLIT_PCT, ! PRTCP.PARTICIPANT_NAME, PRTCP.START_DATE, PRTCP.END_DATE ! FROM RULE RULE, ! ASSG ASSG, ! PRTCP PRTCP ! WHERE ! RULE.RULE_ID = ASSG.RULE_ID(+) ! --AND RULE.RULE_ID = 300100207891651 ! AND PRTCP.PARTICIPANT_ID = ASSG.ASSIGNED_OBJECT_ID ! -- and RULE.created_by = ' 1=1 ' ! and 1=1 ! )) must start with 'dataset', 'job', or 'run'

George Zachariah V (manish.zack@gmail.com)
2022-06-08 07:48:16

Hi Team, we have a Spark job xyz that uses the OpenLineageListener, which posts lineage events to the Marquez server. But we are seeing some unknown jobs in the Marquez UI:
• xyz.collect_limit
• xyz.execute_insert_into_hadoop_fs_relation_command
What jobs are these (collect_limit, execute_insert_into_hadoop_fs_relation_command)? How do we get the lineage listener to post only our job (xyz)?

👍 Pradeep S
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-08 11:00:41

*Thread Reply:* Those jobs are actually what Spark does underneath 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-08 11:00:57

*Thread Reply:* Are you using Delta Lake btw?

Moiz (moiz.groups@gmail.com)
2022-06-08 12:02:39

*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.

raghanag (raghanag@gmail.com)
2022-06-08 13:58:05

*Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-08 14:27:46

*Thread Reply:* I agree that it looks bad in the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.

If anything, we should filter some 'useless' nodes like collect_limit, since they add nothing.

We have an issue for doing this to specifically delta lake operations, as they are the biggest offenders: https://github.com/OpenLineage/OpenLineage/issues/628

👍 George Zachariah V
raghanag (raghanag@gmail.com)
2022-06-08 14:33:09

*Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?

raghanag (raghanag@gmail.com)
2022-06-08 16:09:15

*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a set of APIs that run before the Spark job (which is already integrated with the OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the Spark job. If we use the OpenLineage client to post the lineage events to Marquez, do we need to use the same run UUID across the lineage events for the run, or is there another way to do this? Can you please advise?

Ross Turk (ross@datakin.com)
2022-06-08 22:51:38

*Thread Reply:* I think I understand what you are asking -

The runID is used to correlate different state updates (i.e., start, fail, complete, abort) across the lifespan of a run. So if you are trying to add additional metadata to the same job run, you’d use the same runID.

So you’d generate a runID and send a START event, then in the various components you could send OTHER events containing the same runID + params you want to study in facets, then at the end you would send a COMPLETE.

(I think there should be an UPDATE event type in the spec for this sort of thing.)

👍 George Zachariah V, raghanag
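
As a rough sketch of that flow with the Python client (placeholder names and endpoint; the key point is reusing the same runId across the events):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
producer = "https://example.com/my-pipeline"
job = Job(namespace="my-namespace", name="my-job")
run = Run(runId=str(uuid4()))  # one runId, shared by all events of this run

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

client.emit(RunEvent(RunState.START, now(), run, job, producer))
# ... components do their work, optionally emitting OTHER events
#     with the same run + facets carrying the params to study ...
client.emit(RunEvent(RunState.COMPLETE, now(), run, job, producer))
```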
raghanag (raghanag@gmail.com)
2022-06-08 22:59:39

*Thread Reply:* thanks @Ross Turk, but what I am looking for is, let's say, for example: if we have 4 components in the system, then we want to show the 4 components as job icons in the graph, and the datasets between them would show the input/output parameters that these components use. A(job) --> DS1(dataset) --> B(job) --> DS2(dataset) --> C(job) --> DS3(dataset) --> D(job)

Ross Turk (ross@datakin.com)
2022-06-08 23:04:37

*Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined

Ross Turk (ross@datakin.com)
2022-06-08 23:06:03

*Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output

raghanag (raghanag@gmail.com)
2022-06-08 23:06:18

*Thread Reply:* got it

Ross Turk (ross@datakin.com)
2022-06-08 23:06:34

*Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)

👍 raghanag
Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-06-10 12:27:58

*Thread Reply:* > The eventual "aggregation" should be done by the event consumer. @Maciej Obuchowski Are there any known client-side libraries that support this aggregation already? In the case of Spark applications running as part of ETL pipelines, most of the time our end user is interested in seeing only the aggregated view, where all jobs spawned as part of a single application are rolled up into 1 job.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-10 12:32:14

*Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.

We'd love to have something like it, but AFAIK it affects only some percentage of Spark jobs and we can only do so much.

With exception of Delta Lake/Databricks, where it affects every job, and we know some nodes that could be safely filtered client side.

Will Johnson (will@willj.co)
2022-06-11 23:38:27

*Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!

Apache Atlas doesn't have the same model as Marquez. It only knows of effectively one entity that represents the complete asset.

@Mark Taylor designed this solution available now on Github to consolidate OpenLineage messages

https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/blob/d6514f2[…]/Function.Domain/Helpers/OlProcessing/OlMessageConsolodation.cs

In addition, we do some filtering only based on inputs and outputs to limit the messages AFTER it has been emitted.

🙌 Maciej Obuchowski
Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-06-19 09:37:06

*Thread Reply:* thank you !

Michael Robinson (michael.robinson@astronomer.io)
2022-06-08 10:54:32

@channel The next OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1654093173961669

👍 Maciej Obuchowski, Sheeri Cabral (Collibra), Willy Lulciuc, raghanag, Mardaunt
Jakub Moravec (jkb.moravec@gmail.com)
2022-06-09 13:04:00

*Thread Reply:* Hi, is the link correct? The meeting room is empty

Michael Robinson (michael.robinson@astronomer.io)
2022-06-09 16:04:23

*Thread Reply:* sorry about that, thanks for letting us know

Mark Beebe (mark_j_beebe@progressive.com)
2022-06-13 15:13:59

Hello all, after sending dbt openlineage events to Marquez, I am now looking to use the Marquez API to extract the lineage information. I am able to use python requests to call the Marquez API to get other information such as namespaces, datasets, etc., but I am a little bit confused about what I need to enter to get the lineage. I included screenshots for what the API reference shows regarding retrieving the lineage where it shows that a nodeId is required. However, this is where I seem to be having problems. It is not exactly clear where the nodeId needs to be set or what the nodeId needs to include. I would really appreciate any insights. Thank you!

Ross Turk (ross@datakin.com)
2022-06-13 18:49:37

*Thread Reply:* Hey @Mark Beebe!

In this case, nodeId is going to be either a dataset or a job. You need to tell Marquez where to start since there is likely to be more than one graph. So you need to get your hands on an identifier for that starting node.

Ross Turk (ross@datakin.com)
2022-06-13 18:50:07

*Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:

Ross Turk (ross@datakin.com)
2022-06-13 18:53:43

*Thread Reply:* Or you can search, if you know the name of the dataset:

Ross Turk (ross@datakin.com)
2022-06-13 18:53:54

*Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.
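
Putting that together, the lineage call itself might look like this - a sketch with placeholder names, using the dataset:namespace:name nodeId pattern Marquez expects:
```
import requests

# Query the Marquez lineage endpoint, starting from a dataset node
resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": "dataset:my_namespace:my_dataset"},
)
print(resp.json())
```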

Mark Beebe (mark_j_beebe@progressive.com)
2022-06-14 08:11:30

*Thread Reply:* That worked, thank you so much!

👍 Ross Turk
Varun Singh (varuntestaz@outlook.com)
2022-06-14 05:52:39

Hi all, I need to send the lineage information from the Spark integration directly to a Kafka topic. The Java client seems to have a KafkaTransport; is it planned to support this from inside the Spark integration as well?

👀 Francis McGregor-Macdonald
Michael Robinson (michael.robinson@astronomer.io)
2022-06-14 10:35:48

Hi all, I’m working on a blog post about the Spark integration and would like to credit @tnazarew and @Sbargaoui for their contributions. Anyone know these contributors’ names? Are you on here? Thanks for any leads.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-14 10:37:01

*Thread Reply:* tnazarew - Tomasz Nazarewicz

Michael Robinson (michael.robinson@astronomer.io)
2022-06-14 10:37:14

*Thread Reply:* 🙌

Ross Turk (ross@datakin.com)
2022-06-15 12:46:45

*Thread Reply:* 👍

Conor Beverland (conorbev@gmail.com)
2022-06-14 13:58:07

Has anyone tried getting the OpenLineage Spark integration working with GCP Dataproc ?

Peter Hanssens (peter@cloudshuttle.com.au)
2022-06-15 15:49:17

Hi Folks, DataEngBytes is a community data engineering conference here in Australia and will be hosted on the 27th and 29th of September. Our CFP is open for just under a month and tickets are on sale now:
Call for papers: https://sessionize.com/dataengbytes-2022/
Tickets: https://www.tickettailor.com/events/dataengbytes/713307
Promo video: https://youtu.be/1HE_XNLvHss

👀 Ross Turk, Michael Collado
Michael Robinson (michael.robinson@astronomer.io)
2022-06-17 16:23:32

A release of OpenLineage has been requested pending the merging of #856. Three +1s will authorize a release today. @Willy Lulciuc @Michael Collado @Ross Turk @Maciej Obuchowski @Paweł Leszczyński @Mandy Chessell @Daniel Henneberger @Drew Banin @Julien Le Dem @Ryan Blue @Will Johnson @Zhamak Dehghani

➕ Willy Lulciuc, Maciej Obuchowski, Michael Collado
✅ Michael Collado
Chase Christensen (christensenc3526@gmail.com)
2022-06-22 17:09:18

👋 Hi everyone!

👋 Conor Beverland, Ross Turk, Maciej Obuchowski, Michael Robinson, George Zachariah V, Willy Lulciuc, Dinakar Sundar
Lee (chenzuoli709@gmail.com)
2022-06-23 21:54:05

hi

👋 Maciej Obuchowski, Sheeri Cabral (Collibra), Willy Lulciuc, Michael Robinson, Dinakar Sundar
Michael Robinson (michael.robinson@astronomer.io)
2022-06-25 07:34:32

@channel OpenLineage 0.10.0 is now available! We added SnowflakeOperatorAsync extractor support to the Airflow integration, an InMemoryRelationInputDatasetBuilder for InMemory datasets to the Spark integration, a static code analysis tool to run in CircleCI on Python modules, a copyright to all source files, and a static analyzer called PMD to the build process. Changes we made include skipping FunctionRegistry.class serialization in the Spark integration, installing the new Rust-based SQL parser by default in the Airflow integration, improving the integration tests for the Airflow integration, reducing event payload size by excluding local data and including an output node in start events, and splitting the Spark integration into submodules. Thanks to all the contributors who made this release possible!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.10.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.9.0...0.10.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🙌 Maciej Obuchowski, Filipe Comparini Vieira, Manuel, Dinakar Sundar, Ross Turk, Paweł Leszczyński, Willy Lulciuc, Adisesha Reddy G, Conor Beverland, Francis McGregor-Macdonald, Jam Car
Mike brenes (brenesmi@gmail.com)
2022-06-28 18:29:29

Why has put dataset been deprecated? How do I add an initial data set via api?

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:39:16

*Thread Reply:* I think you’re referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:40:28

*Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:43:20

*Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:41:38

🥺

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:42:21

So how would you propose I create the initial node if I am trying to do a POC?

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:44:49

*Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:45:09

*Thread Reply:* Sorry didn't notice you over here ! lol

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:45:53

*Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:47:39

*Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:48:14

*Thread Reply:* no, just visualize the current migration flow.

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:48:53

*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:49:08

*Thread Reply:* really AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:49:19

*Thread Reply:* correct

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:49:45

*Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:50:05

*Thread Reply:* yes

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:50:26

*Thread Reply:* which I think I can do once the first nodes exist

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:51:18

*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, but that didn't do it. I also can't get the sql context that is in these examples.

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:51:54

*Thread Reply:* really just want to re-create food_delivery using my own biz context

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:52:14

*Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:53:49

*Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!
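
For example, a minimal sketch with the Python client (the URL, namespace, and dataset names here are placeholders; see the workshops for the full pattern):

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# transport can also be configured via env vars, e.g. OPENLINEAGE_URL
client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="migration.copy_source_to_s3")
producer = "https://example.com/my-producer"  # identifies the emitting code

# START event, then a COMPLETE event carrying the input/output datasets
client.emit(RunEvent(RunState.START, datetime.now(timezone.utc).isoformat(),
                     run, job, producer))
client.emit(RunEvent(RunState.COMPLETE, datetime.now(timezone.utc).isoformat(),
                     run, job, producer,
                     inputs=[Dataset(namespace="onprem", name="source_db.my_table")],
                     outputs=[Dataset(namespace="s3://my-bucket", name="raw/my_table")]))
```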

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:54:32

*Thread Reply:* Don’t forget to configure the transport for the client

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:54:45

*Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:55:25

*Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:55:48

*Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:55:52

*Thread Reply:* saw that too !

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:56:03

*Thread Reply:* you’re on top of it!

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:56:28

*Thread Reply:* ha. Thanks again!

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:42:40

This would be outside of Airflow

Fenil Doshi (fdoshi@salesforce.com)
2022-06-28 18:43:22

Hello, Is OpenLineage planning to add support for inlets and outlets for the Airflow integration? I am working on a project that relies on it and was hoping to contribute to this feature if it's something that is in the talks. I saw an open issue here

I am willing to work on it. My plan was to support just File and Table entities (for inlets and outlets): pass the inlets and outlets info into the extract_metadata function here, and then convert the Airflow entities into TaskMetaData entities here.

Does this sound reasonable?
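
For what it's worth, the conversion piece could look something like this (a rough sketch, not the eventual implementation; attribute names follow airflow.lineage.entities.Table):

```
from airflow.lineage.entities import Table
from openlineage.client.run import Dataset

def table_to_dataset(table: Table) -> Dataset:
    # the namespace/name mapping here is an assumption for illustration;
    # the real extractor may map cluster/database differently
    return Dataset(
        namespace=table.cluster,
        name=f"{table.database}.{table.name}",
    )
```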

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:59:38

*Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem

🙌 Fenil Doshi
Willy Lulciuc (willy@datakin.com)
2022-06-28 18:59:59

*Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well

👍 Fenil Doshi
Fenil Doshi (fdoshi@salesforce.com)
2022-07-08 12:40:39

*Thread Reply:* I have created a draft PR for this here. Please let me know if the changes make sense.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-08 12:42:30

*Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too

✅ Fenil Doshi
👀 Fenil Doshi
Fenil Doshi (fdoshi@salesforce.com)
2022-07-08 18:12:47

*Thread Reply:* I have made the changes in-line to the mentioned comments here. Does this look good?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-12 09:35:22

*Thread Reply:* I think it looks good! Would be great to have tests for this feature though.

👍 Fenil Doshi, Julien Le Dem
Fenil Doshi (fdoshi@salesforce.com)
2022-07-15 21:56:50

*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done. Thank you for the support! 😄

👀 Willy Lulciuc, Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-18 06:48:03

*Thread Reply:* One change and I think it will be good for now.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-18 06:48:07

*Thread Reply:* Have you tested it manually?

Fenil Doshi (fdoshi@salesforce.com)
2022-07-20 13:22:04

*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌 Yes, I tested it manually (for Airflow versions 2.1.4 and 2.3.3) and it works 🎉

Conor Beverland (conorbev@gmail.com)
2022-07-20 13:24:55

*Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )

👍 Fenil Doshi
Fenil Doshi (fdoshi@salesforce.com)
2022-07-20 15:20:32

*Thread Reply:* Yes, Sure! I will add it in the PR description

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-21 05:30:56

*Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag

👍 Fenil Doshi
Conor Beverland (conorbev@gmail.com)
2022-07-27 12:20:43

*Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏

Fenil Doshi (fdoshi@salesforce.com)
2022-07-27 12:26:22

*Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?

Conor Beverland (conorbev@gmail.com)
2022-07-27 12:42:29

*Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 12:48:32

*Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 12:49:23

*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it; we're still figuring out the structure and will move things later

👍 Ross Turk, Fenil Doshi
Ross Turk (ross@datakin.com)
2022-07-27 13:12:31

*Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will likely be edited or moved before it becomes official 👍

👍 Fenil Doshi
Fenil Doshi (fdoshi@salesforce.com)
2022-07-27 20:37:34

*Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16

Also, have added an example for it. 🙂 Let me know if something is unclear and needs to be updated.

✅ Conor Beverland
Conor Beverland (conorbev@gmail.com)
2022-07-28 12:50:54

*Thread Reply:* Thanks! very cool.

Conor Beverland (conorbev@gmail.com)
2022-07-28 12:52:22

*Thread Reply:* Does Airflow check the types of the inlets/outlets btw?

Like I wonder if a user could directly define an OpenLineage DataSet ( which might even have various other facets included on it ) and specify it in the inlets/outlets ?

Ross Turk (ross@datakin.com)
2022-07-28 12:54:56

*Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.

Ross Turk (ross@datakin.com)
2022-07-28 12:55:42

*Thread Reply:* I am accustomed to creating OpenLineage entities like this:

taxes = Dataset(namespace="postgres://foobar", name="schema.table")

Ross Turk (ross@datakin.com)
2022-07-28 12:56:45

*Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…

Conor Beverland (conorbev@gmail.com)
2022-07-28 12:58:18

*Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.

Like we would suggest users to use openlineage.client.run.Dataset but if a user already has DAGs that use Table then they'd still work in a best efforts way.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-28 13:03:07

*Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones

Ross Turk (ross@datakin.com)
2022-07-28 17:18:35

*Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?

Conor Beverland (conorbev@gmail.com)
2022-08-15 09:49:02

*Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015

( I left the support for converting the Airflow Table to Dataset because I think that's nice to have also )
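
For anyone finding this later, manual declaration then looks roughly like this (a sketch; the task, namespaces, and names are made up):

```
from airflow.operators.bash import BashOperator
from openlineage.client.run import Dataset

copy_orders = BashOperator(
    task_id="copy_orders",
    bash_command="echo 'copying orders...'",
    # picked up by openlineage-airflow as input/output datasets
    inlets=[Dataset(namespace="postgres://prod-db:5432", name="public.orders")],
    outlets=[Dataset(namespace="s3://data-lake", name="raw/orders")],
)
```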

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:44:24

the food_delivery example's example.etl_categories node

Mike brenes (brenesmi@gmail.com)
2022-06-28 18:44:40

how do I recreate that using OpenLineage?

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:45:52

*Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)

Willy Lulciuc (willy@datakin.com)
2022-06-28 18:46:15

*Thread Reply:* Give me a sec to send you over the diff…

❤️ Mike brenes
Willy Lulciuc (willy@datakin.com)
2022-06-28 18:56:35

*Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR

Conor Beverland (conorbev@gmail.com)
2022-06-28 20:05:33

I'm very new to DBT but wanted to give it a try with OL. I had a couple of questions when going through the DBT tutorial here: https://docs.getdbt.com/guides/getting-started/learning-more/getting-started-dbt-core

  1. An earlier part of the tutorial has you build a model in a single sql file: https://docs.getdbt.com/guides/getting-started/learning-more/getting-started-dbt-core#build-your-first-model When I did this and ran dbt-ol I got a lineage graph like this:
👀 Maciej Obuchowski
Conor Beverland (conorbev@gmail.com)
2022-06-28 20:05:54

(screenshot of the resulting lineage graph)
Conor Beverland (conorbev@gmail.com)
2022-06-28 20:07:11

then a later part of the tutorial has you split that same example into multiple models and when I run it again I get the graph like:

Conor Beverland (conorbev@gmail.com)
2022-06-28 20:07:27

(screenshot of the resulting lineage graph with the split models)
Conor Beverland (conorbev@gmail.com)
2022-06-28 20:08:54

^ I'm just kind of curious if it's working as expected? And/or could it be possible to parse the DBT .sql so that the lineage in the first case would still show those staging tables?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-29 10:04:14

*Thread Reply:* I think you should declare those as sources? Or do you need something different?

Conor Beverland (conorbev@gmail.com)
2022-06-29 21:15:33

*Thread Reply:* I'll try to experiment with this.

Conor Beverland (conorbev@gmail.com)
2022-06-28 20:09:19
  1. I see that DBT has a concept of adding tests to your models. Could those add data quality facets in OL?
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-06-29 10:02:17

*Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build

Conor Beverland (conorbev@gmail.com)
2022-06-29 21:15:25

*Thread Reply:* oh, nice!

shweta p (shweta.pbs@gmail.com)
2022-07-04 02:48:35

Hi everyone, I am trying openlineage-dbt. It works perfectly locally when I publish the events to Marquez... but when I run the same commands from MWAA, I don't see those events triggered, and I am not able to view any logs if there is an error. How do I debug the issue?

Julien Le Dem (julien@apache.org)
2022-07-06 14:26:59

*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available (environment variables or conf file).
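
(For reference, dbt-ol reads the same configuration as the Python client, e.g. environment variables along these lines; the values here are placeholders:)

```
OPENLINEAGE_URL=http://your-marquez-host:5000
OPENLINEAGE_NAMESPACE=my-dbt-namespace
# OPENLINEAGE_API_KEY=... only if your backend requires auth
```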

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-06 15:31:20

*Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-06 15:35:06

*Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are

Michael Robinson (michael.robinson@astronomer.io)
2022-07-06 05:32:28

Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on July 14. Reply in thread or ping me with your item(s)!

Will Johnson (will@willj.co)
2022-07-06 10:21:50

*Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?

A few months ago, Flink was being introduced and it was said that more thought was needed around supporting streaming services in OpenLineage.

It would be very helpful to know where the community stands on how streaming data sources should work in OpenLineage.

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2022-07-06 11:08:01

*Thread Reply:* @Will Johnson added your item

👍 Will Johnson
Will Johnson (will@willj.co)
2022-07-06 10:19:44

Request for Creating a New OpenLineage Release

Hello #general, as per the Governance guide (https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md#openlineage-project-releases), I am asking that we generate a new release based on the latest commit by @Maciej Obuchowski (c92a93cdf3df636a02984188563d019474904b2b) which fixes a critical issue running OpenLineage on Azure Databricks.

Having this release made available to the general public on Maven would allow us to enable the hundred+ users of the solution to run OpenLineage on the latest LTS versions of Databricks. In addition, it would enable the Microsoft team to integrate the amazing column level lineage feature contributed by @Paweł Leszczyński with our solution for Microsoft Purview.

👍 Maciej Obuchowski, Jakub Dardziński, Ross Turk, Willy Lulciuc, Will Johnson, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2022-07-07 10:33:41

@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, July 14 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:

  1. Announcements/recent talks
  2. Release 0.10.0 overview
  3. Flink integration retrospective
  4. Discuss: streaming services in Flink integration
  5. Open discussion
Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
David Cecchi (david_cecchi@cargill.com)
2022-07-11 10:30:34

*Thread Reply:* would appreciate a TSC discussion on OL philosophy for streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, more so just validating how OL is being positioned from a vision perspective. as we consider aligning our enterprise lineage solution around OL, we want to make sure we're not making bad assumptions. a neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".

👍 Michael Robinson
Ross Turk (ross@datakin.com)
2022-07-12 12:36:17

*Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?

👍 Michael Robinson, Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2022-07-12 12:46:29

*Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!

David Cecchi (david_cecchi@cargill.com)
2022-07-12 15:08:48

*Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-07-12 16:12:43

*Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022

I’m happy to help craft the abstract, look over slides, etc. (I could help present, but all I’ve done with OpenLineage is one tutorial, so I’m hardly an expert).

CfP closes 31 Aug so there’s plenty of time, but if you want a 2nd set of eyes on things, we can’t just wait until the last minute to submit 😄

Will Johnson (will@willj.co)
2022-07-07 12:04:09

How to create custom facets without recompiling OpenLineage?

I have a customer who is interested in using OpenLineage but wants to extend the facets WITHOUT recompiling OL / maintaining a clone of OL with their changes.

Do we have any examples of how someone might create their own jar but using the OpenLineage CustomFacetBuilder and then have that jar's classes be injected into OpenLineage?

Will Johnson (will@willj.co)
2022-07-07 12:04:55

*Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?

Michael Collado (collado.mike@gmail.com)
2022-07-07 15:16:45

*Thread Reply:* This is described here. Notably: > Custom implementations are registered by following Java's ServiceLoader conventions. A file called io.openlineage.spark.api.OpenLineageEventHandlerFactory must exist in the application or jar's META-INF/services directory. Each line of that file must be the fully qualified class name of a concrete implementation of OpenLineageEventHandlerFactory. More than one implementation can be present in a single file. This might be useful to separate extensions that are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, while another factory may contain GCP extensions.
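
As a concrete illustration of the convention (the implementation class name below is hypothetical):

```
# file: META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory
com.mycompany.lineage.MyCompanyEventHandlerFactory
```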

Michael Collado (collado.mike@gmail.com)
2022-07-07 15:17:55
Will Johnson (will@willj.co)
2022-07-07 20:19:01

*Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!

Michael Robinson (michael.robinson@astronomer.io)
2022-07-07 19:27:47

@channel @Will Johnson OpenLineage 0.11.0 is now available!
We added:
• an HTTP option to override timeout and properly close connections in the openlineage-java lib,
• dynamic mapped tasks support to the Airflow integration,
• a SqlExtractor to the Airflow integration,
• PMD to Java and Spark builds in CI.
We changed:
• when testing extractors in the Airflow integration, the extractor list length assertion is now dynamic,
• templates are rendered at the start of integration tests for the TaskListener in the Airflow integration.
Thanks to all the contributors who made this release possible! For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.11.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.10.0...0.11.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👍 Chandru TMBA, John Thomas, Maciej Obuchowski, Fenil Doshi
👏 John Thomas, Willy Lulciuc, Ricardo Gaspar
🙌 Will Johnson, Maciej Obuchowski, Sergio Sicre
Varun Singh (varuntestaz@outlook.com)
2022-07-11 07:06:36

Hi all, I am using openlineage-spark in my project where I lock the dependency versions in gradle.lockfile. After release 0.10.0, this is not working. Is this a known limitation of switching to splitting the integration into submodules?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 06:18:29

*Thread Reply:* Can you expand on what's not working exactly?

This is not something we're aware of.

Varun Singh (varuntestaz@outlook.com)
2022-07-19 04:09:39

*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the OpenLineage library in the new uber jar. This worked fine till 0.9.0, but now building the shadowJar gives this error:

```
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find spark:app:0.10.0.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/spark/app/0.10.0/app-0.10.0.pom
     Required by:
         project > io.openlineage:openlineage-spark:0.10.0
   > Could not find spark:shared:0.10.0.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/spark/shared/0.10.0/shared-0.10.0.pom
     Required by:
         project > io.openlineage:openlineage-spark:0.10.0
   > Could not find spark:spark2:0.10.0.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/spark/spark2/0.10.0/spark2-0.10.0.pom
     Required by:
         project > io.openlineage:openlineage-spark:0.10.0
   > Could not find spark:spark3:0.10.0.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/spark/spark3/0.10.0/spark3-0.10.0.pom
     Required by:
         project > io.openlineage:openlineage-spark:0.10.0
```

(For each artifact, Gradle adds: "If the artifact you are trying to retrieve can be found in the repository but without metadata in 'Maven POM' format, you need to adjust the 'metadataSources { ... }' of the repository declaration.")

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-19 05:00:02

*Thread Reply:* Can you try 0.11? I think we might have already fixed that.

Varun Singh (varuntestaz@outlook.com)
2022-07-19 05:50:03

*Thread Reply:* Tried with that as well. Doesn't work

Varun Singh (varuntestaz@outlook.com)
2022-07-19 05:56:50

*Thread Reply:* Same error with 0.11.0 as well

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-19 08:11:13

*Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-19 08:11:34

*Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-19 08:16:18

*Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources

Hanbing Wang (doris.wang200902@gmail.com)
2022-07-19 14:18:45

*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^

Varun Singh (varuntestaz@outlook.com)
2022-07-20 03:44:40

*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:

```
repositories {
    mavenCentral() {
        metadataSources {
            mavenPom()
            ignoreGradleMetadataRedirection()
        }
    }
}
```

I am able to build the jar now. I am not proficient in gradle so don't know if this is the right way to do this. Please correct me if I am wrong.

Varun Singh (varuntestaz@outlook.com)
2022-07-20 05:26:04

*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in the subproject's build file). But this is a different problem, I guess.

Hanbing Wang (doris.wang200902@gmail.com)
2022-07-20 18:45:50

*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar again.

Will Johnson (will@willj.co)
2022-07-13 01:10:01

Java Question: Why Can't I Find a Class on the Class Path? / How the heck does the ClassLoader know where to find a class?

Are there any java pros that would be willing to share alternatives to searching if a given class exists or help explain what should change in the Kusto package to make it work for the behaviors as seen in Kafka and SQL DW relation visitors? --- Details --- @Hanna Moazam and I are trying to introduce two new Azure data sources into OpenLineage's Spark integration. The https://github.com/Azure/azure-kusto-spark package is nearly done but we're getting tripped up on some Java concepts. In order to know if we should add the KustoRelationVisitor to the input dataset visitors, we need to see if the Kusto jar is installed on the spark / databricks cluster. In this case, the com.microsoft.kusto.spark.datasource.DefaultSource is a public class but it cannot be found using the KustoRelationVisitor.class.getClassLoader().loadClass("class name") methods as seen in:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]nlineage/spark/agent/lifecycle/plan/SqlDWDatabricksVisitor.java
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]penlineage/spark/agent/lifecycle/plan/KafkaRelationVisitor.java
At first I thought it was the Azure packages, but then I tried the same approach with a simple java library.

I instantiate a spark-shell like this:

```
spark-shell --master local[4] \
  --conf spark.driver.extraClassPath=/mnt/repos/SparkListener-Basic/lib/build/libs/custom-listener.jar \
  --conf spark.extraListeners=listener.MyListener \
  --jars /mnt/repos/wjtestlib/lib/build/libs/lib.jar
```

With lib.jar containing a class that looks like this:

```
package wjtestlib;

public class WillLibrary {
    public boolean someLibraryMethod() {
        return true;
    }
}
```

And the custom listener is very simple:

```
public class MyListener extends org.apache.spark.scheduler.SparkListener {

    private static final Logger log = LoggerFactory.getLogger("MyLogger");

    public MyListener() {
        log.info("INITIALIZING");
    }

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        log.info("MYLISTENER: ON JOB START");
        try {
            log.info("Trying wjtestlib.WillLibrary");
            MyListener.class.getClassLoader().loadClass("wjtestlib.WillLibrary");
            log.info("Got wjtestlib.WillLibrary");
        } catch (ClassNotFoundException e) {
            log.info("Could not get wjtestlib.WillLibrary");
        }

        try {
            log.info("Trying wjtestlib.WillLibrary using Class.forName");
            Class.forName("wjtestlib.WillLibrary", false, this.getClass().getClassLoader());
            log.info("Got wjtestlib.WillLibrary using Class.forName");
        } catch (ClassNotFoundException e) {
            log.info("Could not get wjtestlib.WillLibrary using Class.forName");
        }
    }
}
```

And I still get a result indicating it cannot find the class:

```
2022-07-12 23:58:22,048 INFO MyLogger: MYLISTENER: ON JOB START
2022-07-12 23:58:22,048 INFO MyLogger: Trying wjtestlib.WillLibrary
2022-07-12 23:58:22,057 INFO MyLogger: Could not get wjtestlib.WillLibrary
2022-07-12 23:58:22,058 INFO MyLogger: Trying wjtestlib.WillLibrary using Class.forName
2022-07-12 23:58:22,065 INFO MyLogger: Could not get wjtestlib.WillLibrary using Class.forName
```

Are there any java pros that would be willing to share alternatives to searching if a given class exists or help explain what should change in the Kusto package to make it work for the behaviors as seen in Kafka and SQL DW relation visitors?

Thank you for any guidance!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-13 08:50:15

*Thread Reply:* Could you unzip the created jar and verify that the classes you’re trying to use are present? Perhaps there’s some relocate in the shadowJar plugin which renames the classes. Making sure the classes are present in the jar is a good place to start.

Then you can try doing Class.forName just from the spark-shell without any listeners added. The classes should be available there.

Will Johnson (will@willj.co)
2022-07-13 11:42:25

*Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.

It looks like Databricks AND open-source Spark do some magic when you install a library OR use --jars on the spark-shell. In both Databricks and Apache Spark, the thread running the SparkListener cannot see the additional libraries installed unless they're on the original / main class path.

• Confirmed the uploaded jars are NOT shaded / renamed.
• The databricks class path ($CLASSPATH) is focused on /databricks/jars.
• The added libraries are in /local_disk0/tmp and are not found in $CLASSPATH.
• The SparkListener only recognizes $CLASSPATH.
• Using a classloader with an object like spark does not find our installed class: spark.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
• When we use a classloader on a class we installed and imported, it DOES find the class: myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
@Michael Collado and @Maciej Obuchowski have you seen any challenges with using --jars on the spark-shell and detecting if the class is installed?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-13 12:02:05

*Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars, but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.

What if you did it like it? By that I mean adding it to your code with compileOnly in gradle or provided in maven, compiling with it, then using static method to check if it loads?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-13 12:02:36

*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class") Isn't that this actual scenario?

Will Johnson (will@willj.co)
2022-07-13 12:36:47

*Thread Reply:* Thank you for the reply, Maciej!

I will try the compileOnly route tonight!

Re: myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")

I failed to mention that this was only achieved in the interactive shell / Databricks notebook. It never worked inside the SparkListener UNLESS we installed the Kusto jar on the databricks class path.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-14 06:43:47

*Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.

More doc can be found here: (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management)

When starting a SparkContext, all the jars available on the classpath should be listed and put into Spark logs. So that’s the place one can check if the jar is loaded or not.

If --conf spark.driver.extraClassPath is working, you can add multiple jar files there (as a classpath string, entries are separated by the platform's path separator, e.g. : on Linux, unlike --jars which uses commas).

Other examples of adding multiple jars to spark classpath can be found here -> https://sparkbyexamples.com/spark/add-multiple-jars-to-spark-submit-classpath/

Will Johnson (will@willj.co)
2022-07-14 11:20:02

*Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.

When using jars, the spark listener does NOT find the class using a classloader.

When using extraClassPath, the spark listener DOES find the class using a classloader.

When using --jars, we can see in the spark logs that after spark starts (and after the spark listener is already established?) there are Spark.AddJar commands being executed.

@Maciej Obuchowski we also experimented with doing a compileOnly on OpenLineage's spark listener, it did not change the behavior. OpenLineage still failed to identify that I had the kusto-spark-connector.

I'm going to reach out to Databricks to see if there is any guidance on letting the SparkListener be aware of classes added via their libraries / --jar method on the spark-shell.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 11:22:01

*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 11:22:48

*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado have something to add?

Will Johnson (will@willj.co)
2022-07-14 11:24:12

*Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?

Will Johnson (will@willj.co)
2022-07-14 11:26:04

*Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 11:29:13

*Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.0

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

Will Johnson (will@willj.co)
2022-07-14 11:29:51

*Thread Reply:* Trying with --packages right now.

Will Johnson (will@willj.co)
2022-07-14 11:54:37

*Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:

Spark shell command:

```
spark-shell --master local[4] \
  --conf spark.driver.extraClassPath=/customListener-1.0-SNAPSHOT.jar \
  --conf spark.extraListeners=listener.MyListener \
  --jars /WillLibrary.jar \
  --packages com.microsoft.azure.kusto:kusto-spark_3.0_2.12:3.0.0
```

Code inside customListener:

```
try {
    log.info("Trying Kusto DefaultSource");
    MyListener.class.getClassLoader().loadClass("com.microsoft.kusto.spark.datasource.DefaultSource");
    log.info("Got Kusto DefaultSource!!!!");
} catch (ClassNotFoundException e) {
    log.info("Could not get Kusto DefaultSource");
}
```

Logs indicating it still can't find the class when using --packages:

```
2022-07-14 10:47:35,997 INFO MyLogger: MYLISTENER: ON JOB START
2022-07-14 10:47:35,997 INFO MyLogger: Trying wjtestlib.WillLibrary
2022-07-14 10:47:36,000 INFO
2022-07-14 10:47:36,052 INFO MyLogger: Trying LogicalRelation
2022-07-14 10:47:36,053 INFO MyLogger: Got logical relation
2022-07-14 10:47:36,053 INFO MyLogger: Trying Kusto DefaultSource
2022-07-14 10:47:36,064 INFO MyLogger: Could not get Kusto DefaultSource
```
😢

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 11:59:07

*Thread Reply:* what if you load your listener using --packages as well?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-14 12:00:38

*Thread Reply:* That's how I'm doing it locally using spark.conf:

```
spark.jars.packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2,io.delta:delta-core_2.12:1.0.0,org.apache.iceberg:iceberg-spark3-runtime:0.12.1,io.openlineage:openlineage-spark:0.9.0
```

👀 Will Johnson
Will Johnson (will@willj.co)
2022-07-14 12:20:47

*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man! 🙏

```
2022-07-14 11:14:21,266 INFO MyLogger: Trying LogicalRelation
2022-07-14 11:14:21,266 INFO MyLogger: Got logical relation
2022-07-14 11:14:21,266 INFO MyLogger: Trying org.apache.iceberg.catalog.Catalog
2022-07-14 11:14:21,295 INFO MyLogger: Got org.apache.iceberg.catalog.Catalog!!!!
2022-07-14 11:14:21,295 INFO MyLogger: Trying Kusto DefaultSource
2022-07-14 11:14:21,361 INFO MyLogger: Got Kusto DefaultSource!!!!
```

I ended up setting up my spark-shell like this (and used --jars for my custom spark listener since it's not on Maven):

```
spark-shell --master local[4] \
  --conf spark.extraListeners=listener.MyListener \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.0,com.microsoft.azure.kusto:kusto-spark_3.0_2.12:3.0.0 \
  --jars customListener-1.0-SNAPSHOT.jar
```

So, now I just need to figure out how Databricks differs from this approach 😢

😂 Maciej Obuchowski, Jakub Dardziński, Hanna Moazam
Michael Collado (collado.mike@gmail.com)
2022-07-14 12:21:35

*Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages

Remember Java's ClassLoaders are hierarchical - there are parent ClassLoaders and child ClassLoaders. Parents can't see their children's classes, but children can see their parent's classes.

When you use --spark.driver.extraClassPath , you're adding a jar to the main application ClassLoader. But when you use --jars or --packages, you're instructing the Spark application itself to load the extra jars into its own ClassLoader - a child of the main application ClassLoader that the Spark code creates and manages separately. Since your listener class is loaded by the main application ClassLoader, it can't see any classes that are loaded by the Spark child ClassLoader. Either both jars need to be on the driver classpath or both jars need to be loaded by the --jar or --packages configuration parameter

🙌 Will Johnson, Paweł Leszczyński
Michael Collado (collado.mike@gmail.com)
2022-07-14 12:26:15

*Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other

Michael Collado (collado.mike@gmail.com)
2022-07-14 12:29:09

*Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)

Will Johnson (will@willj.co)
2022-07-14 12:30:25

*Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.

https://docs.microsoft.com/en-us/azure/databricks/libraries/

These packages are installed on databricks AFTER spark is started and the physical files are located in a folder that is different than the main classpath.

I'm going to reach out to Databricks and see if we can get any guidance on this 😢

Will Johnson (will@willj.co)
2022-07-14 12:31:32

*Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.

Michael Collado (collado.mike@gmail.com)
2022-07-14 12:32:46

*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️

🤣 Will Johnson
Will Johnson (will@willj.co)
2022-07-17 01:03:02

*Thread Reply:* Quick update:
• Turns out using a class loader from a Scala spark listener does not have this problem: https://stackoverflow.com/questions/7671888/scala-classloaders-confusion
• I'm trying to use URLClassLoader as recommended by a few MSFT folks and point it at the /local_disk0/tmp folder: https://stackoverflow.com/questions/17724481/set-classloader-different-directory
• I'm not having luck so far but hoping I can reason about it tomorrow and Monday.
This is blocking us from adding additional data sources that are not pre-installed on Databricks 😢

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-18 05:45:59

*Thread Reply:* Can't help you now, but I'd love it if you dumped the knowledge you've gained through this process into some doc on the new OpenLineage doc site 🙏

👍 Hanna Moazam
Hanna Moazam (hannamoazam@microsoft.com)
2022-07-18 05:48:15

*Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too

🙌 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-07-13 12:06:24

@channel The next OpenLineage TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1657204421157959

🙌 Willy Lulciuc, Maciej Obuchowski
💯 Willy Lulciuc, Maciej Obuchowski
David Cecchi (david_cecchi@cargill.com)
2022-07-13 16:32:12

check this out folks - marklogic datahub flow lineage into OL/marquez with jobs and runs and more. i would guess this is a pretty narrow use case but it went together really smoothly and thought i'd share. sometimes it's just cool to see what people are working on

🍺 Willy Lulciuc, Conor Beverland, Maciej Obuchowski, Paweł Leszczyński
❤️ Willy Lulciuc, Conor Beverland, Julien Le Dem, Michael Robinson, Maciej Obuchowski, Minkyu Park
Willy Lulciuc (willy@datakin.com)
2022-07-13 16:40:48

*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with marklogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)

David Cecchi (david_cecchi@cargill.com)
2022-07-13 16:57:29

*Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.

Willy Lulciuc (willy@datakin.com)
2022-07-13 17:02:53

*Thread Reply:* Ah totally make sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high-level)

Willy Lulciuc (willy@datakin.com)
2022-07-13 17:05:35

*Thread Reply:* No pressure, of course 😉

David Cecchi (david_cecchi@cargill.com)
2022-07-13 17:08:50

*Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.

Willy Lulciuc (willy@datakin.com)
2022-07-13 17:12:55

*Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!

👍 David Cecchi
Ross Turk (ross@datakin.com)
2022-07-13 17:18:27

*Thread Reply:* chiming in as well to say this is really cool 👍

Julien Le Dem (julien@apache.org)
2022-07-13 18:26:28

*Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?

Mark Chiarelli (mark.chiarelli@marklogic.com)
2022-07-14 11:07:42

*Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.

Willy Lulciuc (willy@datakin.com)
2022-07-14 20:31:18

@Ross Turk, in the OL community meeting today, you presented the new doc site (awesome!) that isn’t up (yet!), but I’ve been talking with @Julien Le Dem about the usage of _producer and would like to add a section on the use / function of _producer in OL events. I feel like the new doc site would be a great place to add this! Let me know when’s a good time to start crowd-sourcing content for the site

Ross Turk (ross@datakin.com)
2022-07-14 20:37:25

*Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.

The repo is open for business! Feel free to add the page where you think it fits.

❤️ Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2022-07-14 20:42:09

*Thread Reply:* OK! Let’s do this!

Willy Lulciuc (willy@datakin.com)
2022-07-14 20:59:36

*Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!

Ross Turk (ross@datakin.com)
2022-07-14 20:39:26

Hey everyone! As Willy says, there is a new documentation site for OpenLineage in the works.

It’s not quite ready to be, uh, a proper reference yet. But it’s not too far away. Help us get there by submitting issues, making page stubs, and adding sections via PR.

https://github.com/openlineage/docs/

🙌 Maciej Obuchowski, Michael Robinson
Willy Lulciuc (willy@datakin.com)
2022-07-14 20:43:09

*Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯

Ross Turk (ross@datakin.com)
2022-07-14 21:22:09

*Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.

Willy Lulciuc (willy@datakin.com)
2022-07-14 21:23:32

*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?

Ross Turk (ross@datakin.com)
2022-07-14 21:25:32

*Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website

Ross Turk (ross@datakin.com)
2022-07-14 21:25:39

*Thread Reply:* and it came live at openlineage.io/docs 😄

Willy Lulciuc (willy@datakin.com)
2022-07-14 21:26:06

*Thread Reply:* nice and sounds good 👍

Ross Turk (ross@datakin.com)
2022-07-14 21:26:31

*Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo

shweta p (shweta.pbs@gmail.com)
2022-07-15 09:10:46

Hi #general, how do I link Airflow tasks which may not have any input or output datasets, as they are just running some conditions? The dataset is generated only by the last task.

shweta p (shweta.pbs@gmail.com)
2022-07-15 09:11:25

In the lineage, though there is an option to link the parent, it doesn't show the lineage as job -> job

shweta p (shweta.pbs@gmail.com)
2022-07-15 09:11:43

does it need to be job -> dataset -> job only?

Ross Turk (ross@datakin.com)
2022-07-15 14:41:30

*Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data

Ross Turk (ross@datakin.com)
2022-07-15 14:43:41

*Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces
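
To make that concrete: two jobs become connected in the graph only by sharing a dataset. A minimal sketch with the Python client (all names here are made up):

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
now = lambda: datetime.now(timezone.utc).isoformat()
producer = "https://example.com/my-producer"
shared = Dataset(namespace="example", name="public.final_output")

# job_a writes the dataset...
client.emit(RunEvent(RunState.COMPLETE, now(),
                     Run(runId=str(uuid4())), Job("example", "job_a"),
                     producer, outputs=[shared]))
# ...and job_b reads it, so the graph shows job_a -> dataset -> job_b
client.emit(RunEvent(RunState.COMPLETE, now(),
                     Run(runId=str(uuid4())), Job("example", "job_b"),
                     producer, inputs=[shared]))
```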

Ilya Davidov (idavidov@marpaihealth.com)
2022-07-18 11:32:06

👋 Hi everyone!

Ilya Davidov (idavidov@marpaihealth.com)
2022-07-18 11:32:51

i am looking for some information regarding openlineage integration with AWS Glue jobs/workflows

Ilya Davidov (idavidov@marpaihealth.com)
2022-07-18 11:33:32

i am wondering if it's possible and if someone has already given it a try and maybe documented it?

John Thomas (john.thomas@astronomer.io)
2022-07-18 15:16:54

*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?thread_ts=1637605977.118000&cid=C01CK9T7HKR

John Thomas (john.thomas@astronomer.io)
2022-07-18 15:17:49

*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive

David Cecchi (david_cecchi@cargill.com)
2022-07-18 16:29:02

*Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.

👍 Ross Turk
Francis McGregor-Macdonald (francis@mc-mac.com)
2022-07-18 19:34:32

*Thread Reply:* There are a couple of aws people here (including me) following.

👍 David Cecchi, Ross Turk
Mikkel Kringelbach (mikkel@theoremlp.com)
2022-07-19 18:01:46

Hi all, I have been playing around with Marquez for a hackday. I have been able to get some lineage information loaded in (using the local docker version for now). I have been trying to set the location (for the link) and description information for a job (the text saying "Nothing to show here"), but I haven't been able to figure out how to do this using the /lineage api. Any help would be appreciated.

Ross Turk (ross@datakin.com)
2022-07-19 20:11:38

*Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.

Ross Turk (ross@datakin.com)
2022-07-19 20:13:03

*Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217

:gratitude_thank_you: Mikkel Kringelbach
Ross Turk (ross@datakin.com)
2022-07-19 20:13:18

*Thread Reply:* (looking for a curl example…)

Mikkel Kringelbach (mikkel@theoremlp.com)
2022-07-19 20:25:49

*Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?

Are these documented anywhere?

Ross Turk (ross@datakin.com)
2022-07-19 20:27:55

*Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.

Ross Turk (ross@datakin.com)
2022-07-19 20:28:28

*Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:

{ "job": { "name": "order_analysis.find_popular_products", "facets": { "sql": { "query": "DROP TABLE IF EXISTS top_products;\n\nCREATE TABLE top_products AS\nSELECT\n product,\n COUNT(order_id) AS num_orders,\n SUM(quantity) AS total_quantity,\n SUM(price ** quantity) AS total_value\nFROM\n orders\nGROUP BY\n product\nORDER BY\n total_value desc,\n num_orders desc;", "_producer": "https: //github.com/OpenLineage/OpenLineage/tree/0.11.0/integration/airflow", "_schemaURL": "<https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet>" } }, "namespace": "workshop" }, "run": { "runId": "13460e52-a829-4244-8c45-587192cfa009", "facets": {} }, "inputs": [ ... ], "outputs": [ ... ], "producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.11.0/integration/airflow>", "eventTime": "2022-07-20T00: 23: 06.986998Z", "eventType": "COMPLETE" }

Ross Turk (ross@datakin.com)
2022-07-19 20:28:58

*Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets

Ross Turk (ross@datakin.com)
2022-07-19 20:29:19

*Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well

Mikkel Kringelbach (mikkel@theoremlp.com)
2022-07-19 22:25:39

*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:
```
{
  "facets": {
    "description": "test-description-job",
    "sql": {
      "query": "SELECT QUERY",
      "_schema": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SqlJobFacet"
    },
    "documentation": {
      "documentation": "Test docs?",
      "_schema": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DocumentationJobFacet"
    },
    "link": {
      "type": "",
      "url": "www.google.com/test_url",
      "_schema": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet"
    }
  }
}
```

Mikkel Kringelbach (mikkel@theoremlp.com)
2022-07-19 22:36:55

*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work.
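Looking at the standard facets list, it seems the intended keys are documentation (with a description field) and sourceCodeLocation (with type and url) - a sketch with placeholder values:

```
# Sketch of job facets per the standard facets list; all values are placeholders.
job_facets = {
    "documentation": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.11.0/client/python",
        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DocumentationJobFacet",
        "description": "Test docs?",
    },
    "sourceCodeLocation": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.11.0/client/python",
        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/SourceCodeLocationJobFacet",
        "type": "git",
        "url": "https://github.com/my-org/my-repo/blob/main/jobs/my_job.py",
    },
}
```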

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-20 10:33:36

Hey all. I've been doing a cleanup of issues on GitHub. If I've closed your issue that you think is still relevant, please reopen it and let us know.

🙌 Jakub Dardziński, Michael Collado, Will Johnson, Ross Turk
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-07-21 16:09:08

Regarding https://databricks.com/blog/2022/06/08/announcing-the-availability-of-data-lineage-with-unity-catalog.html: are they using OpenLineage? I know there’s been a lot of work to make sure OpenLineage integrates with Databricks, even earlier this year.

Databricks
Ross Turk (ross@datakin.com)
2022-07-21 16:25:47

*Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.

I think it would be cool to see some discussions start to develop around it 👍

👍 Sheeri Cabral (Collibra), Julius Rentergent
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-07-21 16:26:44

*Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-07-21 16:30:55

*Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)

openlineage.io
👍 Will Johnson
Julien Le Dem (julien@apache.org)
2022-07-21 16:56:24

*Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.

👍 Sheeri Cabral (Collibra), Will Johnson, Thijs Koot
Thijs Koot (thijs.koot@gmail.com)
2022-07-27 11:41:48

*Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well

Joao Vicente (joao.diogo.vicente@gmail.com)
2022-07-23 04:09:55

Hi all, I am trying to find the state of columnLineage in OL. I see a proposal and some examples in https://github.com/OpenLineage/OpenLineage/search?q=columnLineage&type= but I can't find it in the spec. Can anyone shed any light on why this would be the case?

Joao Vicente (joao.diogo.vicente@gmail.com)
2022-07-23 04:12:26

*Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json

Joao Vicente (joao.diogo.vicente@gmail.com)
2022-07-23 04:37:11

*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec: https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2022-07-26 19:37:54

*Thread Reply:* It is supported in the Spark integration

Julien Le Dem (julien@apache.org)
2022-07-26 19:39:13

*Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
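For context, the facet maps each output field to the input fields it was derived from; a rough sketch (dataset and field names are placeholders):

```
# Sketch of a ColumnLineageDatasetFacet on an output dataset (placeholder names).
column_lineage_facet = {
    "fields": {
        "total_value": {  # an output column
            "inputFields": [
                {"namespace": "warehouse", "name": "orders", "field": "price"},
                {"namespace": "warehouse", "name": "orders", "field": "quantity"},
            ],
            "transformationDescription": "sum of price * quantity",
            "transformationType": "AGGREGATION",
        }
    }
}
```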

Will Johnson (will@willj.co)
2022-07-24 16:24:15

SundayFunday

Putting together some internal training for OpenLineage and highlighting some of the areas that have been useful to me on my journey with OpenLineage. Many thanks to @Michael Collado, @Maciej Obuchowski, and @Paweł Leszczyński for the continued technical support and guidance.

❤️ Hanna Moazam, Ross Turk, Minkyu Park, Atif Tahir, Paweł Leszczyński
Will Johnson (will@willj.co)
2022-07-24 16:26:59

*Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅

Ross Turk (ross@datakin.com)
2022-07-25 11:49:54

*Thread Reply:* 😄

Ross Turk (ross@datakin.com)
2022-07-25 11:50:54

*Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need

Julien Le Dem (julien@apache.org)
2022-07-26 19:39:56

*Thread Reply:* This looks nice by the way.

❤️ Will Johnson
Sylvia Seow (sylviaseow@gmail.com)
2022-07-26 09:06:28

hi all, really appreciate it if anyone could help. I have been trying to create a POC project with OpenLineage and dbt. Attached is the pip list of the OpenLineage packages that I have. However, when I run the "dbt-ol" command, it prompted to open as a file instead of running as a command. The regular dbt run can be executed without issue. I would like to know what I have done wrong, or if there is any configuration that I have missed. Thanks a lot

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-26 10:39:57

*Thread Reply:* do you have proper execute permissions?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-26 10:41:09

*Thread Reply:* not sure how that works on Windows, but it just looks like it does not recognize dbt-ol as an executable

Sylvia Seow (sylviaseow@gmail.com)
2022-07-26 10:43:00

*Thread Reply:* yes i have admin rights. how to make this as executable?

Sylvia Seow (sylviaseow@gmail.com)
2022-07-26 10:43:25

*Thread Reply:* btw do we have a sample docker image where dbt-ol can run?

Ross Turk (ross@datakin.com)
2022-07-26 17:33:08

*Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?

Sylvia Seow (sylviaseow@gmail.com)
2022-07-26 21:03:43

*Thread Reply:* will try that

Will Johnson (will@willj.co)
2022-07-26 16:41:04

Running a single unit test on the Spark Integration - how does it work with the different modules?

Prior to splitting up the OpenLineage spark integration, I could run a command like the one below to test a single test or even a single test method. Now I get a failure and it's pointing to the app: module. Can anyone share the right syntax for running a unit test with the current package structure? Thank you!!

```wj@DESKTOP-ECF9QME:~/repos/OpenLineageWill/integration/spark$ ./gradlew test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest

> Task :app:test FAILED

SUCCESS: Executed 0 tests in 872ms

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':app:test'.
> No tests found for given includes: io.openlineage.spark.agent.OpenLineageSparkListenerTest

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.4/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 2s
18 actionable tasks: 4 executed, 14 up-to-date```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-27 01:54:31

*Thread Reply:* This may be a result of splitting Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from shared submodule (this one looks like that), you could try running: ./gradlew :shared:test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest

Hanna Moazam (hannamoazam@microsoft.com)
2022-07-27 03:18:42

*Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:

```> Task :shared:test FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':shared:test'.
> No tests found for given includes: io.openlineage.spark.agent.OpenLineageSparkListenerTest

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.4/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 971ms
6 actionable tasks: 2 executed, 4 up-to-date```

Hanna Moazam (hannamoazam@microsoft.com)
2022-07-27 03:24:41

*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in OpenLineage/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/plan are being run at all.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-27 03:42:43

*Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.

👍 Hanna Moazam
Hanna Moazam (hannamoazam@microsoft.com)
2022-07-27 03:57:08

*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following: ./gradlew :shared:spotlessApply && ./gradlew :app:spotlessApply && ./gradlew clean build test

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-07-27 04:27:23

*Thread Reply:* I'll look into it

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-07-28 05:17:36

*Thread Reply:* Update: some tests appeared to not be visible after the split. That's fixed, but now I have to solve some dependency issues

🙌 Hanna Moazam, Will Johnson
Hanna Moazam (hannamoazam@microsoft.com)
2022-07-28 05:19:16

*Thread Reply:* That's great, thank you!

Hanna Moazam (hannamoazam@microsoft.com)
2022-07-29 06:05:55

*Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-07-29 06:07:58

*Thread Reply:* I'm still testing it, should've changed it to draft, sorry

👍 Hanna Moazam, Will Johnson
Hanna Moazam (hannamoazam@microsoft.com)
2022-07-29 06:08:59

*Thread Reply:* No worries! If I can help with testing or anything please let me know!

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-07-29 06:09:29

*Thread Reply:* Will do! Thanks :)

Hanna Moazam (hannamoazam@microsoft.com)
2022-08-02 11:06:31

*Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-08-02 13:45:34

*Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test but one is failing with ./gradlew app:build

but if it fixes your problem I can disable this test for now and make a PR without it, then you can maybe unblock your stuff and I will have more time to investigate the issue.

Hanna Moazam (hannamoazam@microsoft.com)
2022-08-02 14:54:45

*Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.

Hanna Moazam (hannamoazam@microsoft.com)
2022-08-02 14:54:52

*Thread Reply:* Thank you for your help Tomasz!

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-08-03 06:12:07

*Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes

🙌 Hanna Moazam
Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2022-08-03 06:12:26

*Thread Reply:* it's currently waiting for review

Hanna Moazam (hannamoazam@microsoft.com)
2022-08-03 06:20:41

*Thread Reply:* Thank you!

Conor Beverland (conorbev@gmail.com)
2022-07-26 18:44:47

Is there any doc yet about column level lineage? I see a spec for the facet here: https://github.com/openlineage/openlineage/issues/148

Julien Le Dem (julien@apache.org)
2022-07-26 19:41:13

*Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-27 01:59:27

*Thread Reply:* Sure, it’s already on my list, will do

:gratitude_thank_you: Julien Le Dem
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-29 07:55:40

*Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage

openlineage.io
✅ Conor Beverland
Conor Beverland (conorbev@gmail.com)
2022-07-26 20:03:55

maybe another question for @Paweł Leszczyński: I was watching the Airflow summit talk that you and @Maciej Obuchowski did ( very nice! ). How is this exposed? I'm wondering if it shows up as an edge on the graph in Marquez? ( I guess it may be tracked as a parent run and if so probably does not show on the graph directly at this time? )

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-07-27 04:08:18

*Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.

@Michael Collado or @Maciej Obuchowski: are you able to create some doc? I think one of you was working on that.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 04:24:19

*Thread Reply:* Yes, parent run

shweta p (shweta.pbs@gmail.com)
2022-07-27 01:29:05

Hi #general, there has been an issue with airflow+dbt+openlineage. This was working fine with openlineage-dbt v0.11.0, but there has been some change to typing-extensions due to which I had to upgrade to the latest dbt (from 1.0.0 to 1.1.0), and now dbt-ol is failing on schema version support (the version generated is v5 vs. dbt-ol supporting only v4). Has anyone else been able to fix this?

👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 04:47:18

*Thread Reply:* Will take a look

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 04:47:40

*Thread Reply:* But generally this support message is just a warning

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-27 10:04:20

*Thread Reply:* @shweta p any actual error you've found? I've tested it with dbt-bigquery on 1.1.0 and it works despite warning:

```
➜  small OPENLINEAGE_URL=http://localhost:5050 dbt-ol build
Running OpenLineage dbt wrapper version 0.11.0
This wrapper will send OpenLineage events at the end of dbt execution.
14:03:16  Running with dbt=1.1.0
14:03:17  Found 2 models, 3 tests, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
14:03:17
14:03:17  Concurrency: 2 threads (target='dev')
14:03:17
14:03:17  1 of 5 START table model dbt_test1.my_first_dbt_model .......................... [RUN]
14:03:21  1 of 5 OK created table model dbt_test1.my_first_dbt_model ..................... [CREATE TABLE (2.0 rows, 0 processed) in 3.31s]
14:03:21  2 of 5 START test unique_my_first_dbt_model_id ................................. [RUN]
14:03:22  2 of 5 PASS unique_my_first_dbt_model_id ....................................... [PASS in 1.55s]
14:03:22  3 of 5 START view model dbt_test1.my_second_dbt_model .......................... [RUN]
14:03:24  3 of 5 OK created view model dbt_test1.my_second_dbt_model ..................... [OK in 1.38s]
14:03:24  4 of 5 START test not_null_my_second_dbt_model_id .............................. [RUN]
14:03:24  5 of 5 START test unique_my_second_dbt_model_id ................................ [RUN]
14:03:25  5 of 5 PASS unique_my_second_dbt_model_id ...................................... [PASS in 1.38s]
14:03:25  4 of 5 PASS not_null_my_second_dbt_model_id .................................... [PASS in 1.42s]
14:03:25
14:03:25  Finished running 1 table model, 3 tests, 1 view model in 8.44s.
14:03:25
14:03:25  Completed successfully
14:03:25
14:03:25  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Artifact schema version: https://schemas.getdbt.com/dbt/manifest/v5.json is above dbt-ol supported version 4. This might cause errors.
Emitting OpenLineage events: 100%|████████████| 8/8 [00:00<00:00, 274.42it/s]
Emitted 10 openlineage events
```

Fenil Doshi (fdoshi@salesforce.com)
2022-07-27 20:39:21

When will the next version of OpenLineage be available tentatively?

Michael Robinson (michael.robinson@astronomer.io)
2022-07-27 20:41:44

*Thread Reply:* I think it's safe to say we'll see a release by the end of next week

:gratitude_thank_you: Fenil Doshi
👍 Fenil Doshi
Yehuda Korotkin (yehudak@elementor.com)
2022-07-28 04:02:06

👋 Hi everyone! Yesterday was a great presentation by @Julien Le Dem that talked about OpenLineage and made a great comparison between OL and OpenTelemetry (I wrote a small summary here: https://bit.ly/3z5caOI )

Julien’s charm sparked my curiosity, especially regarding OL in streaming. Having seen the design/architecture of OL, I have some questions/discussion points that I would like to understand better.

In the context of streaming jobs, reporting “start job” / “end job” might be more relevant to batch mode. Or do you mean a start job/end job should be reported for each event?

  • and this would be equivalent to starting a job for each row in a table via a UDF, for example.

Thank you in advance

linkedin.com
🙌 Maciej Obuchowski, Michael Robinson, Paweł Leszczyński
Will Johnson (will@willj.co)
2022-07-28 08:50:44

*Thread Reply:* Welcome to the community!

We talked about this exact topic in the most recent community call. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nextmeeting:Nov10th2021(9amPT)

Discussion: streaming in Flink integration
• Has there been any evolution in the thinking on support for streaming?
  ◦ Julien: start event, complete event, snapshots in between limited to a certain number per time interval
  ◦ Paweł: we can make the snapshot volume configurable
• Does Flink support sending data to multiple tables like Spark?
  ◦ Yes, multiple outputs supported by OpenLineage model
  ◦ Marquez, the reference implementation of OL, combines the outputs

🙏 Yehuda Korotkin
❤️ Julien Le Dem
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-28 09:56:05

*Thread Reply:* > Or do you mean a start job/end job should be reported for each event? We definitely want to avoid tracking every single event 🙂

One thing worth mentioning is that OpenLineage events are meant to be cumulative - the streaming jobs start, run, and eventually finish or restart. In the meantime, we capture additional events "in the middle" - for example, on Apache Flink checkpoint, or every few minutes - where we can emit additional information connected to the state of the job.

🙏 Yehuda Korotkin
Yehuda Korotkin (yehudak@elementor.com)
2022-07-28 11:11:17

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer

jobs start, run, and eventually finish or restart

This is the perspective that I have a hard time understanding in the context of streaming.

The classic streaming job should always be on; it should not have a “finish” event (except on failure). Usually, streaming data is “dripping” in.

It is possible to understand job start/end at the resolution of the running application, i.e. representing when the application began and when it failed.

If you derive start/stop events from the checkpoints on Flink, it might be the wrong representation; instead, use an event-driven concept, for example reporting state.

What do you think?

Yehuda Korotkin (yehudak@elementor.com)
2022-07-28 11:11:36
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-28 12:00:34

*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense because, if you for example changed the SQL of your Flink SQL job, you probably would want this to be captured: the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.

> If you derive start/stop events from the checkpoints on Flink, it might be the wrong representation; instead, use an event-driven concept, for example reporting state. But this is a misunderstanding 🙂 The information exposed from checkpoints is in addition to the start and stop events.

We want to get information from the running job - I just argue that sometimes the end of a streaming job is also relevant.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-07-28 12:01:16

*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - do I miss something why you want to add StateFacet?
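With the Python client it would look roughly like this (a sketch; the endpoint, runId, and job names are placeholders, and it assumes a client version that includes RUNNING):

```
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint

# Emitted at each Flink checkpoint, between the START and COMPLETE events.
client.emit(RunEvent(
    eventType=RunState.RUNNING,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId="d46e465b-d358-4d32-83d4-df660ff614dd"),
    job=Job(namespace="flink-jobs", name="my-streaming-job"),
    producer="https://github.com/OpenLineage/OpenLineage/tree/0.13.0/client/python",
))
```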

👍 Yehuda Korotkin
Yehuda Korotkin (yehudak@elementor.com)
2022-07-28 14:24:03

*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode. I agree that if you already have a ‘job’, you want to know more information about it.

Should each event entering the sub-process (job) make a REST call for “start job” and “end job”?

Nope, I just presented two possible ways that I thought of: either a StateFacet or adding a new event type, e.g. RUNNING 😉

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-07-28 09:14:28

Hi everyone, I’d like to request a release to publish the new Flink integration (thanks, @Maciej Obuchowski) and an important fix to the Spark integration (thanks, @Paweł Leszczyński). As per our policy here, 3 +1s from committers will authorize an immediate release. Thanks!

➕ Maciej Obuchowski, Paweł Leszczyński, Willy Lulciuc, Will Johnson, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2022-07-28 17:30:33

*Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.

Barak F (fargoun@gmail.com)
2022-07-28 10:30:15

Static code annotations for OpenLineage: hi everyone, I heard a great lecture yesterday by @Julien Le Dem on OpenLineage, and as I'm very interested in this area, I wanted to raise a question: are there any plans to have OpenLineage-like annotations on actual code (e.g. Spark, Airflow, arbitrary code) to allow deducing some of the lineage information from static code analysis?

The reason I'm asking this is because while OpenLineage does a great job of integrating with multiple platforms (Airflow, dbt, Spark), some companies still have a lot of legacy-related data processing stack that will probably not get full OpenLineage (as it's a one-off, and the companies themselves probably won't implement OpenLineage support for their custom frameworks). Having some standard way to annotate code with information like "reads from X; writes to Y; job name regexp: Z" may allow writing a "generic" OpenLineage collector that can go over the source code, collect this configuration information, and then use it when constructing the lineage graph (even though it won't be as complete and full as the full OpenLineage info).

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-03 08:30:15

*Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.

We're doing something similar within Airflow now, but as a fallback mechanism: https://github.com/OpenLineage/OpenLineage/pull/914

You can manually annotate DAG with information instead of writing extractor for your operator. This still gives you runtime information. Similar features might get added to other integrations, especially with such a vast scope as Airflow has - but I think it's unlikely we'd work on a feature for just statically traversing code without runtime context.
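A rough sketch of that fallback (operator and table names are just examples):

```
from airflow.lineage.entities import Table
from airflow.operators.bash import BashOperator

# No extractor exists for this operator's workload, so lineage metadata
# is declared manually and picked up by the OpenLineage Airflow integration.
task = BashOperator(
    task_id="legacy_etl",
    bash_command="./run_legacy_etl.sh",
    inlets=[Table(cluster="warehouse", database="public", name="source_table")],
    outlets=[Table(cluster="warehouse", database="public", name="target_table")],
)
```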

Barak F (fargoun@gmail.com)
2022-08-03 14:25:31

*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this beyond Airflow. My thinking is that there are other areas with a similarly vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information by building your own extractor, which might be a bit hard/expensive to do. If I understand your response correctly, you assume that OpenLineage can get wide enough "native" support across the stack without resorting to a fallback like static code analysis. Is that your base assumption?

Petr Hajek (petr.hajek@profinit.eu)
2022-07-29 04:36:03

Hi all, does anybody have experience extracting Airflow lineage using Marquez as documented here: https://www.astronomer.io/guides/airflow-openlineage/#generating-and-viewing-lineage-data ? We tested it on our Airflow instance with Marquez, hoping to get the standard .json files describing lineage in accord with the OpenLineage model as described in https://json-schema.org/draft/2020-12/schema. But there seems to be only one GET method related to lineage export in the Marquez API library, called "Get a lineage graph". This produces quite a different .json structure than what we know from OpenLineage. Could anybody help if there is a chance to get the OpenLineage .json structure from Marquez?

astronomer.io
Ross Turk (ross@datakin.com)
2022-07-29 12:58:38

*Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.

Few ideas:

  1. you could send the lineage to a pipedream endpoint to inspect, if you’re just trying to experiment
  2. you could grab them from the lineage table in Marquez’s postgres
Petr Hajek (petr.hajek@profinit.eu)
2022-07-30 16:29:24

*Thread Reply:* ok, now I understand, thank you

👍 Jan Kopic
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-03 08:25:57

*Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927

But if you need just the raw events endpoint, without UI, then Marquez might be overkill for your needs

Dinakar Sundar (dinakar_sundar@condenast.com)
2022-07-30 13:44:13

Hi @everyone, we are trying to extract lineage information and import it into Amundsen. Please point us in the right direction: based on the documentation (Databricks + Marquez + Amundsen), is this the only way to move on?

John Thomas (john.thomas@astronomer.io)
2022-07-30 13:49:25

*Thread Reply:* Short of implementing an open lineage endpoint in Amundsen, yes that's the right approach.

The Lineage endpoint in Marquez can output the whole graph centered on a node ID, and you can use the jobs/datasets apis to grab lists of each for reference
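For example, something like this (a sketch; the host and node ID are placeholders):

```
import requests

# Fetch the lineage graph around a dataset node from Marquez.
resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": "dataset:my-namespace:my-dataset"},
)
graph = resp.json().get("graph", [])  # nodes and edges centered on the given node
```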

Barak F (fargoun@gmail.com)
2022-07-31 00:35:06

*Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py

Where is your lineage coming from?

Dinakar Sundar (dinakar_sundar@condenast.com)
2022-08-01 20:17:22

*Thread Reply:* yes @Barak F we are using open lineage

Barak F (fargoun@gmail.com)
2022-08-02 01:26:18

*Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-03 08:24:58

*Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor

Not sure it solves your issue though 🙂

Dinakar Sundar (dinakar_sundar@condenast.com)
2022-08-05 04:46:45

*Thread Reply:* thanks

Michael Robinson (michael.robinson@astronomer.io)
2022-08-01 17:08:46

@channel OpenLineage 0.12.0 is now available!
We added:
• an Apache Flink integration,
• support for Spark 3.3.0,
• the ability to extend column level lineage mechanism,
• an ErrorMessageRunFacet to the OpenLineage spec,
• SQLCheckExtractors, a RedshiftSQLExtractor & RedshiftDataExtractor to the Airflow integration,
• a dataset builder to the AlterTableCommand class in the Spark integration.
We changed:
• the filtering of Delta events to reduce noise,
• the flow of metadata in the Airflow integration to allow metadata from Airflow through inlets and outlets.
Thanks to all the contributors who made this release possible! For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.12.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.11.0...0.12.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

❤️ Minkyu Park, Harel Shein, Willy Lulciuc, Peter Hicks, Fenil Doshi, Maciej Obuchowski, Howard Yoo, Paul Wilson Villena, Jarek Potiuk, Dinakar Sundar, Shubham Mehta, Sharanya Santhanam, Sheeri Cabral (Collibra)
🎉 Minkyu Park, Peter Hicks, Fenil Doshi, Howard Yoo, Jarek Potiuk, Paweł Leszczyński, Ryan Peterson
🚀 Minkyu Park, Howard Yoo, Jarek Potiuk
🙌 Minkyu Park, Willy Lulciuc, Maciej Obuchowski, Howard Yoo, Jarek Potiuk
Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-02 10:12:01

What is the right way of handling/parsing facets on the server side?

I see the generated server-side stubs are generic: https://github.com/OpenLineage/OpenLineage/blob/main/client/java/generator/src/main/java/io/openlineage/client/Generator.java#L131 and don't have any resolved facet information. Marquez seems to have duplicated the OL model with https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/models/LineageEvent.java#L71 and converts the incoming OL events to a “LineageEvent” for appropriate handling. Is there a cleaner approach wherein the known facets can be generated in io.openlineage.server?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-02 12:28:11

*Thread Reply:* I think the reason for server model being very generic is because new facets can be added later (also as custom facets) - and generally server wants to accept all valid events and get the facet information that it can actually use, rather than reject event because it has unknown field.

Server model was added here after some discussion in Marquez which is relevant - I think @Michael Collado @Willy Lulciuc can add to that
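To illustrate the pattern (a sketch, not actual Marquez code): treat facets as open maps and resolve only the ones the server knows about.

```
# Hypothetical server-side handling: accept any valid event, then pick out
# the facets this server understands and ignore the rest.
KNOWN_JOB_FACETS = {"documentation", "sql", "sourceCodeLocation", "ownership"}

def known_job_facets(event: dict) -> dict:
    facets = (event.get("job") or {}).get("facets") or {}
    return {name: body for name, body in facets.items() if name in KNOWN_JOB_FACETS}
```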

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-02 15:54:24

*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties in maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what the guidance is around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?

Michael Robinson (michael.robinson@astronomer.io)
2022-08-02 18:27:27

Agenda items are requested for the next OpenLineage Technical Steering Committee meeting on August 11 at 10am PT. Reply in thread or ping me with your item(s)!

Varun Singh (varuntestaz@outlook.com)
2022-08-03 04:16:22

Hi all, I am trying out the openlineage spark integration and can't find any column lineage information included with the events. I tried it out with an input dataset where I renamed one of the columns but the columnLineage facet was not present. Can anyone suggest some other examples where it might show up?

Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-03 04:45:36

*Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-08-05 05:55:12

*Thread Reply:* I’ve created an issue for column lineage in case of renaming: https://github.com/OpenLineage/OpenLineage/issues/993

Varun Singh (varuntestaz@outlook.com)
2022-08-08 09:37:43

*Thread Reply:* Thanks @Paweł Leszczyński!

Ross Turk (ross@datakin.com)
2022-08-03 12:58:44

Hey everyone! I am looking into Fivetran a bit, and it occurs to me that the NAMING.md document does not have an opinion about how to deal with entire systems as datasets. More in 🧵.

Ross Turk (ross@datakin.com)
2022-08-03 13:00:22

*Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.

This copying results in thousands of SQL queries run against the target database for each sync. I don’t think each of these queries should map to an OpenLineage job, I think the entire synchronization should. Maybe I’m wrong here.

Ross Turk (ross@datakin.com)
2022-08-03 13:01:00

*Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.

Ross Turk (ross@datakin.com)
2022-08-03 13:01:44

*Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating

Ross Turk (ross@datakin.com)
2022-08-03 13:02:27

*Thread Reply:* or is that good enough?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-04 10:31:11

*Thread Reply:* You are looking at a pretty big topic here 🙂

Basically you're asking what is a job in OpenLineage - and it's not fully answered yet.

I think the discussion is kinda relevant to this proposed facet and I kinda replied there: https://github.com/OpenLineage/OpenLineage/issues/812#issuecomment-1205337556

Harel Shein (harel.shein@gmail.com)
2022-08-04 15:50:22

*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset. And so maybe different objects within a Salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂

Ross Turk (ross@datakin.com)
2022-08-08 13:46:31

*Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.

I can see it being a nuisance having all of that on a lineage graph. OTOH, I can see it being useful to know that a datum can be traced back to a specific endpoint at SFDC.

Ross Turk (ross@datakin.com)
2022-08-08 13:46:55

*Thread Reply:* this is a design decision, IMO.

Michael Robinson (michael.robinson@astronomer.io)
2022-08-04 11:30:00

@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, August 11 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:

  1. Announcements
  2. Docs site update
  3. Release 0.11.0 and 0.12.0 overview
  4. Extractors: examples and how to write them
  5. Open discussion
Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
Zoom Video
🙌 Maciej Obuchowski, Harel Shein, Paul Wilson Villena
👀 Francis McGregor-Macdonald
Chris Coulthrust (coulthrust@gmail.com)
2022-08-06 12:06:47

👋 Hi everyone!

👋 Jakub Dardziński, Michael Robinson, Ross Turk, Harel Shein, Willy Lulciuc, Howard Yoo
Michael Robinson (michael.robinson@astronomer.io)
2022-08-10 11:00:01

@channel The next OpenLineage TSC meeting is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1659627000308969

👀 Howard Yoo
❤️ Minkyu Park
Will Johnson (will@willj.co)
2022-08-10 22:34:29

*Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:19:58

*Thread Reply:* We missed you too @Will Johnson 😉

Raj Mishra (hax0755@gmail.com)
2022-08-11 18:50:18

Hi everyone! I have a REST endpoint that I use for other pipelines that can POST their RunEvent, and I forward that to Marquez. I'm expecting a JSON which has the RunEvent details, which also has the input or output dataset depending upon the EventType. I can see the Run details always show up on the Marquez UI, but the dataset has issues. I can see the dataset listed, but when I click on it, it just shows "something went wrong." I don't see any details of that dataset.
```
{
  "eventType": "START",
  "eventTime": "2022-08-09T19:49:24.201361Z",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
  },
  "job": {
    "namespace": "TEST-NAMESPACE",
    "name": "test-job"
  },
  "inputs": [
    {
      "namespace": "TEST-NAMESPACE",
      "name": "my-test-input",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
          "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet",
          "fields": [
            { "name": "a", "type": "INTEGER" },
            { "name": "b", "type": "TIMESTAMP" },
            { "name": "c", "type": "INTEGER" },
            { "name": "d", "type": "INTEGER" }
          ]
        }
      }
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
}
```
In the above payload, the input dataset is never created in Marquez. I can only see the Run details, but the input dataset is just empty. Does the input dataset need to be created first, and only then can the RunEvent be created?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:09:57

*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:10:20

*Thread Reply:* If not, then Marquez logs might help to see something

Raj Mishra (hax0755@gmail.com)
2022-08-12 13:12:56

*Thread Reply:* Does the START event needs to have an output?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:19:24

*Thread Reply:* It can have empty output 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:32:43

*Thread Reply:* well, in your case you need to send COMPLETE event

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:33:44

*Thread Reply:* Internally, Marquez does not create a dataset version until you send a COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until the new one is finished writing.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:34:06

*Thread Reply:* After I sent a COMPLETE event with the same information, I can see the dataset.

Raj Mishra (hax0755@gmail.com)
2022-08-12 13:56:37

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski So, if I understand this correct. I won't see the my-test-input dataset till I have the COMPLETE event with input and output?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 14:34:51

*Thread Reply:* @Raj Mishra Yes and no 🙂

Basically your COMPLETE event does not need to contain any input and output datasets at all - OpenLineage model is cumulative, so it's enough to have datasets on either start or complete. That also means you can add different datasets in different moment of a run lifecycle - for example, you know inputs, but not outputs, so you emit inputs on START , but not COMPLETE.

Or, the job is modifying the same dataset it reads from (which happens surprisingly often), Then, you want to collect various input metadata from the dataset before modifying it - most likely you won't have them on COMPLETE 🙂

In this example I've added my-test-input on START and my-test-input2 on COMPLETE :
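A sketch of that pair of events (runId and timestamps are placeholders):

```
# Same runId across the run: one input on START, another added on COMPLETE.
start_event = {
    "eventType": "START",
    "eventTime": "2022-08-12T18:00:00.000Z",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "TEST-NAMESPACE", "name": "test-job"},
    "inputs": [{"namespace": "TEST-NAMESPACE", "name": "my-test-input"}],
    "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
}
complete_event = {
    **start_event,
    "eventType": "COMPLETE",
    "eventTime": "2022-08-12T18:05:00.000Z",
    "inputs": [{"namespace": "TEST-NAMESPACE", "name": "my-test-input2"}],
}
```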

Raj Mishra (hax0755@gmail.com)
2022-08-12 14:47:56

*Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-11 20:28:40

Effectively handling file datasets on the server side. We have a common use case where a dataset is produced/consumed per day as a dated file. On the Lineage UI/server side, it would be ideal to treat all files of this pattern as 1 dataset vs. 1 dataset per daily file. Any suggestions?

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-11 20:35:33

*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? I.e., the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:11:21

*Thread Reply:* Agreed 🙂

How do you produce this dataset? Spark integration? Are you using any system like Apache Iceberg/Delta Lake or just writing raw files?

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-12 12:59:48

*Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:27:34

*Thread Reply:* written using Spark dataframe API, like df.write.format("parquet").save("/tmp/spark_output/parquet") or RDD?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 13:27:59

*Thread Reply:* the actual API used matters, because we're handling different cases separately

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-12 13:29:48

*Thread Reply:* I see. Let me look that up to be absolutely sure

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-12 19:21:41

*Thread Reply:* It is like. this : df.write.format("parquet").save("/tmp/spark_output/parquet")

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-15 12:43:45

*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also what if we cannot integrate OL with the frameworks that produce this dataset , but only those that consume from the already produced datasets. Is there a way we could still capture the dataset appropriately ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-16 05:30:57

*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into the writers. For the rest, can you create an issue on the OL GitHub so someone can pick it up? I'm on vacation now.

Sharanya Santhanam (santhanamsharanya@gmail.com)
2022-08-16 15:08:41

*Thread Reply:* Sounds good , Ty !

Varun Singh (varuntestaz@outlook.com)
2022-08-12 06:02:00

Hi, Minor Suggestion: This line https://github.com/OpenLineage/OpenLineage/blob/46efab1e7c2a0aa5ebe8d11185fe8d5225[…]/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java is printing variables like api key and other parameters in the logs. Wouldn't it be more appropriate to use log.debug instead? I'll create an issue if others agree

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:09:11

*Thread Reply:* yes

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-12 06:09:32

*Thread Reply:* please do create 🙂

✅ Varun Singh
Conor Beverland (conorbev@gmail.com)
2022-08-15 09:01:47

dumb question but, is it easy to run all the OpenLineage tests locally? ( and if so how? 🙂 )

Julien Le Dem (julien@apache.org)
2022-08-17 13:54:19

*Thread Reply:* it's per project. java based: ./gradlew test python based: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development

Will Johnson (will@willj.co)
2022-08-18 23:45:30

Spark Integration: The Order of Processing Events in the Async Event Queue

Hey, OpenLineage team, I'm working on a PR (https://github.com/OpenLineage/OpenLineage/pull/849/) that is going to store information given in different spark events (e.g. SparkListenerSQLExecutionStart, SparkListenerJobStart).

However, I want to avoid holding all this data once the execution of the job is complete. As a result, I want to remove the data once I receive a SparkListenerSQLExecutionEnd.

However, can I be guaranteed that the ExecutionEnd event will be processed AFTER the JobStart event? Is it possible that processing the JobStart event takes so long that the ExecutionEnd executes prior to the JobStart finishing?

I know we do something similar to this with sparkSqlExecutionRegistry (https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/mai[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java) but do we have any docs to help explain how the AsyncEventQueue orders and consumes events for a listener?

Thank you so much for any insights

Julien Le Dem (julien@apache.org)
2022-08-19 18:38:10

*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay, I am personally not sure but if it's not too urgent you can have an answer when knowledgable folks are back.

Will Johnson (will@willj.co)
2022-08-19 20:21:18

*Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-29 05:31:32

*Thread Reply:* @Paweł Leszczyński might want to look at that

Hanbing Wang (doris.wang200902@gmail.com)
2022-08-19 01:53:56

Hi, I'm trying to find out if the OpenLineage Spark integration supports PySpark (non-SQL) use cases. Is there any doc where I could get more details about non-SQL OpenLineage support? Thanks a lot

Julien Le Dem (julien@apache.org)
2022-08-19 12:30:08

*Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.

Hanbing Wang (doris.wang200902@gmail.com)
2022-08-19 13:49:35

*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage. My company wants to integrate with openlineage-spark; I am working on figuring out what info OpenLineage makes available for non-SQL, and whether it at least has support for logging the logical plan.

Julien Le Dem (julien@apache.org)
2022-08-19 18:26:48

*Thread Reply:* Yes, it does send the logical plan as part of the event

Julien Le Dem (julien@apache.org)
2022-08-19 18:27:32

*Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/

openlineage.io
Julien Le Dem (julien@apache.org)
2022-08-19 18:28:11

*Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"

Julien Le Dem (julien@apache.org)
2022-08-19 18:28:26

*Thread Reply:* you need to add the jar, set the listener and pass your OL config
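Something along these lines in PySpark (a sketch; the package version, host, and namespace values are placeholders):

```
from pyspark.sql import SparkSession

# Attach the OpenLineage listener to a PySpark session.
spark = (
    SparkSession.builder.appName("ol-demo")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.13.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)
```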

Julien Le Dem (julien@apache.org)
2022-08-19 18:31:11

*Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/

pretalx
Julien Le Dem (julien@apache.org)
2022-08-19 18:32:11

*Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video

Hanbing Wang (doris.wang200902@gmail.com)
2022-08-19 18:35:50

*Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.

Julien Le Dem (julien@apache.org)
2022-08-19 18:40:10

*Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration

Michael Robinson (michael.robinson@astronomer.io)
2022-08-22 14:32:53

Hi everyone, a release has been requested by @Harel Shein. As per our policy here, 3 +1s from committers will authorize an immediate release. Thanks! Unreleased commits: https://github.com/OpenLineage/OpenLineage/compare/0.12.0...HEAD

➕ Willy Lulciuc, Michael Robinson, Minkyu Park, Jakub Dardziński, Julien Le Dem
Willy Lulciuc (willy@datakin.com)
2022-08-22 14:38:58

*Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2022-08-22 15:00:37

*Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein

🙌 Willy Lulciuc, Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2022-08-22 16:18:30

@channel OpenLineage 0.13.0 is now available!
We added:
• BigQuery check support
• RUNNING EventType in the spec and Python client
• databases and schemas to SQL extractors
• an event forwarding feature via HTTP
• Azure Cosmos Handler to the Spark integration
• support for OL datasets in manual lineage inputs/outputs
• ownership facets.
We changed:
• use RUNNING EventType in Flink integration for currently running jobs
• convert task object into JSON encodable when creating Airflow version facet.
Thanks to all the contributors who made this release possible! For the bug fixes and more details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.13.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.12.0...0.13.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🎉 Harel Shein, Ross Turk, Jarek Potiuk, Sheeri Cabral (Collibra), Willy Lulciuc, Howard Yoo, Howard Yoo, Ernie Ostic, Francis McGregor-Macdonald
✅ Sheeri Cabral (Collibra), Howard Yoo
Conor Beverland (conorbev@gmail.com)
2022-08-23 03:55:24

*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration ?

AMRIT SARKAR (sarkaramrit2@gmail.com)
2022-08-24 08:23:35

Hi everyone, excited to work with OpenLineage. I am new to both OpenLineage and Data Lineage in general. Are there working examples/blog posts around actually integrating OpenLineage with existing graph DBs like Neo4J, Neptune etc? (I understand the service layer in between) I understand we have Amundsen with sample open lineage sample data - databuilder/example/sample_data/openlineage/sample_openlineage_events.ndjson. Thanks in advance.

Julien Le Dem (julien@apache.org)
2022-08-25 18:15:59

*Thread Reply:* Not that I know of, besides the Amundsen integration example you pointed at. A basic idea to do such a thing would be to implement an OpenLineage endpoint (receive the lineage events through HTTP POSTs) and convert them to a format the graph db understands. If others in the community have ideas, please chime in.
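A minimal sketch of such an endpoint (Flask here, purely as an example):

```
from flask import Flask, request

app = Flask(__name__)

@app.route("/api/v1/lineage", methods=["POST"])
def receive_lineage():
    event = request.get_json()
    # Convert the event into graph-db writes here, e.g. merge a node per
    # dataset and job, plus edges between them.
    for ds in event.get("inputs", []) + event.get("outputs", []):
        print(event["job"]["name"], "<->", f'{ds["namespace"]}:{ds["name"]}')
    return "", 201
```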

AMRIT SARKAR (sarkaramrit2@gmail.com)
2022-09-01 13:48:09

*Thread Reply:* Understood, thanks a lot Julien. Make sense.

Harel Shein (harel.shein@gmail.com)
2022-08-25 17:30:46

Hey all, can I ask for a release for OpenLineage?

👍 Harel Shein, Minkyu Park, Michael Robinson, Michael Collado, Ross Turk, Julien Le Dem, Willy Lulciuc, Maciej Obuchowski
Willy Lulciuc (willy@datakin.com)
2022-08-25 17:32:44

*Thread Reply:* @Michael Robinson ^

Michael Robinson (michael.robinson@astronomer.io)
2022-08-25 17:34:04

*Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.

Minkyu Park (minkyu@datakin.com)
2022-08-25 17:52:40

*Thread Reply:* 🙏

Michael Robinson (michael.robinson@astronomer.io)
2022-08-25 18:09:51

*Thread Reply:* Thanks, all. The release is authorized

🎉 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2022-08-25 18:16:44

*Thread Reply:* can you also state the main purpose for this release?

Michael Robinson (michael.robinson@astronomer.io)
2022-08-25 18:25:49

*Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality

Minkyu Park (minkyu@datakin.com)
2022-08-25 18:27:53

*Thread Reply:* The ParentRunFacet from the Airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that, so that Marquez can handle parent run/job information.

Michael Robinson (michael.robinson@astronomer.io)
2022-08-25 18:49:30

@channel OpenLineage 0.13.1 is now available!
We fixed:
• Rename all parentRun occurrences to parent from Airflow integration #1037 @fm100
• Do not change task instance during on_running event #1028 @JDarDagran
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.13.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.13.0...0.13.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🎉 Harel Shein, Minkyu Park, Ross Turk, Michael Collado, Howard Yoo
❤️ Minkyu Park, Ross Turk, Howard Yoo
🥳 Minkyu Park, Ross Turk, Howard Yoo
Jason (shzhan@coupang.com)
2022-08-26 18:58:17

Hi, I am new to OpenLineage. Anyone know how to enable Spark column-level lineage? I saw the code comment; it said the default is disabled. Thanks

Harel Shein (harel.shein@gmail.com)
2022-08-26 19:26:22

*Thread Reply:* What version of Spark are you using? it should be enabled by default for Spark 3 https://openlineage.io/docs/integrations/spark/spark_column_lineage

openlineage.io
Jason (shzhan@coupang.com)
2022-08-26 20:21:12

*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again

Jason (shzhan@coupang.com)
2022-08-29 13:14:01

*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal (#148); it should be supported from 0.9.+. Am I missing something?

Jason (shzhan@coupang.com)
2022-08-29 13:14:41

*Thread Reply:* @Harel Shein

Will Johnson (will@willj.co)
2022-08-30 00:56:18

*Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?

Jason (shzhan@coupang.com)
2022-08-30 13:51:03

*Thread Reply:* I tried reading a Hive metastore on S3 and a CSV file locally. All are missing the columnLineage facet

Will Johnson (will@willj.co)
2022-08-31 00:33:17

*Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

Jason (shzhan@coupang.com)
2022-08-31 20:00:47

*Thread Reply:* ```spark.read \
    .option("header", "true") \
    .option("inferschema", "true") \
    .csv("data/input/batch/wikidata.csv") \
    .write \
    .mode('overwrite') \
    .csv("data/output/batch/python-sample.csv")```

Jason (shzhan@coupang.com)
2022-08-31 20:01:21

*Thread Reply:* This is simple code run on my local for testing

Will Johnson (will@willj.co)
2022-08-31 21:41:31

*Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.

Will Johnson (will@willj.co)
2022-08-31 21:42:05

*Thread Reply:* Specifically this commit is what implemented it. https://github.com/OpenLineage/OpenLineage/commit/ce30178cc81b63b9930be11ac7500ed34808edd3

Jason (shzhan@coupang.com)
2022-08-31 22:02:16

*Thread Reply:* I see. I use 0.13.0

Harel Shein (harel.shein@gmail.com)
2022-09-01 12:04:41

*Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow

Jason (shzhan@coupang.com)
2022-09-01 12:52:52

*Thread Reply:* Great. Thanks Harel.

Raj Mishra (hax0755@gmail.com)
2022-08-28 17:46:38

Hi! I have run into some issues and wanted to clarify my doubts.
• Why don't input schema changes (column deletes, new columns) show up on the UI? I have changed the input schema for the same job, but I'm not seeing it updated on the UI.
• Why is there only ever 1 input schema version? For every change I make to the input schema, I see multiple versions of the output schema but only 1 version of the input schema.
• Is there a reason why we can't see the input schema until the COMPLETE event is posted?
I have used the examples from here: https://openlineage.io/getting-started/
curl -X POST <http://localhost:5000/api/v1/lineage> \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "START",
        "eventTime": "2020-12-28T19:52:00.001+10:00",
        "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
        "job": { "namespace": "my-namespace", "name": "my-job" },
        "inputs": [{ "namespace": "my-namespace", "name": "my-input" }],
        "producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
      }'
curl -X POST <http://localhost:5000/api/v1/lineage> \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "COMPLETE",
        "eventTime": "2020-12-28T20:52:00.001+10:00",
        "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
        "job": { "namespace": "my-namespace", "name": "my-job" },
        "outputs": [{
          "namespace": "my-namespace",
          "name": "my-output",
          "facets": {
            "schema": {
              "_producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>",
              "_schemaURL": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet>",
              "fields": [
                { "name": "a", "type": "VARCHAR" },
                { "name": "b", "type": "VARCHAR" }
              ]
            }
          }
        }],
        "producer": "<https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client>"
      }'
Changing the input schema for START doesn't change the input schema version and doesn't update the UI. Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-29 05:29:52

*Thread Reply:* Reading a dataset - which is what an input dataset implies - does not mutate the dataset 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-29 05:30:14

*Thread Reply:* If you changed the dataset, that would be represented as some other job with this dataset in its outputs list

Raj Mishra (hax0755@gmail.com)
2022-08-29 12:42:55

*Thread Reply:* So, changing the input dataset will always create new output dataset versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-01 08:35:42

*Thread Reply:* @Raj Mishra if the input is changing, there should be something else in your data infrastructure that changes this dataset - and that something should emit this dataset as an output
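
For illustration, here's roughly what that upstream job could look like with the Python client - a minimal sketch assuming openlineage-python ~0.14 and a local Marquez; the job name "upstream-job" is made up, while the namespace/dataset names match the curl examples above:

    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.facet import SchemaDatasetFacet, SchemaField
    from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

    client = OpenLineageClient(url="http://localhost:5000")

    # emitting "my-input" as an *output* of some upstream job is what
    # creates a new version of it (and of its schema) in Marquez
    client.emit(
        RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid4())),
            job=Job(namespace="my-namespace", name="upstream-job"),
            producer="https://example.com/demo",
            outputs=[
                Dataset(
                    namespace="my-namespace",
                    name="my-input",
                    facets={"schema": SchemaDatasetFacet(
                        fields=[SchemaField(name="a", type="VARCHAR")]
                    )},
                ),
            ],
        )
    )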

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-08-29 12:21:52

Hi Everyone, new here. I went through the docs and examples, but I can't seem to understand how I can model views on top of base tables when the lineage doesn't come from a data processing job, but rather from modeling something static that comes from some software internals. I.e., I want to issue the lineage myself rather than have it learned dynamically from some Airflow DAG or Spark DAG

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-29 12:35:32

*Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-29 12:35:46

*Thread Reply:* (docs in progress 😉)

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-08-30 02:07:02

*Thread Reply:* can you give a hint what I should look for to model a dataset on top of another dataset? potentially also mapping columns?

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-08-30 02:12:50

*Thread Reply:* i can only see that i can have a dataset as an input to a job run, not as an input to another dataset

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-01 08:34:35

*Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-01 10:30:51

*Thread Reply:* so OpenLineage forces me to put a job between datasets? that does not fit our use case

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-01 10:31:09

*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.

Jason (shzhan@coupang.com)
2022-08-29 20:41:19

QQ, I saw that Spark column-level lineage starts with OpenLineage 0.9.+ and Spark 3.+. Does that mean it needs to run on lower than OpenLineage 0.9 if our Spark is 2.3 or 2.4?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-30 04:44:06

*Thread Reply:* I don't think it will work for Spark 2.X.

Jason (shzhan@coupang.com)
2022-08-30 13:42:20

*Thread Reply:* Is there a plan to support Spark 2.x?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-08-30 14:00:38

*Thread Reply:* Nope - on the other hand, we plan to drop any support for it, as it's been unmaintained for quite a while and vendors are dropping support for it too - afaik Databricks in April 2023.

Jason (shzhan@coupang.com)
2022-08-30 17:19:43

*Thread Reply:* I see. Thanks. Amazon EMR still supports Spark 2.x

Will Johnson (will@willj.co)
2022-08-30 01:15:10

Spark Integration: Handling Data Source V2 API datasets

Is it expected that a DataSourceV2 relation has a start event with inputs and outputs but a complete event with only outputs? Based on @Michael Collado’s previous comments, I think it's fair to say YES this is expected and we just need to handle it. https://openlineage.slack.com/archives/C01CK9T7HKR/p1645037070719159?thread_ts=1645036515.163189&cid=C01CK9T7HKR

@Hanna Moazam and I noticed this behavior when we looked at the Cosmos Db visitor and then reproduced it for the Iceberg visitor. We traced it down to the fact that the AbstractQueryPlanInputDatasetBuilder (which is the parent of DataSourceV2RelationInputDatasetBuilder) has an isDefinedAt that only includes SparkListenerJobStart and SparkListenerSQLExecutionStart

This means an Iceberg COMPLETE event will NEVER contain inputs because the isDefinedAt will always be false (since COMPLETE only fires for JobEnd and ExecutionEnd events). Does that sound correct (@Paweł Leszczyński)?

It seems that Delta tables (or at least Delta on Databricks) do not follow this same code path, and as a result our COMPLETE events include outputs AND inputs.

👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-01 05:56:13

*Thread Reply:* At least for Iceberg I've done it that way, since I want to emit the DatasetVersionDatasetFacet for the input dataset only at START - after I finish writing, the dataset might have a different version than before writing.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-01 05:58:59

*Thread Reply:* The same should hold for outputs AFAIK - the output version should be emitted only on COMPLETE, since the version changes after I finish writing.

Will Johnson (will@willj.co)
2022-09-01 09:52:30

*Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-01 10:30:41

*Thread Reply:* Yes

👍 Will Johnson
Will Johnson (will@willj.co)
2022-09-01 10:31:21

*Thread Reply:* As usual, thank you Maciej for the responses and insights!

🙌 Maciej Obuchowski
Jason (shzhan@coupang.com)
2022-08-31 22:19:44

QQ team, I use Spark SQL with the OpenLineage namespace weblog: spark.sql("select * from weblog where dt='1'").write.orc("…") There are two issues: 1) there is no upstream dataset weblog on the Marquez UI; 2) a new namespace s3-cdp-prod-hive was created - it should be the S3 bucket. Am I missing something? Thanks

Jason (shzhan@coupang.com)
2022-09-07 14:13:34

*Thread Reply:* Can anyone help with this? Am I missing something?

Jason (shzhan@coupang.com)
2022-08-31 22:21:57

Here is the Marquez UI

Michael Robinson (michael.robinson@astronomer.io)
2022-09-01 07:34:24

Hi everyone, I’m opening up a vote on this month’s OpenLineage release. 3 +1s from committers will authorize. Additions include support for KustoRelationHandler in Kusto (Azure Data Explorer) and for ABFSS and Hadoop Logical Relation, both in the Spark integration. All commits can be found here: https://github.com/OpenLineage/OpenLineage/compare/0.13.1...HEAD. Thanks in advance!

➕ Maciej Obuchowski, Ross Turk, Paweł Leszczyński, Will Johnson, Hanna Moazam
Michael Robinson (michael.robinson@astronomer.io)
2022-09-01 13:18:59

*Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.

🙌 Will Johnson, Maciej Obuchowski
srutikanta hota (srutikanta.hota@gmail.com)
2022-09-05 07:57:02

Is there a reference on how to deploy OpenLineage on non-AWS infrastructure?

Will Johnson (will@willj.co)
2022-09-08 10:31:44

*Thread Reply:* Which integration are you looking to implement?

And what environment are you looking to deploy it on? The Cloud? On-Prem?

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-08 10:40:11

*Thread Reply:* We are planning to deploy on-premises, with Kerberos as authentication for Postgres

Will Johnson (will@willj.co)
2022-09-08 11:27:06

*Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?

https://github.com/OpenLineage/OpenLineage/tree/main/integration

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-08 11:33:44

*Thread Reply:* I am looking to deploy Marquez on-prem, with an on-prem Postgres as the back-end, with Kerberos authentication.

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-08 11:34:32

*Thread Reply:* Is this the right forum for Marquez as well, or is there a different Slack channel for Marquez?

Will Johnson (will@willj.co)
2022-09-08 11:46:35

*Thread Reply:* https://bit.ly/MarquezSlack

Will Johnson (will@willj.co)
2022-09-08 11:47:14

*Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.

Michael Robinson (michael.robinson@astronomer.io)
2022-09-06 15:52:32

@channel OpenLineage 0.14.0 is now available! We added: • Support ABFSS and Hadoop Logical Relation in Column-level lineage #1008 @wjohnson • Add Kusto relation visitor #939 @hmoazam • Add ColumnLevelLineage facet doc #1020 @julienledem • Include symlinks dataset facet #935 @pawel-big-lebowski • Add support for dbt 1.3 beta’s metadata changes #1051 @mobuchowski • Support Flink 1.15 #1009 @mzareba382 • Add Redshift dialect to the SQL integration #1066 @mobuchowski We changed: • Make the timeout configurable in the Spark integration #1050 @tnazarew We fixed: • Add a dialect parameter to Great Expectations SQL parser calls #1049 @collado-mike • Fix Delta 2.1.0 with Spark 3.3.0 #1065 @pawel-big-lebowski Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.14.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.13.1...0.14.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

❤️ Willy Lulciuc, Howard Yoo, Alexander Wagner, Hanna Moazam, Minkyu Park, Grayson Stream, Paweł Leszczyński, Maciej Obuchowski, Conor Beverland, Jason
Willy Lulciuc (willy@datakin.com)
2022-09-06 15:54:30

*Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯

🙌 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2022-09-07 09:05:35

Hello all, I’m requesting a patch release to fix a bug in the Spark integration. Currently, OpenlineageSparkListener fails when no openlineage.timeout is provided. PR #1069 by @Paweł Leszczyński, merged today, will fix it. As per our policy here, 3 +1s from committers will authorize an immediate release.

➕ Paweł Leszczyński, Maciej Obuchowski, Howard Yoo, Willy Lulciuc, Ross Turk, Julien Le Dem
Willy Lulciuc (willy@datakin.com)
2022-09-07 10:00:11

*Thread Reply:* Is PR #1069 all that’s going in 0.14.1 ?

Michael Robinson (michael.robinson@astronomer.io)
2022-09-07 10:27:39

*Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…

🙌 Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2022-09-07 10:30:31

*Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)

Willy Lulciuc (willy@datakin.com)
2022-09-07 10:39:32

*Thread Reply:* Thanks for clarifying!

Michael Robinson (michael.robinson@astronomer.io)
2022-09-07 10:50:29

*Thread Reply:* Thanks, all. The release is authorized.

❤️ Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-07 11:04:39

*Thread Reply:* 1058 also fixes some bugs

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-08 01:55:41

Hello all, question: views on top of base tables are also a use case for lineage, and there is no job in between. I can't seem to find a way to have a dataset on top of other datasets to represent a view on top of tables. Is there a way to do that without a job in between?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-08 04:41:07

*Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations

Besides that, there is this proposal that did not get enough love yet https://github.com/OpenLineage/OpenLineage/issues/323

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-08 04:53:23

*Thread Reply:* but we are not working with dbt. we're trying to model the lineage of our internal view/table hierarchy, which is related to a proprietary application of ours. so we like that OpenLineage lets me explicitly model stuff and not only via scanning some DW. but in that case we don't want a job in between.

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-08 04:58:47

*Thread Reply:* this proposal does not seem to support lineage between datasets

Ross Turk (ross@datakin.com)
2022-09-08 12:49:48

*Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.

In OpenLineage, something observes the lineage relationship being created.

Ross Turk (ross@datakin.com)
2022-09-08 12:50:13

*Thread Reply:*

🙌 Will Johnson, Maciej Obuchowski
Ross Turk (ross@datakin.com)
2022-09-08 12:51:15

*Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.

Ross Turk (ross@datakin.com)
2022-09-08 12:54:27

*Thread Reply:* so in this case, according to openlineage 🙂, the job would be whatever runs within the pipeline that creates the view. very operational point of view.

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-11 12:27:42

*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base table relationships

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-11 12:28:05

*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job?

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-11 12:31:57

*Thread Reply:* would you say that because this is my use case, I might be better off choosing some other lineage tool?

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-11 12:33:04

*Thread Reply:* for the context: i am not talking about view and table definitions in some warehouse, e.g. SF, but our internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility

Ross Turk (ross@datakin.com)
2022-09-12 17:20:13

*Thread Reply:* Ah, gotcha. Yeah, I would say it's probably best to create a job in this case. You can send the view definition using a SourceCodeJobFacet, so it will be collected as well. You'd want to send START and COMPLETE events for it.
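
A hedged sketch of that suggestion with the Python client (the job/dataset names and the SQL text are illustrative, not from the thread):

    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.facet import SourceCodeJobFacet
    from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

    client = OpenLineageClient(url="http://localhost:5000")
    run = Run(runId=str(uuid4()))
    # the "job" stands in for whatever defines the view; the view DDL
    # rides along as a source code facet
    job = Job(
        namespace="my-namespace",
        name="define_view.my_view",
        facets={"sourceCode": SourceCodeJobFacet("sql", "CREATE VIEW my_view AS SELECT ...")},
    )

    for state in (RunState.START, RunState.COMPLETE):
        client.emit(RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer="https://example.com/demo",
            inputs=[Dataset(namespace="my-namespace", name="base_table")],
            outputs=[Dataset(namespace="my-namespace", name="my_view")],
        ))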

Ross Turk (ross@datakin.com)
2022-09-12 17:22:03

*Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”

Michael Robinson (michael.robinson@astronomer.io)
2022-09-09 10:25:21

@channel OpenLineage 0.14.1 is now available! We fixed: • Fix Spark integration issues including error when no openlineage.timeout #1069 @pawel-big-lebowski Bug fixes were also included in this release. Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.14.1 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.14.0...0.14.1 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

🙌 Maciej Obuchowski, Willy Lulciuc, Howard Yoo, Francis McGregor-Macdonald, AMRIT SARKAR
data_fool (data.fool.me@gmail.com)
2022-09-09 13:52:39

Hello, any future plans for integrating Airbyte with openlineage?

👋 Willy Lulciuc, Maciej Obuchowski
Willy Lulciuc (willy@datakin.com)
2022-09-09 14:01:13

*Thread Reply:* Hey, @data_fool! Not in the near term. but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?

data_fool (data.fool.me@gmail.com)
2022-09-09 15:36:20

*Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!

🙌 Willy Lulciuc
Hubert Dulay (hubert.dulay@gmail.com)
2022-09-10 22:00:10

Hi can you create lineage across namespaces? Thanks

Julien Le Dem (julien@apache.org)
2022-09-12 19:26:25

*Thread Reply:* yes!

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-26 10:31:56

*Thread Reply:* Any example or ticket on how to create lineage across namespaces?

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-09-12 02:27:49

Hello, Does OpenLineage support column level lineage?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-12 04:56:13

*Thread Reply:* Yes https://openlineage.io/blog/column-lineage/

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-09-22 02:18:45

*Thread Reply:* • More details on the Spark & column-level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage
• Proposal on how to implement column-level lineage in Marquez (implementation is currently work in progress): https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md
@Iftach Schonbaum let us know if you find the information useful.

Paul Lee (paullee@lyft.com)
2022-09-12 15:29:12

where can i find docs on simply using extractors, without Marquez? for example, a basic BashOperator on Airflow 1.10.15

Paul Lee (paullee@lyft.com)
2022-09-12 15:30:08

*Thread Reply:* or is it automatic for anything that exists in extractors/?

Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:30:16

*Thread Reply:* Yes

👍 Paul Lee
:gratitude_thank_you: Paul Lee
Paul Lee (paullee@lyft.com)
2022-09-12 15:31:12

*Thread Reply:* so anything i add to extractors directory with the same name as the operator will automatically extract the metadata from the operator is that correct?

Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:31:31

*Thread Reply:* Well, not entirely

Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:31:47

*Thread Reply:* please take a look at the source code of one of the extractors

👍 Paul Lee
Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:32:13

*Thread Reply:* also, there are docs available at openlineage.io/docs

🙏 Paul Lee
Paul Lee (paullee@lyft.com)
2022-09-12 15:33:45

*Thread Reply:* ok, i'll take a look. i think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos i found were integrated with marquez

Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:34:29

*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do need it.

Howard Yoo (howard.yoo@astronomer.io)
2022-09-12 15:34:47

*Thread Reply:* If you do not want to run Marquez but just want to test out OpenLineage, you can also take a look at the OpenLineage Proxy.

👍 Paul Lee
Paul Lee (paullee@lyft.com)
2022-09-12 15:35:14

*Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to

👍 Howard Yoo
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-12 16:01:45

*Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read

🎉 Paul Lee
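
For reference, a minimal custom extractor looks roughly like this - a sketch against the 0.14-era API, where MyOperator and the my_company paths are hypothetical. Registration happens through an environment variable of the form OPENLINEAGE_EXTRACTOR_<OperatorClassName>:

    # OPENLINEAGE_EXTRACTOR_MyOperator=my_company.extractors.MyExtractor
    from typing import List, Optional

    from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

    class MyExtractor(BaseExtractor):
        @classmethod
        def get_operator_classnames(cls) -> List[str]:
            return ["MyOperator"]  # operator classes this extractor handles

        def extract(self) -> Optional[TaskMetadata]:
            # inputs/outputs take openlineage Dataset objects
            return TaskMetadata(
                name=f"{self.operator.dag_id}.{self.operator.task_id}",
                inputs=[],
                outputs=[],
            )
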
Paul Lee (paullee@lyft.com)
2022-09-12 17:08:49

*Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏

Jay (sanjay.sudhakaran@trovemoney.co.nz)
2022-09-21 20:55:24

Hey team! I’m pretty new to the field in general

In the real world, I would be running pyspark scripts on AWS EMR. Could you explain to me how the metadata is sent to Marquez from my pyspark script, and where it’s persisted?

Would I need to set up an S3 bucket to store the lineage data?

I’m also unsure about how I would run the Marquez UI on AWS - Would I need to have an EC2 instance running permanently in order to access that UI?

Jay (sanjay.sudhakaran@trovemoney.co.nz)
2022-09-21 20:57:39

*Thread Reply:* In my head, I have:

Pyspark script -> Store metadata in S3 -> Marquez UI gets data from S3 and displays it

I suspect this is incorrect?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-09-22 02:14:50

*Thread Reply:* It’s more like: you add the OpenLineage jar to the Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez)
• send as an event onto Kafka
• print it onto the console
There is no S3 in between Spark & Marquez by default. Marquez serves both as an API where events are sent and a UI to investigate them.
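
A minimal sketch of that wiring from PySpark (0.14-era parameter names; the endpoint and namespace values are illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("my-job")
        # pull the OpenLineage jar onto the cluster
        .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.14.1")
        .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
        # where run events get POSTed - e.g. a Marquez instance
        .config("spark.openlineage.host", "http://localhost:5000")
        .config("spark.openlineage.namespace", "my-namespace")
        .getOrCreate()
    )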

Jay (sanjay.sudhakaran@trovemoney.co.nz)
2022-09-22 17:36:10

*Thread Reply:* Yeah S3 was just an example for a storage option.

I actually found the answer I was looking for, turns out I had to look at Marquez documentation: https://marquezproject.ai/resources/deployment/

The answer is that Marquez uses a postgres instance to persist the metadata it is given. Thanks for your time though! I appreciate the effort 🙂

👍 Kevin Adams
Hanbing Wang (doris.wang200902@gmail.com)
2022-09-25 17:06:41

Hello team, for OpenLineage Spark, even when I process one Spark SQL query (CTAS, Create Table As Select), I receive multiple events back (2+ START events, 2 COMPLETE events). I'm trying to understand why OpenLineage needs to send back that many events, and what the primary differences are between START vs. START events and START vs. COMPLETE events. Is there any doc that can help me understand this better? Thanks

Will Johnson (will@willj.co)
2022-09-26 00:27:05

*Thread Reply:* The Spark execution model follows:

  1. Spark SQL Execution Start event
  2. Spark Job Start event
  3. Spark Job End event
  4. Spark SQL Execution End event

As a result, OpenLineage tracks all of those executions and jobs. There is a proposed plan to distinguish between those events (e.g. you wouldn't get two starts but one Start and one Job Start, or something like that).

You should collect all of these events in order to be sure you are receiving all the data, since each event may contain a subset of the complete facets that represent what occurred in the job.

Hanbing Wang (doris.wang200902@gmail.com)
2022-09-26 15:16:26

*Thread Reply:* Thanks @Will Johnson. Can I get an example of how the proposed plan could be used to distinguish between Start and Job Start events? Because when I compare the 2 START events I got, only the event_time is different; all other information is the same.

Hanbing Wang (doris.wang200902@gmail.com)
2022-09-26 15:30:34

*Thread Reply:* One follow-up question: if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect:
(1) 1 Spark SQL execution start event
(2) 3 Spark job start events (each query has a job start event)
(3) 3 Spark job end events (each query has a job end event)
(4) 1 Spark SQL execution end event

Will Johnson (will@willj.co)
2022-09-27 10:25:47

*Thread Reply:* Re: distinguishing between Start and Job Start events: there was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636), but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599. As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.

Re: multiple queries in one command: this is where Spark's execution model comes into play. I believe each one of those commands is executed sequentially, and as a result you'd actually get three execution starts and three execution ends. If you chose DROP + Create Table As Select, that would be only two commands and thus only two execution start events.

Hanbing Wang (doris.wang200902@gmail.com)
2022-09-27 16:49:37

*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson. For multiple queries in one command, I'm still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.

When I test a Drop + Create Table query: DROP TABLE IF EXISTS shadow_test.test_sparklineage_4; CREATE TABLE IF NOT EXISTS shadow_test.test_sparklineage_4 (val INT, region STRING) PARTITIONED BY ( ds STRING ) STORED AS PARQUET; I only received 1 START + 1 COMPLETE event, and the events only contain DropTableCommandVisitor/DropTableCommand. I expected we would also receive START and COMPLETE events for the Create Table query with CreateTableCommandVisitor/CreateTableCommand.

But when I test a Drop + Create Table As Select query: DROP TABLE IF EXISTS shadow_test.test_sparklineage_5; CREATE TABLE IF NOT EXISTS shadow_test.test_sparklineage_5 AS SELECT * from shadow_test.test_sparklineage where ds > '2022-08-24' I received 1 START + 1 COMPLETE event with DropTableCommandVisitor/DropTableCommand and 2 START + 2 COMPLETE events with CreateHiveTableAsSelectCommandVisitor/CreateHiveTableAsSelectCommand

Will Johnson (will@willj.co)
2022-09-27 22:03:38

*Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?

I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-28 09:18:48

*Thread Reply:* > For multiple queries in one command, I'm still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.
@Hanbing Wang That's basically why we capture all the events (SQL Execution, Job) instead of one of them. We're just inconsistently notified of them by Spark.

Some computations emit SQL Execution events, some emit Job events; I think the majority emit both. This also differs by Spark version.

The solution OpenLineage assumes is a cumulative model of job execution, where your backend deals with possible duplication of information.

> I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.
@Will Johnson would be great if you created an issue with some complete examples
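
Not an official API, but the "cumulative model" amounts to something like this on the consumer side - fold every event for a runId into one picture, merging dataset facets rather than replacing them:

    from collections import defaultdict

    runs = defaultdict(lambda: {"inputs": {}, "outputs": {}})

    def consume(event: dict) -> None:
        acc = runs[event["run"]["runId"]]
        for side in ("inputs", "outputs"):
            for ds in event.get(side) or []:
                key = (ds["namespace"], ds["name"])
                # later events may carry extra facets; merge, don't overwrite
                acc[side].setdefault(key, {}).update(ds.get("facets") or {})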

Hanbing Wang (doris.wang200902@gmail.com)
2022-09-28 15:44:45

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help. We are not running on Databricks. We implemented the OpenLineage Spark listener and customized the event transport, which emits the events to our own events pipeline, with a Hive metastore. We are using Spark version 3.2.1 and OpenLineage version 0.14.1

Will Johnson (will@willj.co)
2022-09-29 15:16:28

*Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.

Will Johnson (will@willj.co)
2022-09-29 15:17:08

*Thread Reply:* @Maciej Obuchowski - I'll add it to my list!

Hanbing Wang (doris.wang200902@gmail.com)
2022-09-30 15:34:01

*Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.

Yujia Yang (yujia@tubi.tv)
2022-09-26 03:46:19

Hi team, I find OpenLineage posts a lot of run events to the backend.

eg. I submit a jar to the Spark cluster with computations like:

  1. count from table1 --> this will have more than one run event with inputs:[table1], outputs:[]
  2. count from table2 --> this will have more than one run event with inputs:[table2], outputs:[]
  3. write Seq[(t1, count1), (t2, count2)] to table3 --> this may give inputs:[], outputs:[table3]

Can I just get one post with a summary telling me inputs:[table1, table2], outputs:[table3], alongside a merged columnLineage?
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-09-28 08:34:20

*Thread Reply:* One of the assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages, like sending events for jobs which suddenly fail, sending events immediately, etc.

The events can then be merged at the backend side. The behavior you describe can be achieved by using a backend like Marquez and the Marquez API to obtain the combined data.

Currently, we’re developing a dedicated column-lineage endpoint in Marquez according to the proposal: https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md This will allow you to request the whole column lineage graph based on multiple jobs.

:gratitude_thank_you: Yujia Yang
👀 Yujia Yang
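
Once Marquez has ingested the events, the combined graph around a dataset can be requested from its lineage endpoint - a hedged sketch (nodeId format "dataset:<namespace>:<name>"; the names are illustrative):

    import requests

    resp = requests.get(
        "http://localhost:5000/api/v1/lineage",
        params={"nodeId": "dataset:my-namespace:table3"},
    )
    resp.raise_for_status()
    for node in resp.json()["graph"]:
        print(node["id"], node["type"])
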
srutikanta hota (srutikanta.hota@gmail.com)
2022-09-28 09:47:55

Is there a provision to include additional MDC properties as part of OpenLineage? Or something like sparkSession.sparkContext().setLocalProperty("key", "value")

Julien Le Dem (julien@apache.org)
2022-09-29 14:30:37

*Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.

Will Johnson (will@willj.co)
2022-09-29 15:24:26

*Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

You can create a facet that could extract out the properties that you might set from within the spark session.

I don't think OpenLineage / a Spark Listener can affect the SparkSession itself so you wouldn't be able to SET the properties in the listener.

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-30 04:56:25

*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default to the Spark job group ID as the OpenLineage parent run ID if there is no parent run ID set. sc.setJobGroup("my_job_group_id", "job description goes here") sets the value in Spark via setLocalProperty(SparkContext.SPARK_JOB_GROUP_ID, group_id)

I'd like to use my_job_group_id as the OpenLineage parent run ID

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-09-30 05:01:08

*Thread Reply:* MDC is the ability to add extra key -> value pairs to a log entry without doing it within the message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?

@srutikanta hota What information would you like to include? There is a great chance we already have some fields for that. If not, it's still worth putting it in the right place, like: is this info job-specific, run-specific, or related to some of the input / output datasets?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-09-30 05:04:34

*Thread Reply:* @srutikanta hota sounds like you want to set up spark.openlineage.parentJobName spark.openlineage.parentRunId https://openlineage.io/docs/integrations/spark/
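
A short sketch of those two settings (values are illustrative) - note they are read when the listener starts, i.e. once per Spark context, which is exactly why a long-running shared context makes this awkward:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("my-job")
        .config("spark.openlineage.parentJobName", "my_job_group")
        .config("spark.openlineage.parentRunId", "d46e465b-d358-4d32-83d4-df660ff614dd")
        .getOrCreate()
    )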

srutikanta hota (srutikanta.hota@gmail.com)
2022-09-30 05:15:18

*Thread Reply:* @… we have a long-running Spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We submit jobs with a Spark job group ID; I'd like to use the group ID as the parentRunId

https://spark.apache.org/docs/1.6.1/api/R/setJobGroup.html

🤔 Maciej Obuchowski
Trevor Swan (trevor.swan@matillion.com)
2022-09-29 13:59:20

Hi team - I am from Matillion and we would like to build support for OpenLineage. Who would be best placed to move the conversation forward with my product team?

🙌 Will Johnson, Maciej Obuchowski, Francis McGregor-Macdonald
🎉 Michael Robinson
👍 Ernie Ostic
Julien Le Dem (julien@apache.org)
2022-09-29 14:22:06

*Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.

Jarek Potiuk (jarek@potiuk.com)
2022-09-29 15:58:35

cccccbctlvggfhvrcdlbbvtgeuredtbdjrdfttbnldcb

🐈 Julien Le Dem, Jakub Dardziński, Maciej Obuchowski, Paweł Leszczyński
🐈‍⬛ Julien Le Dem, Maciej Obuchowski, Paweł Leszczyński
Petr Hajek (petr.hajek@profinit.eu)
2022-09-30 02:52:51

Hi Everyone! Would anybody be interested in participating in MANTA OpenLineage connector testing? We are especially looking for an environment with a rich Airflow implementation, but we will be happy to test on any other OL producer technology. Send me a direct message for more information. Thanks, Petr

🙌 Michael Robinson, Ross Turk
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:34:45

Question about Apache Airflow that I think folks here would know, because doing a web search has failed me:

Is there a way to interact with Apache Airflow to retrieve the contents of the files in the sql directory, but NOT to run them?

(the APIs all seem to run sql, and when I search I just get “how to use the airflow API to run queries”)

Ross Turk (ross@datakin.com)
2022-09-30 14:38:34

*Thread Reply:* Is this in the context of an OpenLineage extractor?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:40:47

*Thread Reply:* Yes! I was specifically looking at the PostgresOperator

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:41:54

*Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)

Ross Turk (ross@datakin.com)
2022-09-30 14:43:08

*Thread Reply:* The extractor for the SQL operators gets the query like this: https://github.com/OpenLineage/OpenLineage/blob/45fda47d8ef29dd6d25103bb491fb8c443[…]gration/airflow/openlineage/airflow/extractors/sql_extractor.py

👍 Sheeri Cabral (Collibra)
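
Paraphrasing the linked sql_extractor.py (not verbatim): the extractor simply reads the operator's templated sql attribute - the same attribute you could read yourself without executing anything:

    from openlineage.airflow.extractors.base import BaseExtractor

    class SqlExtractor(BaseExtractor):
        @property
        def _sql(self) -> str:
            # PostgresOperator and friends keep the rendered SQL here
            return self.operator.sql
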
Ross Turk (ross@datakin.com)
2022-09-30 14:43:48

*Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:45:00

*Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907

Ross Turk (ross@datakin.com)
2022-09-30 14:47:28

*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly: https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/postgres/operators/postgres.py#L58

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:48:01

*Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.

Ross Turk (ross@datakin.com)
2022-09-30 14:48:08

*Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.

Ross Turk (ross@datakin.com)
2022-09-30 14:49:14

*Thread Reply:* I don't know enough about the airflow internals 😞

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:50:00

*Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-09-30 15:22:24

*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read the file with the SQL statement within an Airflow operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that the operators render and use it to extract additional information, like table schemas, straight from the database

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-09-30 14:36:18

(I’m also ok with a way to get the SQL that has been run - but from Airflow, not the data source - I’m looking for a db-neutral way to do this, otherwise I can just parse query logs on any specific db system)

Paul Lee (paullee@lyft.com)
2022-09-30 18:45:09

👋 are there any docs on how the listener hooks in and gets run with openlineage-airflow? trying to write some unit tests but no docs seem to exist on the flow.

Julien Le Dem (julien@apache.org)
2022-09-30 19:06:47

*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443 https://docs.google.com/document/d/1L3xfdlWVUrdnFXng1Di4nMQYQtzMfhvvWDR9K4wXnDU/edit

👀 Paul Lee
Paul Lee (paullee@lyft.com)
2022-09-30 19:18:47

*Thread Reply:* amazing thank you I will take a look

Michael Robinson (michael.robinson@astronomer.io)
2022-10-03 11:32:52

@channel Hello everyone, I’m opening up a vote on releasing OpenLineage 0.15.0, including • an improved development experience in the Airflow integration • updated proposal and integration templates • a change to the BigQuery client in the Airflow integration • plus bug fixes across the project. 3 +1s from committers will authorize an immediate release. For all the commits, see: https://github.com/OpenLineage/OpenLineage/compare/0.14.0...HEAD. Note: this will be the last release to support Airflow 1.x! Thanks!

🎉 Paul Lee, Howard Yoo, Minkyu Park, Michael Collado, Paweł Leszczyński, Maciej Obuchowski, Harel Shein
👍 Michael Collado, Julien Le Dem, Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-03 11:33:30

*Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0

👍 Jakub Dardziński, Paul Lee
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-03 11:37:03

*Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x

Michael Robinson (michael.robinson@astronomer.io)
2022-10-03 11:37:07

*Thread Reply:* just caught this myself. I’ll make the change

Paul Lee (paullee@lyft.com)
2022-10-03 11:40:33

*Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-03 11:49:47

*Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?

We want to remove the 1.10 integration because, for multiple PRs, maintaining compatibility with it takes a lot of time; the code is littered with checks like this: if parse_version(AIRFLOW_VERSION) >= parse_version("2.0.0"):

👍 Paul Lee
Paul Lee (paullee@lyft.com)
2022-10-03 12:03:40

*Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.

Michael Robinson (michael.robinson@astronomer.io)
2022-10-04 09:39:11

*Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.

Paul Lee (paullee@lyft.com)
2022-10-03 17:56:08

👋 what would be a possible reason for the built-in airflow backend being utilized instead of a custom wrapper over airflow.lineage.Backend? double checked the [lineage] key in our airflow.cfg

there don't seem to be any errors being thrown, and the object loads 🤔

Paul Lee (paullee@lyft.com)
2022-10-03 17:56:36

*Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-03 18:03:03

*Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984

Consensus of Airflow maintainers wasn't positive about changing this interface, so we went with another direction: https://github.com/apache/airflow/pull/20443

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-03 18:06:58

*Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91

Paul Lee (paullee@lyft.com)
2022-10-03 18:30:32

*Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something

Paul Lee (paullee@lyft.com)
2022-10-03 18:30:42

*Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it

Paul Lee (paullee@lyft.com)
2022-10-03 18:31:13

*Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.

Paul Lee (paullee@lyft.com)
2022-10-03 20:11:15

*Thread Reply:* @Maciej Obuchowski ok, after checking: we are emitting events with our custom backend, but the odd thing is that an attempt is always made with the openlineage backend as well. is there something obvious i am perhaps missing 🤔

it ends up with requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url immediately after task start, but by the end, on task success/failure, it emits the events with our custom backend - both RunState.COMPLETE and RunState.START - into our own pipeline.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-04 06:19:06

*Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is that the OpenLineagePlugin automatically registers via the setup.py entrypoint: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-04 06:23:48

*Thread Reply:* I think if you want to extend it with proprietary code there are two good options.

First, if your code only needs to touch the HTTP client side - which I guess is the case due to the 401 error - then you can create a custom Transport.

Second, you can fork the OL code and create your own package, without the entrypoint script, or with adding your own if you decide to extend OpenLineagePlugin instead of LineageBackend

👍 Paul Lee
Paul Lee (paullee@lyft.com)
2022-10-04 14:23:33

*Thread Reply:* amazing thank you for your help. i will take a look

Paul Lee (paullee@lyft.com)
2022-10-04 14:49:47

*Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it?

we're trying not to fork and instead opt for extending.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 04:55:05

*Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70

🙏 Paul Lee
:gratitude_thank_you: Paul Lee
Paul Lee (paullee@lyft.com)
2022-10-05 13:29:24

*Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path

Paul Lee (paullee@lyft.com)
2022-10-05 13:30:04

*Thread Reply:* because from the docs i can't tell, except for the file i'm supposed to copy and implement.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 14:18:19

*Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 14:20:48

*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement a from_dict method
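
Putting those two requirements together, a sketch under the 2022-era client API (double-check the base classes against your installed version; the module path my_company.transport and the config keys are hypothetical):

    from openlineage.client.serde import Serde
    from openlineage.client.transport import Config, Transport

    class PipelineConfig(Config):
        def __init__(self, endpoint: str):
            self.endpoint = endpoint

        @classmethod
        def from_dict(cls, params: dict) -> "PipelineConfig":
            return cls(endpoint=params["endpoint"])

    class PipelineTransport(Transport):
        kind = "pipeline"
        config = PipelineConfig

        def __init__(self, config: PipelineConfig):
            self.endpoint = config.endpoint

        def emit(self, event) -> None:
            payload = Serde.to_json(event)
            ...  # ship payload to your own pipeline here

    # openlineage.yml then registers it by full import path:
    # transport:
    #   type: my_company.transport.PipelineTransport
    #   endpoint: https://example.internal/lineage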

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 14:20:56
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 14:21:09

*Thread Reply:* and I know we need to document this better 🙂

🙏 Paul Lee
Paul Lee (paullee@lyft.com)
2022-10-05 15:35:31

*Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-05 15:50:29

*Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂

🎉 Paul Lee
Michael Robinson (michael.robinson@astronomer.io)
2022-10-04 12:39:59

@channel Hi everyone, opening a vote on a release (0.15.1) to add #1131 to fix the release process on CI. 3 +1s from committers will authorize an immediate release. Thanks. More details are here: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md

✅ Michael Collado, Maciej Obuchowski, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2022-10-04 14:25:49

*Thread Reply:* Thanks, all. The release is authorized.

Michael Robinson (michael.robinson@astronomer.io)
2022-10-05 10:46:46

@channel OpenLineage 0.15.1 is now available! We added: • Airflow: improve development experience #1101 @JDarDagran • Documentation: update issue templates for proposal & add new integration template #1116 @rossturk • Spark: add description for URL parameters in readme, change overwriteName to appName #1130 @tnazarew We changed: • Airflow: lazy load BigQuery client #1119 @mobuchowski Many bug fixes were also included in this release. Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.15.1 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.14.1...0.15.1 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

🙌 Maciej Obuchowski, Jakub Dardziński, Howard Yoo, Harel Shein, Paul Lee, Paweł Leszczyński
🎉 Howard Yoo, Harel Shein, Paul Lee
Michael Robinson (michael.robinson@astronomer.io)
2022-10-06 07:35:00

Is there a topic you think the community should discuss at the next OpenLineage TSC meeting? Reply or DM with your item, and we’ll add it to the agenda.

🌟 Paul Lee
Paul Lee (paullee@lyft.com)
2022-10-06 13:29:30

*Thread Reply:* would love to add improvement in docs :) for newcomers

👏 Jakub Dardziński
Paul Lee (paullee@lyft.com)
2022-10-06 13:31:07

*Thread Reply:* also, what’s TSC?

Michael Robinson (michael.robinson@astronomer.io)
2022-10-06 15:20:23

*Thread Reply:* Technical Steering Committee, but it’s open to everyone

👍 Paul Lee
Michael Robinson (michael.robinson@astronomer.io)
2022-10-06 15:20:45

*Thread Reply:* and we encourage newcomers to attend

Paul Lee (paullee@lyft.com)
2022-10-06 13:49:00

has anyone seen their COMPLETE/FAILED listeners not firing on Airflow 2.3.4 but START events do emit? using openlineage-airflow 0.14.1

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-10-06 14:39:27

*Thread Reply:* is there any error/warn message logged maybe?

Paul Lee (paullee@lyft.com)
2022-10-06 14:40:53

*Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.

but on SUCCESS nothing fires.

Paul Lee (paullee@lyft.com)
2022-10-06 14:41:21

*Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-10-06 16:37:54

*Thread Reply:* uhm, any chance you're experiencing this with custom extractors?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-10-06 16:38:13

*Thread Reply:* I'd be happy to jump on a quick call if you wish

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-10-06 16:38:40

*Thread Reply:* but in more EU friendly hours 🙂

Paul Lee (paullee@lyft.com)
2022-10-07 16:19:47

*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.

Michael Robinson (michael.robinson@astronomer.io)
2022-10-06 15:23:27

@channel The next OpenLineage Technical Steering Committee meeting is on Thursday, October 13 at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom All are welcome! Agenda:

  1. Announcements
  2. Recent Release 0.15.1
  3. Project roadmap review
  4. Open discussion
Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
🙌 Paul Lee, Harel Shein
Srinivasa Raghavan (gsrinir@gmail.com)
2022-10-07 06:52:42

hello all. I am trying to run the airflow example from here I changed the Marquez web port from 5000 to 15000 but when I start the docker images, it seems to always default to port 5000 and therefore when I go to localhost:3000, the jobs don't load up as they are not able to connect to the marquez app running in 15000. I've overriden the values in docker-compose.yml and in openLineage.env but it seems to be picking up the 5000 value from some other location. This is what I see in the logs. Any pointers on this or please redirect me to the appropriate channel. Thanks! INFO [2022-10-07 10:48:58,022] org.eclipse.jetty.server.AbstractConnector: Started application@782fd504{HTTP/1.1, (http/1.1)}{0.0.0.0:5000} INFO [2022-10-07 10:48:58,034] org.eclipse.jetty.server.AbstractConnector: Started admin@1537c744{HTTP/1.1, (http/1.1)}{0.0.0.0:5001}

👀 Maciej Obuchowski
Srinivasa Raghavan (gsrinir@gmail.com)
2022-10-20 05:11:09

*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but at least I learned that on my Mac, the process holding up port 5000 can be freed by following the simple step below.

Hanna Moazam (hannamoazam@microsoft.com)
2022-10-10 18:00:17

Hi #general - @Will Johnson and I are working on adding support for Snowflake to OL, and as we were going to specify the package under the compileOnly dependencies in gradle, we had some doubts looking at the existing dependencies. Taking BigQuery as an example - we see it's included as a dependency in both the shared build.gradle file and the app build.gradle file. We're a bit confused about the following:

  1. Why do we need to have the bigQuery package in shared's dependencies? app of course contains the BigQueryNodeVisitor, but we couldn't spot where it's being used within shared.
  2. For all the dependencies in the shared gradle file, the versions for Scala and Spark are fixed (Scala 2.11, Spark 2.4.8), whereas for app, the versionsMap allows for different combinations of Spark and Scala versions. Why is this so?
  3. How do the dependencies between app and shared interact? Does one or the other take precedence for which version of the bigQuery connector is compiled?

We'd appreciate any guidance - thank you in advance!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-10-11 03:47:31

*Thread Reply:* Hi @Hanna Moazam,

Within the recent PR https://github.com/OpenLineage/OpenLineage/pull/1111, I removed the BigQuery dependencies from the spark2, spark32 and spark3 subprojects. It has to stay in shared because of BigQueryNodeVisitor. The usage of BigQueryNodeVisitor is tricky, as we never know if the bigquery classes are available at runtime or not. The check is done in io.openlineage.spark.agent.lifecycle.BaseVisitorFactory if (BigQueryNodeVisitor.hasBigQueryClasses()) { list.add(new BigQueryNodeVisitor(context, factory)); } Regarding point 2, there were some Spark versions which allowed two Scala versions (2.11 and 2.12); then it makes sense to make it configurable. On the other hand, for Spark 3.2 we only support 2.12, which is hardcoded in build.gradle.

The idea of the app project is: let's create a separate project to aggregate all the dependencies and run integration tests on it. Subprojects spark2, spark3, etc. depend on shared. Putting integration tests in shared would create an additional opposite-way dependency, which we wanted to avoid.

Will Johnson (will@willj.co)
2022-10-11 09:20:44

*Thread Reply:* So, if we wanted to add Snowflake, we would need to:

  1. Pick a version of snowflake's spark library
  2. Pick a version of scala that we target (i.e. we are only going to support Snowflake in Spark 3.2 so scala 2.12 will be hard coded)
  3. Add the visitor code to Shared
  4. Add the dependencies to app (ONLY if there is an integration test in app?? This is the confusing part still)
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-10-12 03:51:54

*Thread Reply:* Yes. Please note that the Snowflake library will not be included in the target OpenLineage jar, so you may test it manually against multiple Snowflake library versions or even adjust the code in case of minor differences.

👍 Hanna Moazam, Will Johnson
Hanna Moazam (hannamoazam@microsoft.com)
2022-10-12 05:20:17

*Thread Reply:* Thank you Pawel!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-12 12:18:16
Hanna Moazam (hannamoazam@microsoft.com)
2022-10-12 12:26:35

*Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)

Akash r (akashrn25@gmail.com)
2022-10-11 02:04:28

Hi Community,

I was going through the code of the dbt integration with OpenLineage. Once the events have been emitted from the client code, I wanted to check the server code where the events are read and the lineage is formed. Where can I find that code?

Thanks

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-11 05:03:26

*Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez

Michael Robinson (michael.robinson@astronomer.io)
2022-10-12 11:59:55

This month’s OpenLineage TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1665084207602369

🙌 Maciej Obuchowski
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2022-10-13 12:05:17

Is there anyone in the Open Lineage community in San Diego? I’ll be there Nov 1-3 and would love to meet some of y’all in person

Paul Lee (paullee@lyft.com)
2022-10-20 13:49:39

👋 is there a way to define a base extractor to be defaulted to? for example, i'd like to have all our operators (50+) default to my custom base extractor instead of having a list of 50+ operators in get_operator_classnames

Howard Yoo (howard.yoo@astronomer.io)
2022-10-20 13:53:55

I don't think that's possible yet, as the extractor checks are based on the class name... and it wouldn't check which parent operator it's inherited from.

Paul Lee (paullee@lyft.com)
2022-10-20 14:05:38

😢 ok, i would contribute upstream but unfortunately we're still on 1.10.15. looking like we might have to hardcode for a bit.

Paul Lee (paullee@lyft.com)
2022-10-20 14:06:01

is this the correct assumption? we're still on 0.14.1 ^

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-10-20 14:33:49

If you'll move to 2.x series and OpenLineage 0.16, you could use this feature: https://github.com/OpenLineage/OpenLineage/pull/1162

Labels
integration/airflow, extractor
👍 Paul Lee
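For reference, a minimal sketch of what that default-extractor feature enables, assuming openlineage-airflow 0.16+ on Airflow 2.x. The operator, dataset names, and the exact return type (TaskMetadata here, from the extractor base module) are assumptions to be checked against the PR:

```
from airflow.models import BaseOperator
from openlineage.airflow.extractors.base import TaskMetadata
from openlineage.client.run import Dataset


class MyCustomOperator(BaseOperator):
    def execute(self, context):
        ...  # the operator's real work

    # Implementing this hook lets the default extractor pick the operator up,
    # with no need to list it in get_operator_classnames().
    def get_openlineage_facets_on_start(self) -> TaskMetadata:
        return TaskMetadata(
            name="mydag.mytask",  # illustrative
            inputs=[Dataset(namespace="postgres://db:5432", name="public.src")],
            outputs=[Dataset(namespace="postgres://db:5432", name="public.dst")],
        )
```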
Paul Lee (paullee@lyft.com)
2022-10-20 14:46:36

thanks @Maciej Obuchowski we're working on it. hoping we'll land on 2.3.4 in the coming month.

🔥 Maciej Obuchowski
Austin Poulton (austin.poulton@equalexperts.com)
2022-10-26 05:31:07

👋 Hi everyone!

👋 Jakub Dardziński, Maciej Obuchowski, Michael Robinson, Ross Turk, Willy Lulciuc, Paweł Leszczyński, Harel Shein
Harel Shein (harel.shein@gmail.com)
2022-10-26 15:22:22

*Thread Reply:* Hey @Austin Poulton, welcome! 👋

Austin Poulton (austin.poulton@equalexperts.com)
2022-10-31 06:09:41

*Thread Reply:* thanks Harel 🙂

Michael Robinson (michael.robinson@astronomer.io)
2022-11-01 09:44:18

@channel Hi everyone, I’m opening a vote to release OpenLineage 0.16.0, featuring:
• support for boolean arguments in the DefaultExtractor
• a more efficient get_connection_uri method in the Airflow integration
• a reorganized, Rust-based SQL integration (easing the addition of language interfaces in the future)
• bug fixes and more.
3 +1s from committers will authorize an immediate release. Thanks. More details are here: https://github.com/OpenLineage/OpenLineage/compare/0.15.1...HEAD

🙌 Howard Yoo, Paweł Leszczyński, Maciej Obuchowski
👍 Ross Turk, Paweł Leszczyński, Maciej Obuchowski
➕ Willy Lulciuc, Mandy Chessell, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-11-01 13:37:54

*Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.

Iftach Schonbaum (iftach.schonbaum@hunters.ai)
2022-11-02 08:45:20

Anybody with a successful use case of ingesting column-level lineage into Amundsen?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-02 09:19:43

*Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works

Harel Shein (harel.shein@gmail.com)
2022-11-02 15:54:31

*Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?

Michael Robinson (michael.robinson@astronomer.io)
2022-11-02 12:36:22

Hi everyone, you might notice Dependabot opening PRs to update dependencies now that it’s been configured and turned on (https://github.com/OpenLineage/OpenLineage/pull/1182). There will probably be a large number of PRs to start with, but this shouldn’t always be the case and we can change the tool’s behavior, as well. (Some background: this will help us earn the OSSF Silver badge for the project, which will help us advance in the LFAI.)

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2022-11-03 07:53:31

@channel I’m opening a vote to release OpenLineage 0.16.1 to fix an issue in the SQL integration. This release will also include all the commits announced for 0.16.0. 3 +1s from committers will authorize an immediate release. Thanks.

Labels
integration/sql
➕ Maciej Obuchowski, Hanna Moazam, Jakub Dardziński, Ross Turk, Paweł Leszczyński, Jarek Potiuk, Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2022-11-03 12:25:29

*Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.

Michael Robinson (michael.robinson@astronomer.io)
2022-11-03 13:46:58

@channel OpenLineage 0.16.1 is now available, featuring:
Additions:
• Airflow: add dag_run information to Airflow version run facet #1133 @fm100
• Airflow: add LoggingMixin to extractors #1149 @JDarDagran
• Airflow: add default extractor #1162 @mobuchowski
• Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
• SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
Changes:
• Airflow: move get_connection_uri as extractor’s classmethod #1169 @JDarDagran
• Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
Bug fixes and more!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.16.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.15.1...0.16.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🙌 Maciej Obuchowski, Francis McGregor-Macdonald, Eric Veleker
Phil Chen (phil@gpr.com)
2022-11-03 13:59:29

Are there any tutorials or documentation on how to create an OpenLineage connector? For example, what if we use Argo Workflows instead of Apache Airflow for orchestrating ETL jobs? How would we create an OpenLineage Argo Workflows connector? How much effort, roughly? And can people contribute such connectors to the community if they create one?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-04 06:34:27

*Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. We have somewhat of a start of a doc: https://openlineage.io/docs/development/developing/

Here we have an example of using Python OL client to emit OL events: https://openlineage.io/docs/client/python#start-docker-and-marquez

> How much efforts, roughly? I'm not familiar with Argo workflows, but usually the effort needed depends on extensibility of the underlying system. From the first look, Argo looks like it has sufficient mechanisms for that: https://argoproj.github.io/argo-workflows/executor_plugins/#examples-and-community-contributed-plugins

Then, it depends if you can get the information that you need in that plugin. Basic need is to have information from which datasets the workflow/job is reading and to which datasets it's writing.

> And can people contribute such connectors to the community if they create one? Definitely! And if you need help with anything OpenLineage feel free to write here on Slack

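As a concrete starting point, a minimal sketch of emitting an event with the Python client; the URL, namespaces, and names are placeholders. A connector is essentially code that sends such events at the right moments in the workflow's lifecycle:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g., a local Marquez

# One COMPLETE event carrying the datasets the step read and wrote
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="argo", name="my_workflow.my_step"),
    producer="https://example.com/my-argo-connector",
    inputs=[Dataset(namespace="s3://my-bucket", name="raw/events")],
    outputs=[Dataset(namespace="s3://my-bucket", name="clean/events")],
))
```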
Michael Robinson (michael.robinson@astronomer.io)
2022-11-03 17:57:37

Is there a topic you think the community should discuss at the next OpenLineage TSC meeting? Reply or DM with your item, and we’ll add it to the agenda.

Michael Robinson (michael.robinson@astronomer.io)
2022-11-03 18:03:18

@channel This month’s OpenLineage TSC meeting is next Thursday, November 10th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:

  1. Recent release overview [Michael R.]
  2. Update on LFAI & Data Foundation progress [Michael R.]
  3. Proposal: Defining “implementing OpenLineage” [Julien]
  4. Update from MANTA on their OpenLineage integration [Eric and/or Petr from MANTA]
  5. Linking CMF (a common ML metadata framework) and OpenLineage [Suparna and AnnMary from HP Enterprise]
  6. Open discussion
👍 Luca Soato, Maciej Obuchowski, Paul Lee, Willy Lulciuc
Kenton (swiple.io) (kknoxparton@gmail.com)
2022-11-08 04:47:41

Hi all 👋 I’m Kenton — a Software Engineer and founder of Swiple. I’m looking forward to working with OpenLineage and its community to integrate data lineage and data observability. https://swiple.io

🙌 Maciej Obuchowski, Jakub Dardziński, Michael Robinson, Ross Turk, John Thomas, Julien Le Dem, Willy Lulciuc, Varun Singh
Ross Turk (ross@datakin.com)
2022-11-08 10:22:15

*Thread Reply:* Welcome Kenton! Happy to help 👍

👍 Kenton (swiple.io)
Deepika Prabha (deepikaprabha@gmail.com)
2022-11-08 05:35:03

Hi everyone, we want to pass some dynamic metadata from a Spark job that we can pick up in the OpenLineage event and use for processing. So far I have only seen the few OpenLineage conf parameters that can be sent via the Spark conf. Is there any other option for sending information dynamically from Spark jobs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-08 10:06:10

*Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration

Deepika Prabha (deepikaprabha@gmail.com)
2022-11-09 00:35:29

*Thread Reply:* Yes, we want to add information like a user/job description that we can use later, along with the rest of the OpenLineage event fields, in our system

Deepika Prabha (deepikaprabha@gmail.com)
2022-11-09 00:41:35

*Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.

Comments
8
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-09 05:14:50

*Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it

👍 Will Johnson, Deepika Prabha
Varun Singh (varuntestaz@outlook.com)
2022-11-14 03:28:35

*Thread Reply:* @Maciej Obuchowski Do you mean adding something like spark.openlineage.jobFacet.FacetName.Key=Value to the spark conf should add a new job facet like "FacetName": { "Key": "Value" }

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-14 05:56:02

*Thread Reply:* We can argue about the name of that key, but yes, something like that. Just note that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets
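For illustration, a sketch of how this could look from PySpark. The spark.extraListeners, spark.openlineage.host, and spark.openlineage.namespace options exist today; the jobFacet key is the proposal from this thread, not an implemented option:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-example")
    # existing options: register the listener and point it at a backend
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my_namespace")
    # proposed (hypothetical) option from this thread; would surface as a
    # job facet like "jobInfo": {"ownerTeam": "data-platform"}
    .config("spark.openlineage.jobFacet.jobInfo.ownerTeam", "data-platform")
    .getOrCreate()
)
```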

slackbot
2022-11-09 11:15:49

This message was deleted.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-11-10 02:22:18

*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark where you using? Are you able to copy lineage event here?

Michael Robinson (michael.robinson@astronomer.io)
2022-11-09 12:31:10

@channel This month’s TSC meeting is tomorrow at 10 am PT! https://openlineage.slack.com/archives/C01CK9T7HKR/p1667512998061829

💥 Willy Lulciuc, Maciej Obuchowski
Hanna Moazam (hannamoazam@microsoft.com)
2022-11-11 11:32:54

Hi #general, quick question: do we plan to disable spark 2 support in the near future?

Longer question: I've recently made a PR (https://github.com/OpenLineage/OpenLineage/pull/1231) to support capturing lineage from Snowflake, but it fails at a specific integration test due to what we think is a dependency mismatch for guava. I've tried to exclude any transitive dependencies which may cause the problem, but no luck with that so far.

Just wondering if:

  1. It makes sense to spend more time trying to ensure that test passes? Especially if we plan to remove spark 2 support soon.
  2. Assuming we do want to make sure to pass the test, does anyone have any other ideas for where to look/modify to prevent the error? Here's the test failure message: ```io.openlineage.spark.agent.lifecycle.LibraryTest testRdd(SparkSession) FAILED (16s)

java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.&lt;init&gt;()V from class org.apache.hadoop.mapred.FileInputFormat
    at io.openlineage.spark.agent.lifecycle.LibraryTest.testRdd(LibraryTest.java:113)```
Thanks in advance!

Labels
documentation, integration/spark, spec
Comments
4
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-11 16:28:07

*Thread Reply:* What if we just not include it in the BaseVisitorFactory but only in the Spark3 visitor factories?

Paul Lee (paullee@lyft.com)
2022-11-11 14:52:19

quick question: how do i get the &lt;&lt;non-serializable Time... value to show in the extraction? or really any object that gets passed in.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-11 16:24:30

*Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51

:gratitude_thank_you: Paul Lee
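The client's serde module is the place to look; as a generic illustration (not the client's actual code), a str() fallback passed to json.dumps is one way otherwise non-serializable values can be made to show up:

```
import json
from datetime import datetime


def default(obj):
    # anything json can't handle natively is rendered via its string form
    return str(obj)


print(json.dumps({"ts": datetime(2022, 11, 11)}, default=default))
# -> {"ts": "2022-11-11 00:00:00"}
```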
srutikanta hota (srutikanta.hota@gmail.com)
2022-11-14 01:12:45

Is there a way I can update the dataset description and the column descriptions while generating the OpenLineage Spark events?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-11-15 02:09:25

*Thread Reply:* I don’t think this is possible at the moment.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-11-15 15:47:49

Hey all, I'd like to ask for a release for OpenLineage. #1256 fixes bug in DefaultExtractor. This blocks people from migrating code from custom extractors to get_openlineage_facets methods.

➕ Michael Robinson, Howard Yoo, Maciej Obuchowski, Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2022-11-16 09:13:17

*Thread Reply:* Thanks, all. The release is authorized.

Michael Robinson (michael.robinson@astronomer.io)
2022-11-16 10:41:07

*Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306

Varun Singh (varuntestaz@outlook.com)
2022-11-16 03:34:01

Hi, small question: Is it possible to disable the /api/{version}/lineage suffix that gets added to every url automatically? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-16 12:27:12

*Thread Reply:* I think we had similar request before, but nothing was implemented.

👍 Varun Singh
Michael Robinson (michael.robinson@astronomer.io)
2022-11-16 12:23:54

@channel OpenLineage 0.17.0 is now available, featuring: Additions: • Spark: support latest Spark 3.3.1 #1183 @pawel-big-lebowski • Spark: add Kinesis Transport and support config Kinesis in Spark integration #1200 @yogyang • Spark: disable specified facets #1271 @pawel-big-lebowski • Python: add facets implementation to Python client #1233 @pawel-big-lebowski • SQL: add Rust parser interface #1172 @StarostaGit @mobuchowski • Proxy: add helm chart for the proxy backend #1068 @wslulciuc • Spec: include possible facets usage in spec #1249 @pawel-big-lebowski • Website: publish YML version of spec to website #1300 @rossturk • Docs: update language on nominating new committers #1270 @rossturk Changes: • Website: publish spec into new website repo location #1295 @rossturk • Airflow: change how pip installs packages in tox environments #1302 @JDarDagran Removals: • Deprecate HttpTransport.Builder in favor of HttpConfig #1287 @collado-mike Bug fixes and more! Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.17.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.16.1...0.17.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

🙌 Howard Yoo, Maciej Obuchowski, Ross Turk, Aphra Bloomfield, Harel Shein, Kengo Seki, Paweł Leszczyński, pankaj koti, Varun Singh
Diego Cesar (dcesar@krakenrobotik.de)
2022-11-18 05:40:53

Hi everyone,

I'm trying to get the lineage of a dataset per version. I initially had something like

Dataset A -> Dataset B -> DataSet C (version 1)

then:

Dataset D -> Dataset E -> DataSet C (version 2)

I can get the graph for version 2 without problems, but I'm wondering if there's any way to retrieve the entire graph for DataSet C version 1.

Thanks

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-22 13:40:44

*Thread Reply:* It's kind of a hard problem on the UI side. The backend can express that relationship

Diego Cesar (dcesar@krakenrobotik.de)
2022-11-22 13:48:58

*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with the dataset in the node ID, e.g., nodeId=dataset:my_dataset . Where could I specify the version of my dataset?

Paul Lee (paullee@lyft.com)
2022-11-18 17:55:24

👋 how do we get the actual values from macros? e.g. a schema name is passed in with {{params.table_name}} and that's what shows in lineage instead of the actual table name

Jakub Dardziński (jakub.dardzinski@getindata.com)
2022-11-19 04:54:13

*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code or, preferably, logs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-22 13:40:11

*Thread Reply:* If you're on 1.10 then I think it won't work

Paul Lee (paullee@lyft.com)
2022-11-28 12:50:39

*Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.

cc. @Eli Schachar @Allison Suarez

Paul Lee (paullee@lyft.com)
2022-11-28 12:50:49

*Thread Reply:* is there no workaround we can make work?

Paul Lee (paullee@lyft.com)
2022-11-28 12:51:01

*Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?

Varun Singh (varuntestaz@outlook.com)
2022-11-21 07:07:10

Hey, quick question: I see there is Kafka transport in the java client, but it's not supported in the spark integration, right?

srutikanta hota (srutikanta.hota@gmail.com)
2022-11-22 13:03:41

How can we auto-instrument a dataset owner at the Java agent level? Is there any Spark property available?

srutikanta hota (srutikanta.hota@gmail.com)
2022-11-22 16:47:37

Is there a way to capture this information when we run a job whose business day is yesterday? For example, running yesterday's missed job today, or processing Friday's file on Monday because we received the file late from the vendor, etc.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-11-22 18:45:48

*Thread Reply:* I think that's what NominalTimeFacet covers

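A sketch of how that could be attached with the Python client; the facet key and timestamps are illustrative:

```
from uuid import uuid4

from openlineage.client.facet import NominalTimeRunFacet
from openlineage.client.run import Run

# the run executes today, but nominally covers yesterday's business day
run = Run(
    runId=str(uuid4()),
    facets={
        "nominalTime": NominalTimeRunFacet(
            nominalStartTime="2022-11-21T00:00:00Z",
            nominalEndTime="2022-11-22T00:00:00Z",
        )
    },
)
```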
Rahul Sharma (panditrahul151197@gmail.com)
2022-11-24 09:15:45

Hello team, I want to set up data lineage using Airflow, but I can't understand how from the docs. Please let me know if someone has clearer docs.

Harel Shein (harel.shein@gmail.com)
2022-11-28 10:29:58

*Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 10:30:14

*Thread Reply:* i am using airflow 2.x

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 10:30:27

*Thread Reply:* can we connect if you have time ?

Harel Shein (harel.shein@gmail.com)
2022-11-28 11:11:58

*Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:12:22

*Thread Reply:* yes

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:12:36

*Thread Reply:* i already set configuration in airflow.cfg file

Harel Shein (harel.shein@gmail.com)
2022-11-28 11:12:57

*Thread Reply:* where are you sending the events to?

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:13:24

*Thread Reply:* i have a docker machine on which marquez is working

Harel Shein (harel.shein@gmail.com)
2022-11-28 11:13:47

*Thread Reply:* so, what is the issue you are seeing?

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:15:37

*Thread Reply:* there is no error

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:16:01

*Thread Reply:* ```
[lineage]
# what lineage backend to use
backend = openlineage.lineage_backend.OpenLineageBackend

MARQUEZ_URL=http://10.36.37.178:3000
MARQUEZ_NAMESPACE=airflow

MARQUEZ_BACKEND=HTTP
MARQUEZ_URL=http://10.36.37.178:5000
MARQUEZ_API_KEY=[YOUR_API_KEY]
MARQUEZ_NAMESPACE=airflow
```

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:16:09

*Thread Reply:* I have set the above config

Rahul Sharma (panditrahul151197@gmail.com)
2022-11-28 11:16:22

*Thread Reply:* please let me know if there is anything else I need to do

Mohamed Nabil H (m.nabil.hafez@gmail.com)
2022-11-24 14:02:27

Hey, I wonder if somebody can link me to the lineage (table lineage) event schema?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-11-25 02:20:40

*Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/

Murali Krishna (vmurali.krishnaraju@genpact.com)
2022-11-30 02:34:51

Hello team, I am from the Genpact Data Analytics team; we are looking for a demo of your product

Conor Beverland (conorbev@gmail.com)
2022-11-30 14:10:10

*Thread Reply:* hey, I'll DM you.

Michael Robinson (michael.robinson@astronomer.io)
2022-12-01 15:00:28

Hello all, I’m calling for a vote on releasing OpenLineage 0.18.0, including:
• improvements to the Spark integration,
• extractors for Sagemaker operators and SFTPOperator in the Airflow integration,
• a change to the Databricks integration to support Databricks Runtime 11.3,
• new governance docs,
• bug fixes,
• and more.
Three +1s from committers will authorize an immediate release.

➕ Maciej Obuchowski, Will Johnson, Bramha Aelem
Michael Robinson (michael.robinson@astronomer.io)
2022-12-06 13:56:17

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

Michael Robinson (michael.robinson@astronomer.io)
2022-12-01 15:11:10

@channel This month’s OpenLineage TSC meeting is next Thursday, December 8th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:

  1. an overview of the new Rust implementation of the SQL integration
  2. a presentation/discussion of what it actually means to “implement” OpenLineage
  3. open discussion.
Scott Anderson (scott.anderson@alteryx.com)
2022-12-02 13:57:07

Hello everyone! General question here, aside from ‘consumer’ orgs/integrations (dbt/dagster/manta), is anyone aware of any enterprise organizations that are leveraging OpenLineage today? Example lighthouse brands?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-12-02 15:21:20

*Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/

🙌 Will Johnson
Will Johnson (will@willj.co)
2022-12-05 13:54:06

*Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.

That means we have thousands of companies having experimented with OpenLineage and Microsoft Purview.

We can't name any customers at this point unfortunately.

🎉 Conor Beverland, Kengo Seki
👍 Scott Anderson
Michael Robinson (michael.robinson@astronomer.io)
2022-12-07 12:03:06

@channel This month’s TSC meeting is tomorrow at 10 am PT. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1669925470878699

Will Johnson (will@willj.co)
2022-12-07 14:22:58

*Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?

👍 Michael Robinson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2022-12-07 14:25:12

*Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞

😭 Will Johnson
Julien Le Dem (julien@apache.org)
2022-12-08 13:03:08

*Thread Reply:* This is starting now. CC @Will Johnson

Julien Le Dem (julien@apache.org)
2022-12-09 19:24:15

*Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions

Julien Le Dem (julien@apache.org)
2022-12-09 19:24:30

*Thread Reply:* feel free to follow up here as well

Michael Collado (collado.mike@gmail.com)
2022-12-09 19:39:37

*Thread Reply:* ascii art to the rescue! (top “depends on” bottom)

               app
              /   \
             / / \ \
            / /   \ \
           / /     \ \
          / /       \ \
         / |         | \
        /  |         |  \
       /   |         |   \
      /    |         |    \
     /     |         |     \
    /      |         |      \
   /       |         |       \
spark2   spark3   spark32   spark33
   \        |        |       /
    \       |        |      /
     \      |        |     /
      \     |        |    /
       \    |        |   /
        \   |        |  /
         \  |        | /
          \ |       / /
           \ \     / /
            \ \   / /
             \ \ / /
              \   /
               \ /
             shared
Julien Le Dem (julien@apache.org)
2022-12-09 19:40:05

*Thread Reply:* 😍

Michael Collado (collado.mike@gmail.com)
2022-12-09 19:41:13

*Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 05:18:53

*Thread Reply:* Hi, is there a recording for this meeting?

Christian Lundgren (christian@lunit.io)
2022-12-07 20:33:19

Hi! I have a basic question about the naming conventions for blob storage. The spec is not totally clear to me. Is the convention to use (1) namespace=bucket name=bucket+path or (2) namespace=bucket name=path?

Julien Le Dem (julien@apache.org)
2022-12-07 22:05:25

*Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?
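For example (bucket and path made up), a GCS object would map to a dataset roughly like this with the Python client; check the naming doc for the exact slash conventions:

```
from openlineage.client.run import Dataset

# gs://my-bucket/warehouse/orders: namespace is the bucket, name is the path
ds = Dataset(namespace="gs://my-bucket", name="warehouse/orders")
```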

Christian Lundgren (christian@lunit.io)
2022-12-07 23:13:41

*Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.

Julien Le Dem (julien@apache.org)
2022-12-07 23:34:33

*Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer

Michael Robinson (michael.robinson@astronomer.io)
2022-12-08 11:44:49

@channel OpenLineage 0.18.0 is available now, featuring:
• Airflow: support SQLExecuteQueryOperator #1379 @JDarDagran
• Airflow: introduce a new extractor for SFTPOperator #1263 @sekikn
• Airflow: add Sagemaker extractors #1136 @fhoda
• Airflow: add S3 extractor for Airflow operators #1166 @fhoda
• Spec: add spec file for ExternalQueryRunFacet #1262 @howardyoo
• Docs: add a TSC doc #1303 @merobi-hub
• Plus bug fixes.
Thanks to all our contributors, including new contributor @Faisal Hoda!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.18.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.17.0...0.18.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🚀 Willy Lulciuc, Minkyu Park, Kengo Seki, Enrico Rotundo, Faisal Hoda
🙌 Howard Yoo, Minkyu Park, Kengo Seki, Enrico Rotundo, Faisal Hoda
srutikanta hota (srutikanta.hota@gmail.com)
2022-12-09 01:42:59

1) Is there a specification to capture dataset dependency, e.g., ds1 is dependent on ds2?

Ross Turk (ross@datakin.com)
2022-12-09 11:51:16

*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.

srutikanta hota (srutikanta.hota@gmail.com)
2022-12-11 09:01:19

*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specification? E.g., to define that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets?

Ross Turk (ross@datakin.com)
2022-12-13 15:24:20

*Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.

Ross Turk (ross@datakin.com)
2022-12-13 15:25:12

*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 05:18:10

Hi everyone, I'd like to use openlineage to capture column level lineage for spark. I would also like to capture a few custom environment variables along with the column lineage. May I know how this can be done? Thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-12-14 09:56:22

*Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark

❤️ Ricardo Gaspar
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 10:05:54

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 10:06:08

*Thread Reply:* I am already able to capture column lineage

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 10:06:33

*Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-12-14 11:22:59

*Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-12-14 11:24:07

*Thread Reply:* but it is only used at the moment to capture some databricks environment attributes

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-12-14 11:28:29

*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.

you can also have a look at the extending section of the Spark integration docs (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending) and create a class that adds a run facet builder according to your needs.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2022-12-14 11:29:28

*Thread Reply:* the third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems to be a really cool feature.

👍 Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-14 21:49:19

*Thread Reply:* That is great! Thank you so much! This really helps!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-15 01:44:42

*Thread Reply:*
List<String> dbPropertiesKeys =
    Arrays.asList(
        "orgId",
        "spark.databricks.clusterUsageTags.clusterOwnerOrgId",
        "spark.databricks.notebook.path",
        "spark.databricks.job.type",
        "spark.databricks.job.id",
        "spark.databricks.job.runId",
        "user",
        "userId",
        "spark.databricks.clusterUsageTags.clusterName",
        "spark.databricks.clusterUsageTags.azureSubscriptionId");
dbPropertiesKeys.stream()
    .forEach(
        (p) -> {
          dbProperties.put(p, jobStart.properties().getProperty(p));
        });
It seems like it is obtaining this env variable information from the jobStart object, but not capturing it from the environment directly?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2022-12-15 01:57:05

*Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419

Labels
proposal
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-01 02:24:39

*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks! https://github.com/OpenLineage/OpenLineage/pull/1545

Labels
integration/spark
Comments
1
👀 Maciej Obuchowski, Ross Turk
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-02 11:45:52

*Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-06 03:06:42

*Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-06 03:06:49

*Thread Reply:* @Maciej Obuchowski ^ 🙂

👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-06 09:09:34

*Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?

23/02/06 12:18:39 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.spark.agent.EventEmitter.<init>(EventEmitter.java:39)
    at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:276)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:80)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1433)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-07 04:17:02

*Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-07 04:18:46

*Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose

👍 Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-07 04:24:38

*Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-07 07:20:11

*Thread Reply:* tests failing isn't expected behaviour 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-08 03:37:23

*Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-08 03:47:22

*Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-08 03:49:35

*Thread Reply:* I have pushed just now

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-08 03:49:39

*Thread Reply:* You can run the tests

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-08 04:13:07

*Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.

👍 Maciej Obuchowski
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-10 00:47:04

*Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...

Bramha Aelem (bramhaaelem@gmail.com)
2022-12-15 17:14:02

Hi all - I am sending lineage from ADF for each activity I perform, and the individual activities are represented correctly. How can I represent task1 as a parent of task2? Can someone please share a sample JSON request for it.

Ross Turk (ross@datakin.com)
2022-12-16 13:29:44

*Thread Reply:* Hi 👋 this would require a series of JSON calls:

  1. start the first task
  2. end the first task, specify output dataset
  3. start the second task, specify input dataset
  4. end the second task
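A sketch of those four calls with the Python client; the URL, namespaces, and dataset are placeholders. The shared dataset is what links task1 to task2 in the graph:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
PRODUCER = "https://example.com/adf-lineage-bridge"  # illustrative
shared = Dataset(namespace="abfss://container@account", name="/lookup/data")


def emit(job_name, state, run_id, inputs=None, outputs=None):
    client.emit(RunEvent(
        eventType=state,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=run_id),
        job=Job(namespace="adf", name=job_name),
        producer=PRODUCER,
        inputs=inputs or [],
        outputs=outputs or [],
    ))


run1, run2 = str(uuid4()), str(uuid4())
emit("pipeline.task1", RunState.START, run1)                       # 1. start task1
emit("pipeline.task1", RunState.COMPLETE, run1, outputs=[shared])  # 2. end task1, output
emit("pipeline.task2", RunState.START, run2, inputs=[shared])      # 3. start task2, input
emit("pipeline.task2", RunState.COMPLETE, run2)                    # 4. end task2
```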
Ross Turk (ross@datakin.com)
2022-12-16 13:32:08

*Thread Reply:* in OpenLineage, relationships are typically Job -> Dataset -> Job, so:
• you create a relationship between datasets by referring to them in the same job - i.e., this task ran, reading from these datasets and writing to those datasets
• you create a relationship between tasks by referring to the same datasets across both of them - i.e., this task wrote that dataset and this other task read from it

Ross Turk (ross@datakin.com)
2022-12-16 13:35:06

*Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.

(it’s an airflow workshop, but those examples are for a part of the workshop that doesn’t involve airflow)

Ross Turk (ross@datakin.com)
2022-12-16 13:35:46

*Thread Reply:* (these can also be found in the docs)

👍 Ross Turk
Bramha Aelem (bramhaaelem@gmail.com)
2022-12-16 14:49:30

*Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it

Bramha Aelem (bramhaaelem@gmail.com)
2022-12-17 19:53:21

*Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.

👍 Ross Turk
Bramha Aelem (bramhaaelem@gmail.com)
2022-12-17 19:53:48

*Thread Reply:* @Ross Turk - Thanks for your response.

Bramha Aelem (bramhaaelem@gmail.com)
2023-01-12 13:23:56

*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS. Second activity: after the completion of the first activity, I am making an Azure Databricks call to use the lookup file and generate the output tables. How can I refer to the Databricks-generated tables' facets as an input to the subsequent activities in the pipeline? When I refer to them as an input, the Spark tables' metadata does not show up. How can this be achieved? After the execution of each activity in the ADF pipeline I am sending start and complete/fail event lineage to Marquez.

Can someone please guide me on this.

Bramha Aelem (bramhaaelem@gmail.com)
2022-12-15 17:19:34

I am not using Airflow in my process. Please advise.

Bramha Aelem (bramhaaelem@gmail.com)
2022-12-19 12:40:26

Hi all - good morning. How does column lineage work in OpenLineage for a data source when it is produced by different teams and jobs?

Al (Koii) (al@koii.network)
2022-12-20 14:26:57

Hey folks! I'm al from Koii.network, very happy to have heard about this project :)

👋 Willy Lulciuc, Maciej Obuchowski, Julien Le Dem
Willy Lulciuc (willy@datakin.com)
2022-12-20 14:27:59

*Thread Reply:* welcome! let us know if you have any questions

Matt Menzenski (matt@payitgov.com)
2022-12-29 08:22:26

Hello! I found the OpenLineage project today after searching for “OpenTelemetry” in the dbt Slack.

Harel Shein (harel.shein@gmail.com)
2022-12-29 10:47:00

*Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions

:gratitude_thank_you: Matt Menzenski
Max (maxime.broussard@gmail.com)
2022-12-30 05:33:40

Hi guys - I am really excited to test OpenLineage. I had a quick question; sorry if this is not the right place for it. We are testing dbt-ol with Airflow, and I was hoping this would by default push the number of rows updated/created in a dbt transformation to Marquez. It runs fine on Airflow, but when I check in Marquez there doesn't seem to be a 'dataset' created, only 'jobs' with job-level metadata. When I check here, I see that the dataset facets should have it, though: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md Does anyone know if creating a dataset & sending row counts to OL is out of the box with dbt-ol, or if I need to build another script to get that number from my Snowflake instance and push it to OL as another step in my process? Thanks a lot!

Viraj Parekh (vmpvmp94@gmail.com)
2023-01-03 13:20:14

*Thread Reply:* @Ross Turk maybe you can help with this?

Ross Turk (ross@datakin.com)
2023-01-03 13:34:23

*Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567

Ross Turk (ross@datakin.com)
2023-01-03 13:34:58

*Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.

Ross Turk (ross@datakin.com)
2023-01-03 13:35:42

*Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?

Ross Turk (ross@datakin.com)
2023-01-03 13:38:06

*Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.

Kuldeep (kuldeep.marathe@affirm.com)
2023-01-03 00:41:09

Hi All, Curious to see if there is an openlineage integration with luigi or any open source projects working on it.

Kuldeep (kuldeep.marathe@affirm.com)
2023-01-03 01:53:10

*Thread Reply:* I was looking for something similar to the airflow integration

Viraj Parekh (vmpvmp94@gmail.com)
2023-01-03 13:21:18

*Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?

Kuldeep (kuldeep.marathe@affirm.com)
2023-01-03 13:23:53

*Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi

Michael Robinson (michael.robinson@astronomer.io)
2023-01-03 11:05:48

Hello all, I’m opening a vote to release OpenLineage 0.19.0, including:
• new extractors for Trino and S3FileTransformOperator in the Airflow integration
• a new, standardized run facet in the Airflow integration
• a new NominalTimeRunFacet and OwnershipJobFacet in the Airflow integration
• Postgres support in the dbt integration
• a new client-side proxy (skeletal version)
• a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration
• a new ExtractionErrorRunFacet to reflect internal processing errors for the SQL parser
• testing improvements, bug fixes and more.
As always, three +1s from committers will authorize an immediate release. Thanks in advance!

➕ Willy Lulciuc, Maciej Obuchowski, Paweł Leszczyński, Jakub Dardziński, Julien Le Dem
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-03 23:07:59

*Thread Reply:* Hi @Michael Robinson a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration Would it be possible to have more details on what this entails please? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-04 09:21:46

*Thread Reply:* @Tomasz Nazarewicz might explain this better

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2023-01-04 10:04:22

*Thread Reply:* @Anirudh Shrinivason Until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration, because the integration had to parse all properties, create the appropriate objects, etc. The new implementation makes client properties transparent to the integration: they are only passed through, and parsing happens inside the client.

Michael Robinson (michael.robinson@astronomer.io)
2023-01-04 13:02:39

*Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-04 22:00:55

*Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!

Michael Robinson (michael.robinson@astronomer.io)
2023-01-05 10:37:09

@channel This month’s OpenLineage TSC meeting is next Thursday, January 12th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:

  1. Recent release overview @Michael Robinson
  2. Column lineage update @Maciej Obuchowski
  3. Airflow integration improvements @Jakub Dardziński
  4. Discussions: • Real-world implementation of OpenLineage (What does it really mean?) @Sheeri Cabral (Collibra) • Using namespaces @Michael Robinson
  5. Open discussion Notes: https://bit.ly/OLwiki Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-05 23:45:38

*Thread Reply:* @Michael Robinson Will there be a recording?

Michael Robinson (michael.robinson@astronomer.io)
2023-01-06 09:10:50

*Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

:gratitude_thank_you: Anirudh Shrinivason
Michael Robinson (michael.robinson@astronomer.io)
2023-01-05 13:00:01

OpenLineage 0.19.2 is available now, including:
• Airflow: add Trino extractor #1288 @sekikn
• Airflow: add S3FileTransformOperator extractor #1450 @sekikn
• Airflow: add standardized run facet #1413 @JDarDagran
• Airflow: add NominalTimeRunFacet and OwnershipJobFacet #1410 @JDarDagran
• dbt: add support for postgres datasources #1417 @julienledem
• Proxy: add client-side proxy (skeletal version) #1439 #1420 @fm100
• Proxy: add CI job to publish Docker image #1086 @wslulciuc
• SQL: add ExtractionErrorRunFacet #1442 @mobuchowski
• SQL: add column-level lineage to SQL parser #1432 #1461 @mobuchowski @StarostaGit
• Spark: pass config parameters to the OL client #1383 @tnazarew
• Plus bug fixes and testing and CI improvements.
Thanks to all the contributors, including new contributor Saurabh (@versaurabh)
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.19.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.18.0...0.19.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

❤️ Julien Le Dem, Howard Yoo, Willy Lulciuc, Maciej Obuchowski, Kengo Seki, Harel Shein, Jarek Potiuk, Varun Singh
Will Johnson (will@willj.co)
2023-01-06 01:07:18

Question on Spark Integration and External Hive Metastores

@Hanna Moazam and I are working with a team using OpenLineage that wants to extract the server name of the hive metastore they're using when writing to a Hive table through Spark.

For example, the hive metastore is an Azure SQL database and the table name is sales.transactions.

OpenLineage will give something like /usr/hive/warehouse/sales.db/transactions for the name.

However, this is not a complete picture, since sales.db/transactions is only defined like this within a given hive metastore. In Hive, you'd define the fully qualified name as sales.transactions@sqlservername.database.windows.net.

Has anyone else come across this before? If not, we plan on raising an issue and suggesting we extract out the spark.hadoop.javax.jdo.option.ConnectionURL in the DatabricksEnvironmentFacetBuilder but ideally there would be a better way of extracting this.

https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#set-up-an-external-metastore-using-the-ui

There was an issue by @Maciej Obuchowski or @Paweł Leszczyński that talked about providing a facet of the alias of a path but I can't find it at this point :(

👀 Maciej Obuchowski
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-01-09 02:28:43

*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature: https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb

For the scenario you describe, there should be a symlink facet sent similar to:
{
  "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.15.1/integration/spark",
  "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
  "identifiers": [
    {
      "namespace": "hive://metastore",
      "name": "default.some_table",
      "type": "TABLE"
    }
  ]
}
Within the OpenLineage Spark integration code, symlinks are included here: https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PathUtils.java#L75

and they are added only when the Spark catalog is hive and the metastore URI is present in the Spark conf.

➕ Maciej Obuchowski
🤯 Will Johnson
Will Johnson (will@willj.co)
2023-01-09 14:21:10

*Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!

Michael Robinson (michael.robinson@astronomer.io)
2023-01-11 12:00:02

@channel Friendly reminder: this month’s OpenLineage TSC meeting is tomorrow at 10am, and all are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1672933029317449

🙌 Maciej Obuchowski, Will Johnson, John Bagnall, AnnMary Justine, Willy Lulciuc, Minkyu Park, Paweł Leszczyński, Varun Singh
Varun Singh (varuntestaz@outlook.com)
2023-01-12 06:37:56

Hi, are there any plans to add an Azure EventHub transport similar to the Kinesis one?

Will Johnson (will@willj.co)
2023-01-12 17:31:12

*Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?

https://github.com/yogyang/OpenLineage/blob/2b7fa8bbd19a2207d54756e79aea7a542bf7bb[…]/main/java/io/openlineage/client/transports/KafkaTransport.java

https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-stream-analytics

👍 Varun Singh
Julien Le Dem (julien@apache.org)
2023-01-12 09:01:24

Following up on last month’s discussion, I created the <#C04JPTTC876|spec-compliance> channel for further discussion

Will Johnson (will@willj.co)
2023-01-12 17:43:55

*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and whether we expect to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?

Julien Le Dem (julien@apache.org)
2023-01-17 15:11:05

*Thread Reply:* I think asking in #general is the right place. If there’s a specific github issue/PR, this is a good place as well. You can tag the relevant folks as well to get their attention

Allison Suarez (asuarezmiranda@lyft.com)
2023-01-12 18:37:24

@here I am using the Spark listener, and whenever a query like INSERT OVERWRITE TABLE gets executed, it looks like I can see some outputs, but there are no symlinks for the output table. The operation type being executed is InsertIntoHadoopFsRelationCommand. I am not sure why I can see symlinks for all the input tables but not the output tables. Anyone know the reason behind this?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-01-13 02:30:37

*Thread Reply:* Hello @Allison Suarez, in the case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method: DatasetIdentifier di = PathUtils.fromURI(command.outputPath().toUri(), "file"); (https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/sr[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java)

If the dataset identifier is constructed from a path, then no symlinks are added. That's the current behaviour.

Calling io.openlineage.spark.agent.util.DatasetIdentifier#withSymlink(io.openlineage.spark.agent.util.DatasetIdentifier.Symlink) on the DatasetIdentifier in InsertIntoHadoopFsRelationVisitor could be a remedy for that.

Do you have some Spark code snippet to reproduce this issue?

Will Johnson (will@willj.co)
2023-01-22 10:04:56

*Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-13 18:18:52

*Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!

🙌 Will Johnson
Varun Singh (varuntestaz@outlook.com)
2023-01-13 11:44:19

Hi, I am trying to use the Kafka transport in Spark for sending events to an EventHub, but it requires me to set a property sasl.jaas.config which needs to have semicolons (;) in its value. This gives an error about being unable to convert an Array to a String. I think this is due to this line, which splits property values into an array if they have a semicolon: https://github.com/OpenLineage/OpenLineage/blob/92adbc877f0f4008928a420a1b8a93f394[…]pp/src/main/java/io/openlineage/spark/agent/ArgumentParser.java Does this seem like a bug, or is it intentional?

Harel Shein (harel.shein@gmail.com)
2023-01-13 14:39:51

*Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2023-01-13 15:22:19

*Thread Reply:* So we needed a generic way of passing parameters to the client, and we made the assumption that every field with ; will be treated as an array

Varun Singh (varuntestaz@outlook.com)
2023-01-14 02:00:04

*Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2023-01-14 02:28:41

*Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible

Tomasz Nazarewicz (tomasz.nazarewicz@getindata.com)
2023-01-14 02:34:06

*Thread Reply:* Maybe the condition could be having ; and [] in the value

👍 Varun Singh
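A tiny sketch of the rule being floated here; the bracket convention is the assumption, and the real parser lives in ArgumentParser:

```
# Treat a value as an array only when it is wrapped in brackets, so
# sasl.jaas.config values that merely contain ';' pass through untouched.
def parse_value(raw: str):
    if raw.startswith("[") and raw.endswith("]"):
        return [part.strip() for part in raw[1:-1].split(";") if part.strip()]
    return raw


assert parse_value("[a;b;c]") == ["a", "b", "c"]
assert parse_value("com.example.Module required username='x';") == \
    "com.example.Module required username='x';"
```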
Varun Singh (varuntestaz@outlook.com)
2023-01-15 08:14:14

*Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!

Varun Singh (varuntestaz@outlook.com)
2023-01-16 01:15:19

*Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this

Comments
2
Michael Robinson (michael.robinson@astronomer.io)
2023-01-17 12:00:02

Hi everyone, I’m excited to share some good news about our progress in the LFAI & Data Foundation: we’ve achieved Incubation status! This required us to earn a Silver Badge from the OpenSSF, get 300+ stars on GitHub (which was NBD as we have over 1100 already), and win the approval of the LFAI & Data’s TAC. Now that we’ve cleared this hurdle, we have access to additional services from the foundation, including assistance with creative work, marketing and communication support, and event-planning assistance. Graduation from the program, which will earn us a voting seat on the TAC, is on the horizon. Stay tuned for updates on our progress with the foundation.

LF AI & Data is an umbrella foundation of the Linux Foundation that supports open source innovation in artificial intelligence (AI) and data. LF AI & Data was created to support open source AI and data, and to create a sustainable open source AI and data ecosystem that makes it easy to create AI and data products and services using open source technologies. They foster collaboration under a neutral environment with an open governance in support of the harmonization and acceleration of open source technical projects.

For more info about the foundation and other LFAI & Data projects, visit their website.

❤️ Julien Le Dem, Paweł Leszczyński, Maciej Obuchowski, Ross Turk, Jakub Dardziński, Minkyu Park, Howard Yoo, Jarek Potiuk, Danilo Mota, Willy Lulciuc, Kengo Seki, Harel Shein
Ross Turk (ross@datakin.com)
2023-01-17 15:53:12

if you want to share this news (and I hope you do!) there is a blog post here: https://openlineage.io/blog/incubation-stage-lfai/

openlineage.io
Ross Turk (ross@datakin.com)
2023-01-17 15:54:07

and I'll add a quick shoutout of @Michael Robinson, who has done a whole lot of work to make this happen 🎉 thanks, man, you're awesome!

🙌 Howard Yoo, Maciej Obuchowski, Jarek Potiuk, Minkyu Park, Willy Lulciuc, Kengo Seki, Paweł Leszczyński, Varun Singh
Michael Robinson (michael.robinson@astronomer.io)
2023-01-17 15:56:38

*Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)

☑️ Willy Lulciuc, Paweł Leszczyński, Viraj Parekh
Jarek Potiuk (jarek@potiuk.com)
2023-01-17 16:03:36

Congrats @Michael Robinson and @Ross Turk - a major step for OpenLineage!

🙌 Michael Robinson, Maciej Obuchowski, Jakub Dardziński, Julien Le Dem, Ross Turk, Willy Lulciuc, Kengo Seki, Viraj Parekh, Paweł Leszczyński, Anirudh Shrinivason, Robert
Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-18 11:15:02

Hi all, I am new to https://openlineage.io/integration/dbt/. I followed the steps on a Windows laptop, but dbt-ol does not get executed.

'dbt-ol' is not recognized as an internal or external command, operable program or batch file.

I see the following Packages installed too openlineage-dbt==0.19.2 openlineage-integration-common==0.19.2 openlineage-python==0.19.2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-18 11:17:14

*Thread Reply:* What are the errors?

Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-18 11:18:09

*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command, operable program or batch file.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-19 11:11:09

*Thread Reply:* Hm, I think this is due to different Windows conventions around scripts.

Ross Turk (ross@datakin.com)
2023-01-19 14:26:35

*Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.

Ross Turk (ross@datakin.com)
2023-01-19 14:27:04

*Thread Reply:* (maybe that helps!)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-19 14:38:23

*Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-19 14:38:27

*Thread Reply:* That needs one fix though

Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-19 14:40:52

*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :( \python.exe: No module named dbt-ol

Michael Robinson (michael.robinson@astronomer.io)
2023-01-24 13:23:56

*Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-24 13:24:36

*Thread Reply:* @Michael Robinson GE issue is on Windows?

Michael Robinson (michael.robinson@astronomer.io)
2023-01-24 13:24:49

*Thread Reply:* No, not Windows

Michael Robinson (michael.robinson@astronomer.io)
2023-01-24 13:24:55

*Thread Reply:* (that I know of)

Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-24 13:46:39

*Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations

  1. Python 3.8.10 with openlineage-dbt 0.18.0 & Latest
  2. Python 3.9.7 with openlineage-dbt 0.18.0 & Latest
Ross Turk (ross@datakin.com)
2023-01-24 13:49:19

*Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.

But if I am not in a virtual environment, it installs the packages in my PYTHONPATH. You might try this to see if the dbt-ol script can be found in one of the directories in sys.path.
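(For illustration, a quick way to check from Python whether the dbt-ol script landed anywhere findable - this snippet is a hypothetical helper, not part of openlineage-dbt:)

```python
# Scan the interpreter's script directory and sys.path for the dbt-ol
# console script (pip puts console scripts in the "scripts" path,
# which is Scripts\ on Windows and bin/ on mac/linux).
import os
import sys
import sysconfig

candidates = [sysconfig.get_path("scripts")] + sys.path
for directory in candidates:
    for name in ("dbt-ol", "dbt-ol.exe"):
        path = os.path.join(directory or "", name)
        if os.path.exists(path):
            print("found:", path)
```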

Ross Turk (ross@datakin.com)
2023-01-24 13:58:38

*Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-24 13:59:42

*Thread Reply:* Again, I think this is windows issue

Ross Turk (ross@datakin.com)
2023-01-24 14:00:54

*Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?

Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-24 14:15:13

*Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.

Ross Turk (ross@datakin.com)
2023-01-24 14:16:48

*Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here

Sudhir Nune (sudhir.nune@kraftheinz.com)
2023-01-24 14:31:15

*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.

For some reason, I see only the following folders created:

  1. openlineage
  2. openlineage_dbt-0.19.2.dist-info
  3. openlineage_integration_common-0.19.2.dist-info
  4. openlineage_python-0.19.2.dist-info

and not the openlineage-dbt-0.19.2 folder, which has scripts/dbt-ol.

If it helps, I am using pip 21.2.4

Francis McGregor-Macdonald (francis@mc-mac.com)
2023-01-18 18:40:32

@Paul Villena @Stephen Said and Vishwanatha Nayak published an AWS blog Automate data lineage on Amazon MWAA with OpenLineage

👀 Ross Turk, Peter Hicks, Willy Lulciuc
🔥 Ross Turk, Willy Lulciuc, Michael Collado, Peter Hicks, Minkyu Park, Julien Le Dem, Kengo Seki, Anirudh Shrinivason, Paweł Leszczyński, Maciej Obuchowski, Harel Shein, Paul Wilson Villena
❤️ Willy Lulciuc, Minkyu Park, Julien Le Dem, Kengo Seki, Paweł Leszczyński, Viraj Parekh
Ross Turk (ross@datakin.com)
2023-01-18 18:54:57

*Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?

Willy Lulciuc (willy@datakin.com)
2023-01-18 18:55:30

*Thread Reply:* This is an amazing write up! 🔥 💯 🚀

Francis McGregor-Macdonald (francis@mc-mac.com)
2023-01-18 19:49:46

*Thread Reply:* Happy to have it promoted. 😄 Vish posted on LinkedIn: https://www.linkedin.com/posts/vishwanatha-nayak-b8462054_automate-data-lineage-on-amazon-mwaa-with-activity-7021589819763945473-yMHF?utm_source=share&utm_medium=member_ios if you want something to repost there.

❤️ Willy Lulciuc, Ross Turk
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-19 00:13:26

Hi guys, I am trying to build the openlineage jar locally for spark. I ran ./gradlew shadowJar in the /integration/spark directory. However, I am getting this issue:

** What went wrong:
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage-java:0.20.0-SNAPSHOT.
     Required by:
         project :app > project :shared
      > Could not resolve io.openlineage:openlineage-java:0.20.0-SNAPSHOT.
         > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/0.20.0-SNAPSHOT/maven-metadata.xml.
            > Could not GET 'https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/0.20.0-SNAPSHOT/maven-metadata.xml'. Received status code 401 from server: Unauthorized

It used to work a few weeks ago... May I ask if anyone would know what the reason might be? Thanks! 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-01-19 03:58:42

*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-01-19 03:59:28

*Thread Reply:* ./gradlew clean build publishToMavenLocal in /client/java should help.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-19 04:34:33

*Thread Reply:* Ahh yeap this works thanks! 🙂

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-01-19 09:17:01

Are there any resources to explain the differences between lineage with Apache Atlas vs. lineage using OpenLineage? we have discussions with customers and partners, and some of them are looking into which is more “ready for industry”.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-19 11:03:39

*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java/Apache-adjacent projects like Hive and HBase?

Ross Turk (ross@datakin.com)
2023-01-19 13:10:11

*Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!

✅ Sheeri Cabral (Collibra)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-01-20 17:00:25

*Thread Reply:* I don’t know enough about Atlas to make that doc.

Justine Boulant (justine.boulant@seenovate.com)
2023-01-19 10:43:18

Hi everyone, I am currently working on a project and we have some questions about using OpenLineage with Apache Airflow:
• How does it work: UX vs code/script? How can we implement it? A schema of its architecture, for example?
• What are the visual outputs available?
• Is the lineage done from A to Z, if there are multiple intermediary transformations, for example?
• Is the lineage done horizontally across the organization or vertically on different system levels? Or both?
• Can we upgrade it to industry level?
• Does it work with Python and/or R?
• Does it read metadata or scripts?
Thanks a lot if you can help 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-19 11:00:54

*Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs

Ross Turk (ross@datakin.com)
2023-01-19 13:10:58

*Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.

Ross Turk (ross@datakin.com)
2023-01-19 13:19:34

*Thread Reply:* More explicitly:
• Airflow is an interesting platform to observe because it runs a large variety of workloads, and lineage can only be automatically extracted for some of them
• In general, OpenLineage is essentially a standard and data model for lineage. There are integrations for various systems, including Airflow, that cause them to emit lineage events to an OpenLineage-compatible backend. It's a push model.
• Marquez is one such backend, and the one I recommend for testing & development
• There are a few approaches for lineage in Airflow:
  ◦ Extractors, which pair with Operators to extract and emit lineage
  ◦ Manual inlets/outlets on a task, defined by a developer - useful for PythonOperator and other cases where an extractor can't do it automatically (see the sketch after this message)
  ◦ Orchestration of an underlying OpenLineage integration, like openlineage-dbt
• IDK about "A to Z", that depends on your environment. The goal is to capture every transformation. Depending on your pipeline, there may be a set of integrations that give you the coverage you need. We often find that there are gaps.
• It works with Python. You can use the openlineage-python client to emit lineage events to a backend. This is useful if there isn't an integration for something your pipeline does.
• It describes the pipeline by observing running jobs and the way they affect datasets, not the organization. I don't know what you mean by "industry-level".
• I am not aware of an integration that parses source code to determine lineage at this time.
• The openlineage-dbt integration consumes the various metadata that dbt leaves behind to construct lineage. Dunno if that's what you mean by "read metadata".
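(For illustration, a minimal sketch of the manual inlets/outlets approach mentioned above - the table and task names are made up, and it assumes openlineage-airflow is installed and configured:)

```python
# Manual lineage for a PythonOperator: openlineage-airflow picks up
# Airflow lineage entities from inlets/outlets and emits them as
# input/output datasets on the task's lineage events.
from airflow.lineage.entities import Table
from airflow.operators.python import PythonOperator

def transform():
    ...  # your Python logic, opaque to automatic extraction

transform_task = PythonOperator(
    task_id="transform",
    python_callable=transform,
    inlets=[Table(database="analytics", cluster="postgres", name="raw_orders")],
    outlets=[Table(database="analytics", cluster="postgres", name="clean_orders")],
)
```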

Ross Turk (ross@datakin.com)
2023-01-19 13:23:33

*Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.

Justine Boulant (justine.boulant@seenovate.com)
2023-01-20 03:44:22

*Thread Reply:* Thanks a lot!! Very helpful!

Ross Turk (ross@datakin.com)
2023-01-20 11:42:43

*Thread Reply:* 👍

Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-01-20 15:28:06

Hey folks, my team is working on a solution that would support the OL standard with column level lineage. I'm working through the architecture now and I'm wondering if everyone uses the standard rest api backed by a db or if other teams found success using other technologies such as webhooks, streams, etc in order to capture and process lineage events. I'd be very curious to connect on the topic

Julien Le Dem (julien@apache.org)
2023-01-20 19:45:55

*Thread Reply:* Hello Brad, off the top of my head:

Julien Le Dem (julien@apache.org)
2023-01-20 19:47:15

*Thread Reply:*
• Marquez uses the HTTP API (POST). So does Astro.
• Egeria and Purview prefer consuming through a Kafka topic. There is a ProxyBackend that takes HTTP POSTs and writes to Kafka. The client can also be configured to write to Kafka

👍 Jakub Dardziński
Julien Le Dem (julien@apache.org)
2023-01-20 19:48:09

*Thread Reply:* @Will Johnson @Mandy Chessell might have opinions

Julien Le Dem (julien@apache.org)
2023-01-20 19:49:10

*Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

Julien Le Dem (julien@apache.org)
2023-01-20 19:49:47

*Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/

Will Johnson (will@willj.co)
2023-01-22 10:00:56

*Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.

However, there is a good deal of interest in using the kafka transport instead and that's our future roadmap.

👍 Ross Turk, Brad Paskewitz
Quentin Nambot (qnambot@gmail.com)
2023-01-25 09:59:13

❓ Hi everyone, I am trying to use OpenLineage with Databricks (using the 11.3 LTS runtime and OpenLineage 0.19.2). Using this documentation I managed to install OpenLineage and send events to Marquez. However, Marquez did not receive all COMPLETE events; it seems like the Databricks cluster is shut down immediately at the end of the job. It is not the first time that I have seen this with Databricks: last year I tried to use Spline, and we noticed that Databricks does not seem to wait for the Spark session to be nicely closed before shutting down instances (see this issue). My question is: has anyone faced the same issue? Does somebody know a workaround? 🙏

Michael Collado (collado.mike@gmail.com)
2023-01-25 12:04:48

*Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!

Quentin Nambot (qnambot@gmail.com)
2023-01-26 05:46:35

*Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞

Petr Hajek (petr.hajek@profinit.eu)
2023-01-25 10:12:42

Hi, would anybody be able and willing to help us configure S3 and Snowflake extractors within Airflow integration for one of our clients? Our trouble is that Airflow integration returns valid OpenLineage .json files but it lacks any information about input and output DataSets. Thanks in advance 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-01-25 10:38:03

*Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-27 15:24:47

Hello.. I am trying to play with openlineage spark integration with Kafka and currently trying to just use the config as part of the spark submit command but I run into errors. Details in the 🧵

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-27 15:25:04

*Thread Reply:* Command:

spark-submit --packages "io.openlineage:openlineage-spark:0.19.+" \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --conf "spark.openlineage.transport.type=kafka" \
  --conf "spark.openlineage.transport.topicName=topicname" \
  --conf "spark.openlineage.transport.localServerId=Kafka_server" \
  file.py

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-27 15:25:14

*Thread Reply:*
23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
	at io.openlineage.client.transports.TransportFactory.build(TransportFactory.java:44)
	at io.openlineage.spark.agent.EventEmitter.<init>(EventEmitter.java:40)
	at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:278)
	at io.openlineage.spark.agent.OpenLineageSparkListener.onApplicationStart(OpenLineageSparkListener.java:267)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:55)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-27 15:25:31

*Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-27 16:15:00

*Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 08:37:07

*Thread Reply:* 👀 Could I get some help on this, please?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 09:07:08

*Thread Reply:* I think any NullPointerException is clearly our bug - can you open an issue on the OL GitHub?

👍 Susmitha Anandarao
Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 09:30:51

*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get 23/01/30 14:28:33 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

I am trying to print to console at the moment. I haven't been able to get Kafka transport type working though.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 09:41:12

*Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 09:42:28

*Thread Reply:* I am trying to run a python file using pyspark. 23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event I see this and don't see any events on the console.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 09:55:41

*Thread Reply:* Any logs fitting the pattern log.warn("Unable to access job conf from RDD", nfe); or log.info("Found job conf from RDD {}", jc); before?

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 09:57:20

*Thread Reply:* ```23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents 23/01/30 14:40:49 WARN RddExecutionContext: Unable to access job conf from RDD java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117) at java.util.Optional.orElseThrow(Optional.java:290) at io.openlineage.spark.agent.lifecycle.RddExecutionContext.setActiveJob(RddExecutionContext.java:115) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$9(OpenLineageSparkListener.java:148) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:145) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

23/01/30 14:40:49 INFO RddExecutionContext: Found job conf from RDD Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-rbf-default.xml, hdfs-site.xml, hdfs-rbf-site.xml, resource-types.xml

23/01/30 14:40:49 INFO RddExecutionContext: Found output path null from RDD PythonRDD[5] at collect at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:48 23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event``` I see both actually.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 10:03:35

*Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 10:04:14

*Thread Reply:* and I think I might have a solution on a branch for it, just need to polish it up to release

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 10:13:37

*Thread Reply:* Aah got it. I will give it a try with SQL and a jar.

Do you have an ETA on when the Python issue will be fixed?

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 10:37:51

*Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 10:38:44

*Thread Reply:* I think that has nothing to do with python

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 10:39:16

*Thread Reply:* BTW, which Spark version are you using?

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 10:41:22

*Thread Reply:* We are on 3.3.1

👍 Maciej Obuchowski
Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-30 11:38:24

*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-01-30 11:46:30

*Thread Reply:* I think we plan to release somewhere in the next week

:gratitude_thank_you: Susmitha Anandarao
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-06 09:21:25

*Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today

:gratitude_thank_you: Susmitha Anandarao
Paul Lee (paullee@lyft.com)
2023-01-27 16:31:45

👋 what would be the reason conn_id on something like SQLCheckOperator ends up being None when OpenLineage attempts to extract metadata, but is fine on task execution?

i'm using OpenLineage for Airflow 0.14.1 on 2.3.4 and i'm getting an error about conn_id not being found. it's a SQLCheckOperator where the check runs fine, but the task fails because when OpenLineage goes to extract task metadata, it attempts to grab the conn_id, and at that moment it finds it to be None.

Ross Turk (ross@datakin.com)
2023-01-27 18:38:40

*Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.

👍 Paul Lee
Paul Lee (paullee@lyft.com)
2023-01-27 18:46:15

*Thread Reply:* @Benji Lampel any help would be appreciated!

Benji Lampel (benjamin@astronomer.io)
2023-01-30 09:01:34

*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code? Could you try this with a PostgresCheckOperator? (Also, only the SqlColumnCheckOperator and SqlTableCheckOperator will provide data quality facets in their output; those functions are not implementable in the other operators at this time)

👀 Paul Lee
Paul Lee (paullee@lyft.com)
2023-01-31 14:36:07

*Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.

[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - Traceback (most recent call last):
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self.run()
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 870, in run
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self._target(*self._args, **self._kwargs)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/listener.py", line 99, in on_running
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - task_metadata = extractor_manager.extract_metadata(dagrun, task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 28, in extract_metadata
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - extractor = self._get_extractor(task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 96, in _get_extractor
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - self.task_to_extractor.instantiate_abstract_extractors(task)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/openlineage/airflow/extractors/extractors.py", line 118, in instantiate_abstract_extractors
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - task_conn_type = BaseHook.get_connection(task.conn_id).conn_type
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/airflow/hooks/base.py", line 67, in get_connection
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - conn = Connection.get_connection_from_secrets(conn_id)
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/code/venvs/venv/lib/python3.8/site-packages/airflow/models/connection.py", line 430, in get_connection_from_secrets
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined")
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - airflow.exceptions.AirflowNotFoundException: The conn_id `None` isn't defined

Paul Lee (paullee@lyft.com)
2023-01-31 14:37:06

*Thread Reply:* and above that

[2023-01-31, 00:32:38 UTC] {connection.py:424} ERROR - Unable to retrieve connection from secrets backend (EnvironmentVariablesBackend). Checking subsequent secrets backend.
Traceback (most recent call last):
  File "/code/venvs/venv/lib/python3.8/site-packages/airflow/models/connection.py", line 420, in get_connection_from_secrets
    conn = secrets_backend.get_connection(conn_id=conn_id)
  File "/code/venvs/venv/lib/python3.8/site-packages/airflow/secrets/base_secrets.py", line 91, in get_connection
    value = self.get_conn_value(conn_id=conn_id)
  File "/code/venvs/venv/lib/python3.8/site-packages/airflow/secrets/environment_variables.py", line 48, in get_conn_value
    return os.environ.get(CONN_ENV_PREFIX + conn_id.upper())

Paul Lee (paullee@lyft.com)
2023-01-31 14:39:31

*Thread Reply:* sorry, i should mention we're wrapping over the CheckOperator as we're still migrating from 1.10.15 @Benji Lampel

Benji Lampel (benjamin@astronomer.io)
2023-01-31 15:09:51

*Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?

Paul Lee (paullee@lyft.com)
2023-01-31 17:38:45

*Thread Reply:* like so

class CustomSQLCheckOperator(CheckOperator): ....

Paul Lee (paullee@lyft.com)
2023-01-31 17:39:30

*Thread Reply:* i think i found the issue though. we have our own get_hook function, so we don't follow the traditional Airflow way of setting conn_id, which is why conn_id is always None - and that path only gets called through OpenLineage, which doesn't ever get called with our custom wrapper

✅ Benji Lampel
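(For reference, a hypothetical sketch of that kind of fix - keeping conn_id populated on the custom operator so the extractor's BaseHook.get_connection(task.conn_id) call can resolve it; class and argument names follow the thread, the rest is assumed:)

```python
# The OpenLineage extractor reads task.conn_id, so a custom wrapper with its
# own hook resolution still needs to keep that attribute set.
from airflow.operators.sql import SQLCheckOperator

class CustomSQLCheckOperator(SQLCheckOperator):
    def __init__(self, conn_id=None, **kwargs):
        super().__init__(conn_id=conn_id, **kwargs)
        self.conn_id = conn_id  # what instantiate_abstract_extractors looks up

    def get_hook(self):
        ...  # custom hook logic can stay, as long as conn_id above is set
```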
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-01-30 03:50:39

Hi everyone, I am using openlineage to capture column level lineage from spark databricks. I noticed that the environment variables captured are only present in the start event, but are not present in the complete event. Is there a reason why it is implemented like this? It seems more intuitive that whatever variables are present in the start event should also be present in the complete event...

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-31 08:30:37

Hi everyone.. Does the DBT integration provide an option to emit events to a Kafka topic similar to the Spark integration? I could not find anything regarding this in the documentation and I wanted to make sure if only http transport type is supported. Thank you!

Julien Le Dem (julien@apache.org)
2023-01-31 12:57:47

*Thread Reply:* The dbt integration uses the Python client; you should be able to do something similar to what you can do with the Java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-01-31 13:26:33

*Thread Reply:* Thank you for this!

I created an openlineage.yml file with the following data to test out the integration locally:

transport:
  type: "kafka"
  config: { 'bootstrap.servers': 'localhost:9092' }
  topic: "ol_dbt_events"

However, I run into a no module named 'confluent_kafka' error from this code:

Running OpenLineage dbt wrapper version 0.19.2
This wrapper will send OpenLineage events at the end of dbt execution.
Traceback (most recent call last):
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/bin/dbt-ol", line 168, in <module>
    main()
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/bin/dbt-ol", line 94, in main
    client = OpenLineageClient.from_environment()
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/client.py", line 73, in from_environment
    return cls(transport=get_default_factory().create())
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/factory.py", line 37, in create
    return self._create_transport(yml_config)
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/factory.py", line 69, in _create_transport
    return transport_class(config_class.from_dict(config))
  File "/Users/susmithaanandarao/.pyenv/virtualenvs/dbt-examples-domain-repo/3.9.8/lib/python3.9/site-packages/openlineage/client/transport/kafka.py", line 43, in __init__
    import confluent_kafka as kafka
ModuleNotFoundError: No module named 'confluent_kafka'

Manually installing confluent-kafka worked. But I am curious why it was not automatically installed and if I am missing any config.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-02 14:39:29

*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install, for every user, something giant that the majority won't use - it's 100x bigger than the rest of the client.

We need to indicate this way better, and not throw this error directly at the user, though - both in docs and code.

Benji Lampel (benjamin@astronomer.io)
2023-01-31 11:28:53

~Hey, would love to see a release of OpenLineage~

➕ Michael Robinson, Jakub Dardziński, Ross Turk, Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2023-01-31 12:51:44

Hello, I have been working on a proposal to bring an OpenLineage provider to Airflow. I am currently looking for feedback on a draft AIP. See the thread here: https://lists.apache.org/thread/2brvl4ynkxcff86zlokkb47wb5gx8hw7

🔥 Maciej Obuchowski, Viraj Parekh, Jakub Dardziński, Enrico Rotundo, Harel Shein, Paweł Leszczyński
👀 Enrico Rotundo
🙌 Will Johnson
Bramha Aelem (bramhaaelem@gmail.com)
2023-01-31 14:02:21

@Willy Lulciuc, - Any updates on - https://github.com/OpenLineage/OpenLineage/discussions/1494

Natalie Zeller (natalie.zeller@naturalint.com)
2023-02-02 08:26:38

Hello, While trying to use OpenLineage with spark, I've noticed that sometimes the query execution is missing or already got closed (here is the relevant code). As a result, some of the events are skipped. Is this a known issue? Is there a way to overcome it?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-02 08:39:59

*Thread Reply:* We sometimes experience this in context of very small, quick jobs

Natalie Zeller (natalie.zeller@naturalint.com)
2023-02-02 08:43:24

*Thread Reply:* Yes, my scenarios are dealing with quick jobs. Good to know that we will be able to solve it with future spark versions. Thanks!

Michael Robinson (michael.robinson@astronomer.io)
2023-02-02 11:09:13

@channel This month’s OpenLineage TSC meeting is next Thursday, February 9th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:

  1. Recent release overview @Michael Robinson
  2. AIP: OpenLineage in Airflow
  3. Discussions:
     • Real-world implementation of OpenLineage (What does it really mean?) @Sheeri Cabral (Collibra) (continued)
     • Using namespaces @Michael Robinson
  4. Open discussion

Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
🔥 Maciej Obuchowski, Bramha Aelem, Viraj Parekh, Brad Paskewitz, Harel Shein
👍 Bramha Aelem, Viraj Parekh, Enrico Rotundo, Daniel Henneberger
Michael Robinson (michael.robinson@astronomer.io)
2023-02-03 13:22:51

Hi folks, I’m opening a vote to release OpenLineage 0.20.0, featuring:
• Airflow: add new extractor for GCSToGCSOperator Adds a new extractor for this operator.
• Proxy: implement lineage event validator for client proxy Implements logic in the proxy (which is still in development) for validating and handling lineage events.
• A fix of a breaking change in the common integration and other bug fixes in the DBT, Airflow, Spark, and SQL integrations and in the Java and Python clients.
As per the policy here, three +1s from committers will authorize. Thanks in advance.

➕ Willy Lulciuc, Maciej Obuchowski, Julien Le Dem, Jakub Dardziński, Howard Yoo
Willy Lulciuc (willy@datakin.com)
2023-02-03 13:24:03

*Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯

Michael Robinson (michael.robinson@astronomer.io)
2023-02-03 13:35:38

*Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.

Minkyu Park (minkyu@datakin.com)
2023-02-03 14:02:52

*Thread Reply:* Lol the proxy is still in development and not ready for use

Willy Lulciuc (willy@datakin.com)
2023-02-03 14:03:26

*Thread Reply:* Good point! Let’s make that clear in the release / docs?

👍 Michael Robinson, Minkyu Park
Minkyu Park (minkyu@datakin.com)
2023-02-03 14:03:33

*Thread Reply:* But it doesn’t block anything anyway, so happy to see the release

Minkyu Park (minkyu@datakin.com)
2023-02-03 14:04:38

*Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳

🎊 Willy Lulciuc
Daniel Joanes (djoanes@gmail.com)
2023-02-06 00:01:49

Hey 👋 From what I gather, there's no solution for getting column-level lineage from Spark streaming jobs. Is there an issue I can follow to keep track?

Ross Turk (ross@datakin.com)
2023-02-06 14:47:15

*Thread Reply:* Hey @Daniel Joanes! thanks for the question.

I am not aware of an issue that captures this. Column-level lineage is a somewhat new facet in the spec, and implementations across the various integrations are in varying states of readiness.

I invite you to create the issue - that way it's attributed to you, which makes sense because you're the one who first raised it. But I'm happy to create it for you & give you the PR# if you'd rather, just let me know 👍

Daniel Joanes (djoanes@gmail.com)
2023-02-06 14:50:59

*Thread Reply:* Go for it, once it's created i'll add a watch

👍 Ross Turk
Daniel Joanes (djoanes@gmail.com)
2023-02-06 14:51:13

*Thread Reply:* Thanks Ross!

Ross Turk (ross@datakin.com)
2023-02-06 23:10:30

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581

Michael Robinson (michael.robinson@astronomer.io)
2023-02-07 18:46:50

@channel OpenLineage 0.20.4 is now available, including:
Additions:
• Airflow: add new extractor for GCSToGCSOperator #1495 @sekikn
• Flink: resolve topic names from regex, support 1.16.0 #1522 @pawel-big-lebowski
• Proxy: implement lineage event validator for client proxy #1469 @fm100
Changes:
• CI: use ruff instead of flake8, isort, etc., for linting and formatting #1526 @mobuchowski
Plus many bug fixes & doc changes. Thank you to all our contributors!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.20.4
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.19.2...0.20.4
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🎉 Kengo Seki, Harel Shein, Willy Lulciuc, Nadav Geva
Michael Robinson (michael.robinson@astronomer.io)
2023-02-08 15:31:32

@channel Friendly reminder: this month’s OpenLineage TSC meeting is tomorrow at 10am, and all are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1675354153489629

❤️ Minkyu Park, Kengo Seki, Paweł Leszczyński, Harel Shein, Sheeri Cabral (Collibra), Enrico Rotundo
Harel Shein (harel.shein@gmail.com)
2023-02-09 10:50:07

Hey, can we please schedule a release of OpenLineage? I would like to have a release that includes the latest fixes for Async Operator on Airflow and some dbt bug fixes.

➕ Michael Robinson, Maciej Obuchowski, Benji Lampel, Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2023-02-09 10:50:49

*Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-09 11:15:35

*Thread Reply:* 0.20.5 ?

➕ Harel Shein
Benji Lampel (benjamin@astronomer.io)
2023-02-09 11:28:20

*Thread Reply:* @Michael Robinson auth'd

Michael Robinson (michael.robinson@astronomer.io)
2023-02-09 11:32:06

*Thread Reply:* 👍 the release is authorized

❤️ Sheeri Cabral (Collibra), Willy Lulciuc, Paweł Leszczyński
Avinash Pancham (avinashpancham@outlook.com)
2023-02-09 15:57:58

Hi all, I have been experimenting with OpenLineage for a few days and it's great! I successfully setup the openlineage-spark listener on my Databricks cluster and that pushes openlineage data to our Marquez backend. That was all pretty easy to do 🙂

Now for my challenge: I would like to actually extend the metadata that my cluster pushes with custom values (you can think of spark config settings, commit hash of the executed code, or maybe even runtime defined values). I browsed through some documentation and found custom facets one can define. The link below describes how to use Python to push custom metadata to a backend, but I was actually hoping that there was a way to do this automatically in Spark. So ideally I would like to write my own OpenLineage.json (that has my custom facet) and tell Spark to use that Openlineage spec instead of the default one. In that way I hope my custom metadata will be forwarded automatically.

I just do not know how to do that (and whether that is even possible), since I could not find any tutorials on that topic. Any help on this would be greatly appreciated!

https://openlineage.io/docs/spec/facets/custom-facets

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-02-09 16:23:36

*Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-02-10 02:23:40

*Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.

Although Openlineage spec provides some built-in facets definition, a facet object can be anything you want (https://openlineage.io/apidocs/openapi/#tag/OpenLineage/operation/postRunEvent). The example metadata provided in this chat could be put into job or run facets I believe.

There is also a way to extend Spark integration to collect custom metadata described here (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending). One needs to create own JAR with DatasetFacetBuilders, RunFacetsBuilder (whatever is needed). openlineage-spark integration will make use of those bulders.
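For the run-facet route with the Python client, a minimal sketch (the facet name "myEnv" and its fields are invented for illustration):

```python
# Emit a START event carrying a custom run facet via openlineage-python.
from datetime import datetime, timezone
from uuid import uuid4

import attr
from openlineage.client import OpenLineageClient
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

@attr.s
class MyEnvFacet(BaseFacet):
    commitHash: str = attr.ib()   # invented field
    sparkConf: dict = attr.ib()   # invented field

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4()),
                facets={"myEnv": MyEnvFacet("abc123", {"spark.executor.memory": "4g"})}),
        job=Job(namespace="example", name="example_job"),
        producer="https://example.com/producer",
    )
)
```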

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-02-10 09:09:10

*Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-02-14 16:51:28

*Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-15 08:43:08

*Thread Reply:* I think @Will Johnson might have something to add about customization

Will Johnson (will@willj.co)
2023-02-15 23:58:36

*Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!

@Susmitha Anandarao - You might take a look at https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java which has a hard coded set of properties we are extracting.

It looks like Avinash's changes were accepted as well: https://github.com/OpenLineage/OpenLineage/pull/1545

Michael Robinson (michael.robinson@astronomer.io)
2023-02-10 12:42:24

@channel OpenLineage 0.20.6 is now available, including:
Additions:
• Airflow: add new extractor for FTPFileTransmitOperator #1603 @sekikn
Changes:
• Airflow: make extractors for async operators work #1601 @JDarDagran
Thanks to all our contributors! For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.20.6
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.20.4...0.20.6
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🥳 Minkyu Park, Willy Lulciuc, Kengo Seki, Paweł Leszczyński, Anirudh Shrinivason, pankaj koti, Maciej Obuchowski
❤️ Minkyu Park, Ross Turk, Willy Lulciuc, Kengo Seki, Paweł Leszczyński, Anirudh Shrinivason, pankaj koti
🎉 Minkyu Park, Willy Lulciuc, Kengo Seki, Anirudh Shrinivason, pankaj koti
Michael Robinson (michael.robinson@astronomer.io)
2023-02-13 14:20:26

Hi everyone, in case you missed the announcement at the most recent community meeting, our first-ever meetup will be held on March 9th in Providence, RI. Join us there to learn more about the present and future of OpenLineage, meet other members of the ecosystem, learn about the project’s goals and fundamental design, and participate in a discussion about the future of the project. Food will be provided, and the meetup is open to all. Don’t miss this opportunity to influence the direction of this important new standard! We hope to see you there. More information: https://openlineage.io/blog/data-lineage-meetup/

🎉 Harel Shein, Ross Turk, Maciej Obuchowski, Kengo Seki, Paweł Leszczyński, Willy Lulciuc, Sheeri Cabral (Collibra)
🔥 Harel Shein, Ross Turk, Maciej Obuchowski, Anirudh Shrinivason, Kengo Seki, Paweł Leszczyński, Willy Lulciuc, Sheeri Cabral (Collibra)
Quentin Nambot (qnambot@gmail.com)
2023-02-15 04:52:27

Hi, I opened a PR to fix the way that the Athena extractor gets the database, but the Spark integration tests failed. However, I don't think it is related to my PR, since I only updated the Airflow integration. Can anybody help me with that please? 🙏

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-15 07:19:39

*Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-15 07:21:49

*Thread Reply:* Feel free to un-draft your PR if you think it's ready for review

Quentin Nambot (qnambot@gmail.com)
2023-02-15 08:03:56

*Thread Reply:* Ok nice thanks 👍

Quentin Nambot (qnambot@gmail.com)
2023-02-15 08:04:49

*Thread Reply:* I think it's ready, however should I update the version somewhere?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-15 08:42:39

*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened PR as Draft , so I'm not sure if you want to add something else to it.

👍 Quentin Nambot
Quentin Nambot (qnambot@gmail.com)
2023-02-15 08:43:36

*Thread Reply:* No I don't want to add anything so I opened it 👍

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-15 21:26:37

@here I have a question about extending the spark integration. Is there a way to use a custom visitor factory? I am trying to see if I can add a visitor for a command that is not currently covered in this integration (AlterTableAddPartitionCommand). It seems that because it's not in the base visitor factory, I am unable to use the visitor I created.

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-15 21:32:19

*Thread Reply:* I have that set up already like this:

public class LyftOpenLineageEventHandlerFactory implements OpenLineageEventHandlerFactory {
  @Override
  public Collection<PartialFunction<LogicalPlan, List<OutputDataset>>> createOutputDatasetQueryPlanVisitors(OpenLineageContext context) {
    Collection<PartialFunction<LogicalPlan, List<OutputDataset>>> visitors =
        new ArrayList<PartialFunction<LogicalPlan, List<OutputDataset>>>();
    visitors.add(new LyftInsertIntoHadoopFsRelationVisitor(context));
    visitors.add(new AlterTableAddPartitionVisitor(context));
    visitors.add(new AlterTableDropPartitionVisitor(context));
    return visitors;
  }
}

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-15 21:33:35

*Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-15 21:34:30

*Thread Reply:* .

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-15 21:34:49

*Thread Reply:* @Michael Collado

Michael Collado (collado.mike@gmail.com)
2023-02-15 21:35:14

*Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one

Michael Collado (collado.mike@gmail.com)
2023-02-15 21:35:48

*Thread Reply:* Have you added the file to the META-INF folder of your jar?
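(For reference, a minimal sketch of that registration, assuming the standard ServiceLoader layout the Spark integration uses - the package name below is hypothetical:)

```
# src/main/resources/META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory
com.example.lineage.LyftOpenLineageEventHandlerFactory
```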

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-16 11:01:56

*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands (AlterTableAddPartitionCommand is one)

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-16 11:02:29

*Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-16 11:05:22

*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe I'm wrong @Michael Collado

Michael Collado (collado.mike@gmail.com)
2023-02-16 15:05:19

*Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.

Michael Collado (collado.mike@gmail.com)
2023-02-16 15:09:21

*Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?

Allison Suarez (asuarezmiranda@lyft.com)
2023-02-16 16:54:46

*Thread Reply:* This was helpful, it works now, thank you so much Michael!

slackbot
2023-02-16 19:08:26

This message was deleted.

👋 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-02-16 19:09:49

*Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)

thebruuu (bruno.c@inwind.it)
2023-02-16 19:18:28

*Thread Reply:* yep I am running:

curl -X POST http://localhost:5000/api/v1/namespaces/test ^
  -H 'Content-Type: application/json' ^
  -d '{"ownerName": "me", "description": "no description"}'

the weird thing is the log, where I don't have a 0.0.0.0 IP (the log corresponds to the equivalent Postman command):

marquez-api | WARN [2023-02-17 00:14:32,695] marquez.logging.LoggingMdcFilter: status: 405
marquez-api | XXX.23.0.1 - - [17/Feb/2023:00:14:32 +0000] "POST /api/v1/namespaces/test HTTP/1.1" 405 52 "-" "PostmanRuntime/7.30.0" 2

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:23:08

*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:

marquez-api | DELETE /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | GET    /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | PUT    /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)

To ADD a namespace, you’ll want to use PUT (see API docs)
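(For illustration, the same call from Python - a minimal sketch using requests against a local Marquez; the namespace name is an example:)

```python
# PUT creates (or updates) a namespace in Marquez.
import requests

resp = requests.put(
    "http://localhost:5000/api/v1/namespaces/my-namespace",
    json={"ownerName": "me", "description": "no description"},
)
print(resp.status_code, resp.json())
```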

thebruuu (bruno.c@inwind.it)
2023-02-16 19:26:23

*Thread Reply:* 3rd stupid question of the night. Sorry, I kept on trying POST, who knows why

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:26:56

*Thread Reply:* no worries! keep the questions coming!

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:29:46

*Thread Reply:* well, maybe because it’s so late on your end! get some rest!

thebruuu (bruno.c@inwind.it)
2023-02-16 19:36:25

*Thread Reply:* Yeah, but I want to see how it works. Right now I have a response 200 for the creation of the namespace ... but it seems that nothing occurred, neither on the Marquez front end (localhost:3000) nor in the database

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:37:13

*Thread Reply:* can you curl the list namespaces endpoint?

thebruuu (bruno.c@inwind.it)
2023-02-16 19:38:14

*Thread Reply:* yep : nothing changed only default and food_delivery

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:38:47

*Thread Reply:* can you post your server logs? you should see the request

thebruuu (bruno.c@inwind.it)
2023-02-16 19:40:41

*Thread Reply:*
marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7
marquez-api | INFO [2023-02-17 00:32:07,072] marquez.logging.LoggingMdcFilter: status: 200

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:41:12

*Thread Reply:* the server is returning a 500 ?

Willy Lulciuc (willy@datakin.com)
2023-02-16 19:41:57

*Thread Reply:* odd that LoggingMdcFilter is logging 200

thebruuu (bruno.c@inwind.it)
2023-02-16 19:43:24

*Thread Reply:* Bit confused because now I realize that postman is returning bad request


thebruuu (bruno.c@inwind.it)
2023-02-16 19:44:30

*Thread Reply:* You'll notice that I had to use 3000 in the URL; if I use 5000 I get "No host"

Willy Lulciuc (willy@datakin.com)
2023-02-17 01:14:50

*Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?

thebruuu (bruno.c@inwind.it)
2023-02-17 03:43:29

*Thread Reply:* Hello Willy
I am starting from scratch, following the instructions from https://openlineage.io/docs/getting-started/
I am on Windows. Instead of git clone git@github.com:MarquezProject/marquez.git && cd marquez I run:

git clone <https://github.com/MarquezProject/marquez.git>

But before that I had to turn off the automatic carriage return in git:

git config --global core.autocrlf false

This avoids an error message on marquez-api when running wait-for-it.sh, at line 1, where #!/usr/bin/env bash is otherwise read as #!/usr/bin/env bash\r'

It turns out that switching off the auto CR also impacts some files containing the marquez password ... and I get a failure on accessing the db. To overcome this I ran Notepad++ and replaced ALL the \r\n with \n. That way I managed to run docker\up.sh and docker\down.sh correctly (with or without seed, with access to the db via pgAdmin)

👍 Ernie Ostic
thebruuu (bruno.c@inwind.it)
2023-02-20 03:40:48

*Thread Reply:* The issue is related to Postman

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-17 03:39:07

Hi, I'd like to capture column lineage from spark, but also capture how the columns are transformed, and any column operations that are done too. May I ask if this feature is supported currently, or will be supported in future based on current timeline? Thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-02-17 03:54:47

*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information:

"transformationDescription": {
  "type": "string",
  "description": "a string representation of the transformation applied"
},
"transformationType": {
  "type": "string",
  "description": "IDENTITY|MASKED reflects a clearly defined behavior. IDENTITY: exact same as input; MASKED: no original data available (like a hash of PII for example)"
}

so the standard is ready to support it. We included two fields, so that one can contain a human-readable description of what is happening. However, we don't have this implemented in the Spark integration.
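For illustration, this is roughly how those fields sit inside the columnLineage dataset facet in an event (all values made up):

```
"columnLineage": {
  "fields": {
    "user_id": {
      "inputFields": [
        {"namespace": "postgres://db:5432", "name": "public.users", "field": "id"}
      ],
      "transformationDescription": "identity",
      "transformationType": "IDENTITY"
    }
  }
}
```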

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-17 04:02:30

*Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-02-17 04:08:16

*Thread Reply:* I think there will be growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.

To sum up, we don't yet have a proposal on that, but this seems to be a natural next step in enriching the column lineage features.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-17 04:40:04

*Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-02-17 04:43:00

*Thread Reply:* Great to hear that. What you could perhaps start with is coming to our monthly OpenLineage meetings and asking @Michael Robinson to put this item on the discussions list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phases.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-17 04:44:18

*Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live because of the time zone, but I'll try my best to make it to the next one! Thanks!

Michael Robinson (michael.robinson@astronomer.io)
2023-02-17 09:46:22

*Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.

👍 Anirudh Shrinivason
:gratitude_thank_you: Anirudh Shrinivason
thebruuu (bruno.c@inwind.it)
2023-02-17 15:12:57

Hello, how can I improve the verbosity of the marquez-api? Regards

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-02-20 02:10:13

*Thread Reply:* Hi @thebruuu, please take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.
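
For reference, a minimal sketch of that section (assuming Dropwizard's standard logging keys; adjust to your Marquez version):
```# marquez.yml
logging:
  level: INFO          # root level: ERROR, WARN, INFO, DEBUG, TRACE
  loggers:
    marquez: DEBUG     # more verbose logs for Marquez's own packages
  appenders:
    - type: console```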

thebruuu (bruno.c@inwind.it)
2023-02-20 03:29:07

*Thread Reply:* Thank you, Paweł

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-21 02:23:40

Hey, can we please schedule a release of OpenLineage? I would like to have the release that includes the feature to capture custom env variables from spark clusters... Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-02-21 09:12:17

*Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?

Michael Robinson (michael.robinson@astronomer.io)
2023-02-21 11:38:50

*Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-02-21 21:45:58

*Thread Reply:* Okay yeah sure that works. Thanks

🙌 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-03-01 10:12:45

*Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-01 21:22:40

*Thread Reply:* Awesome thanks

Jingyi Chen (jingyi@cloudshuttle.com.au)
2023-02-23 23:43:23

Hello team, we have OpenLineage and Great Expectations integrated. I want to use GE to verify a table in Snowflake. The configuration that adds OpenLineage to GE produced this error after running. Could someone please give me some answers? 👀
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/great_expectations/validation_operators/validation_operators.py", line 469, in _run_actions
    action_result = self.actions[action["name"]].run(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/great_expectations/checkpoint/actions.py", line 106, in run
    return self._run(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 156, in _run
    datasets = self._fetch_datasets_from_sql_source(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 362, in _fetch_datasets_from_sql_source
    self._get_sql_table(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/openlineage/common/provider/great_expectations/action.py", line 395, in _get_sql_table
    if engine.connection_string:
AttributeError: 'Engine' object has no attribute 'connection_string'

Jingyi Chen (jingyi@cloudshuttle.com.au)
2023-02-23 23:44:03

*Thread Reply:* This is my checkpoint configuration in GE.
```name: 'openlineage_checkpoint'
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: '%Y%m%d-%H%M%S-my_checkpoint'
expectation_suite_name: EMAIL_VALIDATION
batch_request:
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
  - name: openlineage
    action:
      class_name: OpenLineageValidationAction
      module_name: openlineage.common.provider.great_expectations
      openlineage_host: http://localhost:5000
      # openlineage_apiKey: 12345
      openlineage_namespace: ge_expectations # Replace with your job namespace; we recommend a meaningful namespace like dev or prod, etc.
      job_name: ge_validation
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: LANDING_DEV
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: 'snowpipe.pii'
      data_connector_query:
        index: -1
    expectation_suite_name: EMAIL_VALIDATION
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:```

Benji Lampel (benjamin@astronomer.io)
2023-02-24 11:31:05

*Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?

Jingyi Chen (jingyi@cloudshuttle.com.au)
2023-02-26 20:05:12

*Thread Reply:* I use the latest version of Great Expectations. This error occurs whether I run it directly through Great Expectations or through Airflow

Benji Lampel (benjamin@astronomer.io)
2023-02-27 09:10:00

*Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.

Benji Lampel (benjamin@astronomer.io)
2023-02-27 09:11:34

*Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment

Jingyi Chen (jingyi@cloudshuttle.com.au)
2023-02-27 18:07:29

*Thread Reply:* Thanks Benji, but I still have the same problem after dropping to great-expectations==0.15.44. This is my requirements file:

great_expectations==0.15.44
sqlalchemy
psycopg2-binary
numpy
pandas
snowflake-connector-python
snowflake-sqlalchemy
Benji Lampel (benjamin@astronomer.io)
2023-02-28 13:34:03

*Thread Reply:* interesting... I do think this may be a GX issue, so let's see if they say anything. I can also cross-post this thread to their Slack

Saravanan (saravanan@athivatech.com)
2023-03-01 00:27:30

Hello Team, I’m trying to use Open Lineage with AWS Glue and Marquez. Has anyone successfully integrated AWS Workflows/ Glue ETL jobs with Open Lineage?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 11:47:40

*Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/ Are you using AWS Glue with Spark jobs?

Saravanan (saravanan@athivatech.com)
2023-05-02 15:16:14

*Thread Reply:* This was proposed by our AWS Solutions Architect, but we are not seeing much improvement compared to OpenLineage. Have you deployed the above solution to prod?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-25 11:30:44

*Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄

Michael Robinson (michael.robinson@astronomer.io)
2023-03-01 09:42:23

Hello everyone, I’m opening a vote to release OpenLineage 0.21.0, featuring:
• a new CustomEnvironmentFacetBuilder class and new output visitors AlterTableAddPartitionCommandVisitor and AlterTableSetLocationCommandVisitor in the Spark integration
• a Linux-ARM version of the SQL parser’s native library
• DEBUG logging of events in transports
• bug fixes and more.
Three +1s from committers will authorize an immediate release.

➕ Maciej Obuchowski, Jakub Dardziński, Benji Lampel, Natalie Zeller, Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2023-03-01 10:26:22

*Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.

Nigel Jones (nigel.l.jones@gmail.com)
2023-03-02 03:52:03

I’ve got some security related questions/observations. The main site suggests opening an issue to report vulnerabilities etc. I wanted to check if there is a private mailing list/DM channel to just check a few things first? I’m happy to use github issues otherwise. Thanks!

Moritz E. Beber (midnighter@posteo.net)
2023-03-02 05:15:55

*Thread Reply:* GitHub actually has a new issue template for reporting vulnerabilities, if you use a config that enables it.

Michael Robinson (michael.robinson@astronomer.io)
2023-03-02 10:21:16

Reminder: our first meetup is one week from today in Providence, RI! You can find the details in the meetup blog post. And if you’re coming, it would be great if you could RSVP. Looking forward to seeing some of you there!

🎉 Kengo Seki
🚀 Kengo Seki
✅ Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2023-03-02 16:52:50

@channel We released OpenLineage 0.21.1, including:
Additions
• Clients: add DEBUG logging of events to transports #1633 by @mobuchowski
• Spark: add CustomEnvironmentFacetBuilder class #1545 by new contributor @Anirudh181001
• Spark: introduce the new output visitors AlterTableAddPartitionCommandVisitor and AlterTableSetLocationCommandVisitor #1629 by new contributor @nataliezeller1
• Spark: add column lineage for JDBC relations #1636 by @tnazarew
• SQL: add linux-aarch64 native library to Java SQL parser #1664 by @mobuchowski
Changes
• Airflow: get table database in Athena extractor #1631 by new contributor @rinzool
Removals
• Airflow: remove JobIdMapping and update macros to better support Airflow version 2+ #1645 by @JDarDagran
Thanks to all our contributors! For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.21.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.20.6...0.21.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🎉 Kengo Seki, Harel Shein, Maciej Obuchowski
🚀 Kengo Seki, Harel Shein, Maciej Obuchowski
Paul Lee (paullee@lyft.com)
2023-03-02 19:01:23

how do you turn off the OpenLineage listener in Airflow 2? for some reason we're seeing a Thread-2 and seeing it fire twice in tasks

Harel Shein (harel.shein@gmail.com)
2023-03-02 20:04:19

*Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?

Harel Shein (harel.shein@gmail.com)
2023-03-02 20:06:00

*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 that was fixed in 0.20.6

Paul Lee (paullee@lyft.com)
2023-03-03 16:15:44

*Thread Reply:* hmm perhaps.

Paul Lee (paullee@lyft.com)
2023-03-03 16:15:55

*Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?

Harel Shein (harel.shein@gmail.com)
2023-03-03 16:24:07

*Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?

👍 Paul Lee
Harel Shein (harel.shein@gmail.com)
2023-03-03 16:24:50

*Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars

👍 Paul Lee
Paul Lee (paullee@lyft.com)
2023-03-06 14:43:56

*Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both

Michael Robinson (michael.robinson@astronomer.io)
2023-03-02 20:29:42

@channel This month’s OpenLineage TSC meeting is next Thursday, March 9th, at 10 am PT. Join us on Zoom: https://bit.ly/OLzoom. All are welcome! On the tentative agenda:

  1. Recent release overview
  2. A new consumer
  3. Custom env variable support in Spark
  4. Async operator support in Airflow
  5. JDBC relations support in Spark
  6. Discussion topics:
     • New feature idea: column transformations/operations in the Spark integration
     • Using namespaces
  7. Open discussion

Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
🙌 Willy Lulciuc, Paweł Leszczyński, Maciej Obuchowski, alexandre bergere
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-02 21:48:29

Hi everyone, I noticed that OpenLineage is sending each of the events twice for Spark. Is this expected? Is there some way to disable this behaviour?

Will Johnson (will@willj.co)
2023-03-02 23:46:08

*Thread Reply:* Are you seeing duplicate START events, or do you see two events, one that is a START and one that is a COMPLETE?

OpenLineage events may contain partial information. You should expect to collect all events for a given runId and merge them together to get the complete picture, as in the sketch below.

In addition, some data sources are really chatty, like Delta tables. That may cause you to see many events that look very similar.
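
To make the merging idea concrete, here is a consumer-side sketch (not part of any OpenLineage client; the event dicts are assumed to follow the OL spec):
```from collections import defaultdict
from typing import Dict, List

def merge_events(events: List[dict]) -> Dict[str, dict]:
    """Fold partial START/COMPLETE events into one view per runId."""
    merged: Dict[str, dict] = defaultdict(lambda: {"eventTypes": [], "runFacets": {}})
    for event in events:
        run_id = event["run"]["runId"]
        merged[run_id]["eventTypes"].append(event["eventType"])
        # later events win for overlapping facets; partial events fill the gaps
        merged[run_id]["runFacets"].update(event["run"].get("facets", {}))
    return dict(merged)```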

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-03 00:45:19

*Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-03 00:45:27

*Thread Reply:* And 2 complete

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-03 00:46:08

*Thread Reply:* I am currently only testing on parquet tables...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-03-03 02:31:28

*Thread Reply:* One of OpenLineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with Delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case other unwanted events are being generated and sent.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-05 22:59:06

*Thread Reply:* Ahh I see...okay thanks!

Daniel Joanes (djoanes@gmail.com)
2023-03-07 00:05:48

*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates in other valid scenarios

➕ Maciej Obuchowski, Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-09 14:10:47

*Thread Reply:* Hi.. I compared some duplicate 'START' events just now and noticed that they are exactly the same, the only exception being that one of them has an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-10 01:18:18

*Thread Reply:* CC: @Paweł Leszczyński ^

Michael Robinson (michael.robinson@astronomer.io)
2023-03-08 11:15:48

@channel Reminder: this month’s OpenLineage TSC meeting is tomorrow at 10am PT. All are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1677806982084969

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-03-08 15:51:07

Hi, if we have the OpenLineage listener configured as a default Spark conf, is there an easy way to disable OL for a specific notebook?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-08 17:30:44

*Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true

:gratitude_thank_you: Susmitha Anandarao
Benji Lampel (benjamin@astronomer.io)
2023-03-10 13:15:41

Hey all,

I opened a PR (and corresponding issue) to change how naming works in OpenLineage. The idea generally is to move from Naming.md as the end-all-be-all of names for integrations, and towards JSON schemas per integration, with each schema defining very precisely what fields a name and namespace should contain, how they're connected, and how they're validated. Would really appreciate some feedback as this is a pretty big change!

Sunil Patil (spatil@twilio.com)
2023-03-13 17:05:56

What do I need to do to enable DAG-level metric capturing for Airflow? I followed the instructions to install openlineage 0.21.1 on Airflow 2.3.3. When I run a DAG I see metrics related to task start and success/failure, but I don't see any metrics for DAG success/failure. Do I have to do something to enable DAG execution capturing?

Sunil Patil (spatil@twilio.com)
2023-03-13 17:08:53

*Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-03-13 17:11:47

*Thread Reply:* you're right, except the change was included in 2.5.0

🙏 Sunil Patil
Sunil Patil (spatil@twilio.com)
2023-03-13 17:43:15

*Thread Reply:* Thanks Jakub

Michael Robinson (michael.robinson@astronomer.io)
2023-03-14 15:37:34

Fresh on the heels of our first-ever in-person event, we’re meeting up again soon at Data Council Austin! Join us on March 30th (the same day as @Julien Le Dem’s talk) at 12:15 pm to discuss the project’s goals and design, meet other members of the data ecosystem, and help shape the future of the spec. For more info, check out the OpenLineage blog. If you haven’t registered for the conference yet, click and use promo code OpenLineage20 for a special rate. Hope to see you there!

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-03-15 15:11:18

If someone is using airflow and DAG-docs for lineage, can they export the lineage in, say, OL format?

Harel Shein (harel.shein@gmail.com)
2023-03-15 15:18:22

*Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?

❤️ Sheeri Cabral (Collibra)
Benji Lampel (benjamin@astronomer.io)
2023-03-15 15:22:00

*Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)

❤️ Sheeri Cabral (Collibra)
Ross Turk (ross@datakin.com)
2023-03-15 15:58:40

*Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-03-15 15:13:55

(is it https://docs.astronomer.io/learn/airflow-openlineage ? )

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-03-17 12:31:02

Happy Friday 👋 I am looking for some help setting the parent information for a dbt run. I have set the namespace variable in openlineage.yml, but it doesn't seem to take effect and ends up using the default value of dbt. I'm also using openlineage.yml to set the transport properties for emitting to Kafka. Is there a way to set the parent namespace, name, and run id in the yml file? Thank you!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-03-18 12:09:23

*Thread Reply:* dbt-ol does not read from openlineage.yml, so you need to pass this information in the OPENLINEAGE_NAMESPACE environment variable

Ross Turk (ross@datakin.com)
2023-03-20 15:17:03

*Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.

Susmitha Anandarao (susmitha.anandarao@gmail.com)
2023-03-20 15:22:07

*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env var OPENLINEAGE_CONFIG to point at the yml file holding transport info, along with OPENLINEAGE_NAMESPACE

👍 Ross Turk
Ross Turk (ross@datakin.com)
2023-03-20 15:24:05

*Thread Reply:* Awesome! That’s exactly what I was going to test.

Ross Turk (ross@datakin.com)
2023-03-20 15:25:04

*Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.

:gratitude_thank_you: Susmitha Anandarao
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-03-21 08:32:17

*Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read the OL namespace from openlineage.yml but from the OPENLINEAGE_NAMESPACE env var instead
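
To recap the working setup from this thread (a sketch; the broker, topic, and namespace values are placeholders):
```# openlineage.yml, pointed to by OPENLINEAGE_CONFIG; with dbt-ol the
# namespace comes from the env var, not from this file:
#   export OPENLINEAGE_CONFIG=/path/to/openlineage.yml
#   export OPENLINEAGE_NAMESPACE=my_namespace
transport:
  type: kafka
  topic: openlineage.events
  config:
    bootstrap.servers: broker:9092```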

Michael Robinson (michael.robinson@astronomer.io)
2023-03-21 13:48:28

Data Council Austin, the host of our next meetup, is one week away: https://openlineage.slack.com/archives/C01CK9T7HKR/p1678822654288379

Michael Robinson (michael.robinson@astronomer.io)
2023-03-21 13:52:52

In addition to Data Council Austin next week, the hybrid Big Data Technology Warsaw Summit will be taking place on March 28th-30th, featuring three of our committers: @Maciej Obuchowski, @Paweł Leszczyński and @Ross Turk ! There’s more info here: https://bigdatatechwarsaw.eu/

🙌 Howard Yoo, Maciej Obuchowski, Jakub Dardziński, Ross Turk, Perttu Salonen
👍 thebruuu
Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-22 22:38:26

hey folks, is anyone capturing dataset metadata for multi-table schemas? I'm looking at the schema dataset facet: https://openlineage.io/docs/spec/facets/dataset-facets/schema but it looks like this only represents a single table, so I'm wondering if I'll need to write a custom facet

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-23 04:25:19

*Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table

Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-23 10:55:58

*Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-23 11:11:25

*Thread Reply:* So what I understand:

  1. your single job represents synchronization of multiple tables
  2. you want to have precise input-output dataset lineage, am I right?

I would model that as multiple OL jobs, each describing one dataset mapping. Additionally, I'd have one "wrapping" job that represents your definition of a job. The rest of those jobs would refer to it in ParentRunFacet, as in the sketch below.

This is a pattern we use for Airflow and dbt dags.
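
A sketch of that pattern with the Python client (the namespace and job names are made up):
```# One umbrella run for the whole sync; each per-table job's run points
# back to it via the parent run facet.
from uuid import uuid4

from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import Run

sync_run_id = str(uuid4())  # runId of the wrapping "sync" job

def child_run() -> Run:
    return Run(
        runId=str(uuid4()),  # unique per table mapping
        facets={
            "parent": ParentRunFacet.create(
                runId=sync_run_id,
                namespace="fivetran",     # made-up namespace
                name="sync_all_tables",   # the wrapping job's name
            )
        },
    )```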

Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-23 12:57:15

*Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me

Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-24 15:56:27

has anyone had success creating custom facets using Java? I'm following this guide: https://openlineage.io/docs/spec/facets/custom-facets and I'm wondering if it makes sense to manually create POJOs, or if others are creating the JSON schema for the object and then automatically generating the Java code?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-27 05:26:06

*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.

For now, the JSON Schema generator isn't flexible enough to generate custom facets from whatever schema we give it, so it would be unnecessary complexity

:gratitude_thank_you: Brad Paskewitz
Julien Le Dem (julien@apache.org)
2023-03-27 12:29:57

*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JsonNode or even a Map.
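
For comparison, the same plain-object approach in the Python client looks roughly like this (a hedged sketch; MyCustomRunFacet and its field are made-up names):
```import attr

from openlineage.client.facet import BaseFacet

@attr.s
class MyCustomRunFacet(BaseFacet):
    # serialized straight to JSON by the client, much like a Jackson POJO
    connectorVersion: str = attr.ib(default="")```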

:gratitude_thank_you: Brad Paskewitz
Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-27 14:01:07

One other question: I'm in the process of adding different types of facets to our base payloads, and I'm wondering if we have any related guidelines / best practices / standards / conventions. For example, if I add a full source schema as a schema dataset facet to every START event, it seems like that could be inefficient compared to a one-time full-source-schema followed by incremental diffs for each following sync. Curious how others are thinking about and solving these types of problems in practice

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-27 17:59:28

*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.

For others, like OutputStatisticsOutputDatasetFacet, you definitely can't assume that, as the data is unique to each run.

Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-03-27 19:05:14

*Thread Reply:* ok great thanks, that makes sense to me

Saravanan (saravanan@athivatech.com)
2023-03-27 21:42:20

Hi Team, I’m seeing the create data source and dataset APIs marked as deprecated. Can anyone point me to how to create datasets via API calls?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-28 04:47:31

*Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-28 06:08:18

Hi everyone, I recently encountered this error saying V2SessionCatalog is not supported by OpenLineage. May I ask if support for this will be added in the near future? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-03-28 08:05:30

*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-29 02:53:37

*Thread Reply:* Sure thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-03-29 05:34:37

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 I have opened an issue here. Thanks! 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-17 11:53:52

*Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next OpenLineage release?

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-02 19:37:27

Hi all! Can anyone provide me some advice on how to solve this error:
ValueError: `emit` only accepts RunEvent class
[2023-04-02, 23:22:00 UTC] {taskinstance.py:1326} INFO - Marking task as FAILED. dag_id=etl_openlineage, task_id=send_ol_events, execution_date=20230402T232112, start_date=20230402T232114, end_date=20230402T232200
[2023-04-02, 23:22:00 UTC] {standard_task_runner.py:105} ERROR - Failed to execute job 400 for task send_ol_events (`emit` only accepts RunEvent class; 28020)
[2023-04-02, 23:22:00 UTC] {local_task_job.py:212} INFO - Task exited with return code 1
[2023-04-02, 23:22:00 UTC] {taskinstance.py:2585} INFO - 0 downstream tasks scheduled from follow-on schedule check
I'm trying to follow this tutorial (https://openlineage.io/blog/openlineage-snowflake/) on connecting Snowflake to OpenLineage through Apache Airflow; however, the last step (sending the OpenLineage events) returns an error.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-03 09:32:46

*Thread Reply:* The blog post is a bit old, and in the meantime changes were introduced in the OpenLineage Python client. May I ask if you just want to test the flow, or are you looking for a viable Snowflake data lineage solution?

Ross Turk (ross@datakin.com)
2023-04-03 10:47:57

*Thread Reply:* I believe that this will work if you change the line to client.transport.emit()
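
That is, roughly (a sketch; ol_event stands for the dict-based event built in the tutorial script):
```from openlineage.client import OpenLineageClient

client = OpenLineageClient.from_environment()
client.transport.emit(ol_event)  # was: client.emit(ol_event), which now accepts only RunEvent```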

Ross Turk (ross@datakin.com)
2023-04-03 10:49:05

*Thread Reply:* (this would be in the dags/lineage folder, if memory serves)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-03 10:57:23

*Thread Reply:* Ross is right, that should work

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-04 12:23:13

*Thread Reply:* This works! Thank you so much!

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-04 12:24:40

*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside a Amazon DataZone Catalog 🙂

Ross Turk (ross@datakin.com)
2023-04-04 13:03:58

*Thread Reply:* I have been meaning to revisit that tutorial 👍

Michael Robinson (michael.robinson@astronomer.io)
2023-04-03 10:52:42

Hello all, I’d like to open a vote to release OpenLineage 0.22.0, including:
• a new properties facet in the Spark integration
• a new field in HttpConfig for passing custom headers in the Spark integration
• improved namespace generation for JDBC connections in the Spark integration
• removal of unnecessary warnings about column lineage in the Spark integration
• support for alter, truncate, and drop statements in the SQL parser
• typing hints in the SQL integration
• a new from_dict class method in the Python client to support creating it from a dictionary
• a case-insensitive env variable for disabling OpenLineage in the Python client and Airflow integration
• bug fixes, docs changes, and more.
Three +1s from committers will authorize an immediate release. For more details about the release process, see GOVERNANCE.md.

➕ Maciej Obuchowski, Perttu Salonen, Jakub Dardziński, Ross Turk
Michael Robinson (michael.robinson@astronomer.io)
2023-04-03 15:39:46

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.

Michael Robinson (michael.robinson@astronomer.io)
2023-04-03 16:55:44

@channel We released OpenLineage 0.22.0, including:
Additions:
• Spark: add properties facet #1717 by @tnazarew
• SQL: SQLParser supports alter, truncate and drop statements #1695 by @pawel-big-lebowski
• Common/SQL: provide public interface for openlineage_sql package #1727 by @JDarDagran
• Java client: add configurable headers to HTTP transport #1718 by @tnazarew
• Python client: create client from dictionary #1745 by @JDarDagran
Changes:
• Spark: remove URL parameters for JDBC namespaces #1708 by @tnazarew
• Make OPENLINEAGE_DISABLED case-insensitive #1705 by @jedcunningham
Removals:
• Spark: remove unnecessary warnings for column lineage #1700 by @pawel-big-lebowski
• Spark: remove deprecated configs #1711 by @tnazarew
Thanks to all the contributors! For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.22.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.21.1...0.22.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🙌 Jakub Dardziński, Francis McGregor-Macdonald, Howard Yoo, 김형은, Kengo Seki, Anirudh Shrinivason, Perttu Salonen, Paweł Leszczyński, Maciej Obuchowski, Harel Shein
🎉 Ross Turk, 김형은, Kengo Seki, Anirudh Shrinivason, Perttu Salonen
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-04 01:49:37

Hi everyone, if I set executors to 0 and bind the address to localhost, and then try to use OpenLineage to capture metadata, I seem to run into an error where the executor tries to fetch the Spark jar from the driver, even though no executors are set. Then it fails because a connection cannot be established. This is some of the error stack trace:
INFO Executor: Fetching spark://<DRIVER_IP>:44541/jars/io.openlineage_openlineage-spark-0.21.1.jar with timestamp 1680506544239
ERROR Utils: Aborting task
java.io.IOException: Failed to connect to /<DRIVER_IP>:44541
    org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
    org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
    org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:399)
    org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:367)
    scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:366)
    org.apache.spark.util.Utils$.doFetchFile(Utils.scala:755)
    org.apache.spark.util.Utils$.fetchFile(Utils.scala:541)
    org.apache.spark.executor.Executor.$anonfun$updateDependencies$13(Executor.scala:953)
    org.apache.spark.executor.Executor.$anonfun$updateDependencies$13$adapted(Executor.scala:945)
    scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
    scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
    scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
    scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
    scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
    scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:945)
    org.apache.spark.executor.Executor.<init>(Executor.scala:247)
    org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
    org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
    org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
    org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
    org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
    py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    py4j.Gateway.invoke(Gateway.java:238)
    py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    py4j.GatewayConnection.run(GatewayConnection.java:238)
    java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<DRIVER_IP>:44541
Caused by: java.net.ConnectException: Connection refused
    java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    java.base/java.lang.Thread.run(Unknown Source)
Just curious if anyone here has run into a similar problem before, and what the recommended way to resolve this would be...

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-04 13:39:19

*Thread Reply:* Do you have a small configuration and job to replicate this?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-04 22:21:35

*Thread Reply:* Yeah. For configs:
spark.driver.bindAddress: "localhost"
spark.master: "local[*]"
spark.sql.catalogImplementation: "hive"
spark.openlineage.transport.endpoint: "<endpoint>"
spark.openlineage.transport.type: "http"
spark.sql.catalog.spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
spark.openlineage.transport.url: "<url>"
spark.extraListeners: "io.openlineage.spark.agent.OpenLineageSparkListener"
and the job is submitted via spark-submit in client mode with the number of executors set to 0. The Spark job by itself could be anything... I think the job fails before initializing the Spark session itself.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-04 22:23:19

*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-05 05:38:55

*Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-10 06:07:11

*Thread Reply:* Yeah... Actually, this was because of binding the driver IP to localhost. In that case, the executor was not able to get the jar from the driver. But yeah, I don't think we could have done anything from the OpenLineage end anyway for this. It was just an interesting error to encounter lol

Lq Dodo (tryopenmetadata@gmail.com)
2023-04-04 12:07:21

Hi, I am new to OpenLineage. I was able to follow https://openlineage.io/getting-started/ to create a lineage "my-input-->my-job-->my-output". I want to use "my-output" as an input dataset and connect it to the next job, something like this: "my-input-->my-job-->my-output-->my-job2-->my-final-output". How do I do that? I have trouble setting eventType, runId, etc. Once the new lineages get messed up, the Marquez UI becomes blank (which is a separate issue).

Ross Turk (ross@datakin.com)
2023-04-04 13:02:21

*Thread Reply:* In this case you would have four runevents:

  1. a START event on my-job where my-input is the input and my-output is the output, with a runId you generate on the client
  2. a COMPLETE event on my-job with the same runId from #1
  3. a START event on my-job2 where the input is my-output and the output is my-final-output, with a separate runId you generate
  4. a COMPLETE event on my-job2 with the same runId from #3
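
A sketch of those four events with the Python client (the URL, namespace, and producer values are placeholders):
```from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
ns = "my-namespace"
producer = "https://example.com/my-producer"

my_input = Dataset(namespace=ns, name="my-input")
my_output = Dataset(namespace=ns, name="my-output")
my_final_output = Dataset(namespace=ns, name="my-final-output")

def run_job(name, inputs, outputs):
    run = Run(runId=str(uuid4()))  # a fresh runId per job run
    job = Job(namespace=ns, name=name)
    now = datetime.now(timezone.utc).isoformat()
    client.emit(RunEvent(RunState.START, now, run, job, producer, inputs, outputs))
    client.emit(RunEvent(RunState.COMPLETE, now, run, job, producer, inputs, outputs))

run_job("my-job", [my_input], [my_output])          # events 1 and 2
run_job("my-job2", [my_output], [my_final_output])  # events 3 and 4```
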
Lq Dodo (tryopenmetadata@gmail.com)
2023-04-04 14:53:14

*Thread Reply:* thanks for the response. I tried it, but now the UI only shows for like one second and then turns blank. I had a similar issue before. It seems to me that every time I add a bad lineage, the UI stops working, and I have to delete the docker image :-( Not sure whether it is a macOS M1 related issue.

Ross Turk (ross@datakin.com)
2023-04-04 16:07:06

*Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.

Lq Dodo (tryopenmetadata@gmail.com)
2023-04-04 16:24:28

*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!

👍 Ross Turk
Lq Dodo (tryopenmetadata@gmail.com)
2023-04-04 16:41:54

Is it possible to add column-level lineage via the API? Let's say I have fields A, B, C from my-input, A, B from my-output, and B, C from my-output-s3. I want to see, filter, or query by the column name.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-05 05:35:02

*Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.

However, I don't think you can currently do any filtering over it

Ross Turk (ross@datakin.com)
2023-04-05 13:20:20

*Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430

Lq Dodo (tryopenmetadata@gmail.com)
2023-04-06 11:48:48

*Thread Reply:* those examples really help. I can at least build the lineage with column-level info using the APIs. thanks a lot! Ideally I'd like to select one column from the UI and then see the column-level graph. Seems not possible.

Ross Turk (ross@datakin.com)
2023-04-06 12:46:54

*Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞

Pavani (ylpavani@gmail.com)
2023-04-05 22:01:33

Is Airflow mandatory while integrating Snowflake with OpenLineage?

I am currently looking for a solution which can capture lineage details from Snowflake execution

Harel Shein (harel.shein@gmail.com)
2023-04-06 10:22:17

*Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?

Pavani (ylpavani@gmail.com)
2023-04-06 11:26:13

*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?

Michael Robinson (michael.robinson@astronomer.io)
2023-04-06 13:12:44

@channel This month’s OpenLineage TSC meeting is on Thursday, April 20th, at 10 am PT. Meeting info: https://openlineage.io/meetings/. All are welcome! On the tentative agenda:

  1. Announcements
  2. Updates (new!)
     a. OpenLineage in Airflow AIP
     b. Static lineage support
     c. Reworking namespaces
  3. Recent release overview
  4. A new consumer
  5. Caching support for column lineage
  6. Discussion items
     a. Snowflake tagging
  7. Open discussion

Notes: https://bit.ly/OLwiki
Is there a topic you think the community should discuss at this or a future meeting? Reply or DM me to add items to the agenda.
🚀 alexandre bergere, Paweł Leszczyński
Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-06 15:27:41

Hi!

I have a specific question about how OpenLineage fits in between Amazon MWAA and Marquez on AWS EKS. I guess I need to change for example the etl_openlineage DAG in this Snowflake integration tutorial and the OPENLINEAGE_URL here. However, I'm wondering how to reproduce the Docker containers airflow, airflow_scheduler, and airflow_worker here.

I heard from @Ross Turk that @Willy Lulciuc and @Michael Collado are experts on the K8s integration for OpenLineage and Marquez. Could you provide me some recommendations on how to approach this integration? Or can anyone else help me?

Kind regards,

Tom

John Lukenoff (john@jlukenoff.com)
2023-04-07 12:47:18

[RESOLVED]👋 Hi there, I’m doing a POC of OpenLineage for our Airflow deployment. We have a ton of custom operators, and I’m trying to test out extracting lineage using the get_openlineage_facets_on_start method. Currently when I’m testing I can see that the OpenLineage plugin is running via airflow plugins, but I am not able to see that the method is ever getting called. Do I need to do anything else to tell the default extractor to use get_openlineage_facets_on_start? This is the documentation I’m referencing: https://openlineage.io/docs/integrations/airflow/extractors/default-extractors

John Lukenoff (john@jlukenoff.com)
2023-04-07 12:50:14

*Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?

John Lukenoff (john@jlukenoff.com)
2023-04-07 13:18:05

*Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-07 18:37:44

*Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?

John Lukenoff (john@jlukenoff.com)
2023-04-07 19:03:01

*Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins and have debug-level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'. This is the output of airflow plugins:

name               | macros                                        | listeners                    | source
===================+===============================================+==============================+=============================================
OpenLineagePlugin  | openlineage.airflow.macros.lineage_run_id,    | openlineage.airflow.listener | openlineage-airflow==0.22.0:
                   | openlineage.airflow.macros.lineage_parent_id  |                              | EntryPoint(name='OpenLineagePlugin',
                   |                                               |                              | value='openlineage.airflow.plugin:OpenLineagePlugin',
                   |                                               |                              | group='airflow.plugins')

Appreciate any ideas you might have!

John Lukenoff (john@jlukenoff.com)
2023-04-11 13:09:05

*Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-07 16:29:03

I have a question that I believe will be very easy to answer, and I think I know the answer already, but I want to confirm my understanding of extracting OpenLineage with airflow python scripts.

Extractors extract lineage from operators, so they have to be using operators, right? If someone asks if I can get lineage from their Airflow-orchestrated python scripts, and they show me their scripts but they’re not importing anything starting with airflow.operators, then I can’t use extractors and therefore can’t get lineage. Is that accurate?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-07 16:30:00

*Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-07 18:40:39

*Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-07 21:28:25

*Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-08 07:13:56

*Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-11 02:49:07

*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs); a skeleton is sketched below.
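
For reference, a minimal custom-extractor skeleton (a sketch; check the docs above for the exact interface in your openlineage-airflow version, and note MyOperator is a made-up name):
```from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

# registered by putting its full import path in the OPENLINEAGE_EXTRACTORS env var
class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ["MyOperator"]  # operators this extractor handles

    def extract(self) -> Optional[TaskMetadata]:
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # fill with Dataset objects the operator reads
            outputs=[],  # fill with Dataset objects the operator writes
        )```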

✅ Sheeri Cabral (Collibra)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-12 16:52:04

*Thread Reply:* Thanks! This is very helpful. Exactly what I needed.

👍 Jakub Dardziński
Tushar Jain (tujain@ivp.in)
2023-04-09 12:48:04

Hi. I was exploring OpenLineage and I want to know whether OpenLineage integrates with MS-SQL (Microsoft SQL Server). If yes, how do I generate OpenLineage events for MS-SQL views/tables/queries?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-12 02:30:19

*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to keep the list of supported databases updated here: https://openlineage.io/docs/integrations/about/

Michael Robinson (michael.robinson@astronomer.io)
2023-04-10 12:00:03

@channel Save the date: the next OpenLineage meetup will be in New York on April 26th! More info is coming soon…

✅ Sheeri Cabral (Collibra), Ross Turk, Minkyu Park
Michael Robinson (michael.robinson@astronomer.io)
2023-04-10 19:00:38

@channel Due to many TSC members being on vacation this week, this month’s TSC meeting will be moved to next Thursday, April 20th. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1680801164289949

✅ Sheeri Cabral (Collibra)
Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-11 13:42:03

Hi everyone!

I'm so sorry for all the messages, but I've been trying to get Snowflake, OpenLineage, and Marquez working for days now. Hopefully this is my last question. The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for Airflow. Does anyone know how to rewrite this code (e.g., with SnowflakeOperator) and extract the OpenLineage access history? You'd be my absolute hero!!!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-11 17:05:35

*Thread Reply:* > The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow. What's the error?

> Does anyone know how to rewrite this code (e.g., with SnowflakeOperator ) Current extractor for SnowflakeOperator extracts lineage for SQL executed in the task, in contrast to the method above with OPENLINEAGE_ACCESS_HISTORY view

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-11 18:13:49

*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the Marquez git and the Snowflake OpenLineage git. The only error I still get is:
*** Log file does not exist: /opt/bitnami/airflow/logs/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log
*** Fetching from: <http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
*** See more at <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key>
*** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url '<http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>'
For more information check: <https://httpstatuses.com/403>
This one doesn't make sense to me. I found a workaround for the ETL examples in the OpenLineage git by manually creating a Snowflake connector in Airflow; however, the error is still present for the extract_openlineage.py file. I noticed this file is the only one that uses snowflake.connector import connect and not airflow.providers.snowflake.operators.snowflake import SnowflakeOperator like the other ETL DAGs.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-12 05:35:41

*Thread Reply:* I think it's Airflow error related to getting logs from worker

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-12 05:36:07

*Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-12 10:15:21

*Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-12 10:19:53

*Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388

I would try running it in Docker TBH

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-12 11:47:41

*Thread Reply:* Yeah, I was running Airflow in Docker but this didn't work. I'll try to use my MacBook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!

Peter Hanssens (peter@cloudshuttle.com.au)
2023-04-13 04:55:41

Hi All, My team and I have been building a status page based on OpenLineage and I did a talk about it… keen for feedback and thoughts: https://youtu.be/nGh5_j3hXrE

👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-13 11:19:57

*Thread Reply:* Very interesting!

Julien Le Dem (julien@apache.org)
2023-04-13 13:28:53

*Thread Reply:* that’s awesome 🙂

Ernie Ostic (ernie.ostic@getmanta.com)
2023-04-13 08:22:50

Hi Peter. Looks good. I like the way you introduced the premise of, and benefits of, using OpenLineage for your project. Have you also explored other integrations in addition to dbt?

Peter Hanssens (peter@cloudshuttle.com.au)
2023-04-13 08:36:01

*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done, and then we want to help build out OpenLineage

❤️ Julien Le Dem, Harel Shein
John Lukenoff (john@jlukenoff.com)
2023-04-13 14:08:38

[Resolved] Has anyone seen this error before where the openlineage-airflow plugin / listener fails to deepcopy the task instance? I’m using the native airflow DAG / BashOperator objects to do a basic test of static lineage tagging. More details in 🧵

John Lukenoff (john@jlukenoff.com)
2023-04-13 14:10:08

*Thread Reply:* The dag is basically just:
```dag = DAG(
    dag_id="asana_example_dag",
    default_args=default_args,
    schedule_interval=None,
)

sample_lineage_task = BashOperator(
    task_id="sample_lineage_task",
    bash_command='echo $OPENLINEAGE_URL',
    dag=dag,
    inlets=[Table(database="redshift", cluster="someschema", name="someinputtable")],
    outlets=[Table(database="redshift", cluster="someotherschema", name="someoutputtable")]
)```

John Lukenoff (john@jlukenoff.com)
2023-04-13 14:11:02

*Thread Reply:* This is the error I’m getting, seems to be coming from this line:
[2023-04-13, 17:45:33 UTC] {logging_mixin.py:115} WARNING - Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/openlineage/airflow/listener.py", line 89, in on_running
    task_instance_copy = copy.deepcopy(task_instance)
  File "/opt/conda/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.7/copy.py", line 281, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
    setattr(result, k, copy.deepcopy(v, memo))
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/dag.py", line 1941, in __deepcopy__
    setattr(result, k, copy.deepcopy(v, memo))
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
    setattr(result, k, copy.deepcopy(v, memo))
  File "/opt/conda/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.7/copy.py", line 281, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 241, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1156, in __deepcopy__
    setattr(result, k, copy.deepcopy(v, memo))
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1000, in __setattr__
    self.set_xcomargs_dependencies()
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 1107, in set_xcomargs_dependencies
    XComArg.apply_upstream_relationship(self, arg)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/xcom_arg.py", line 186, in apply_upstream_relationship
    op.set_upstream(ref.operator)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 241, in set_upstream
    self._set_relatives(task_or_task_list, upstream=True, edge_modifier=edge_modifier)
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 185, in _set_relatives
    dags: Set["DAG"] = {task.dag for task in [*self.roots, *task_list] if task.has_dag() and task.dag}
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/taskmixin.py", line 185, in <setcomp>
    dags: Set["DAG"] = {task.dag for task in [*self.roots, *task_list] if task.has_dag() and task.dag}
  File "/opt/conda/lib/python3.7/site-packages/airflow/models/dag.py", line 508, in __hash__
    val = tuple(self.task_dict.keys())
AttributeError: 'DAG' object has no attribute 'task_dict'

👀 Maciej Obuchowski
John Lukenoff (john@jlukenoff.com)
2023-04-13 14:12:11

*Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0

John Lukenoff (john@jlukenoff.com)
2023-04-13 14:13:34

*Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-14 08:44:36

*Thread Reply:* Just from a quick look at it, this will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.

👍 John Lukenoff
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-14 08:47:16

*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours:
```python
import datetime

from airflow.lineage.entities import Table
from airflow.models import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "start_date": datetime.datetime.now()
}

dag = DAG(
    dag_id="asana_example_dag",
    default_args=default_args,
    schedule_interval=None,
)

sample_lineage_task = BashOperator(
    task_id="sample_lineage_task",
    bash_command='echo $OPENLINEAGE_URL',
    dag=dag,
    inlets=[Table(database="redshift", cluster="some_schema", name="some_input_table")],
    outlets=[Table(database="redshift", cluster="some_other_schema", name="some_output_table")],
)
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-14 08:53:48

*Thread Reply:* is there possibly any extra configuration you made?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-14 13:02:40

*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing an XCom as task.output. Looks like this was reported here and solved by this PR (not sure if that fix was released in 2.3.3 or later)

John Lukenoff (john@jlukenoff.com)
2023-04-14 13:06:59

*Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-14 13:13:21

*Thread Reply:* I ran it against 2.4 and same dag works

John Lukenoff (john@jlukenoff.com)
2023-04-14 13:15:35

*Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-04-14 13:17:06

*Thread Reply:* 👍

John Lukenoff (john@jlukenoff.com)
2023-04-17 12:29:09

*Thread Reply:* Got this working! We just monkey patched the __deepcopy__ method of the BaseOperator for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
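For readers landing here later: a minimal sketch of that kind of patch, assuming (as the traceback above suggests) that the goal is to stop `__setattr__` from re-wiring XComArg dependencies while an operator is being deep-copied. This is not John's actual code, just an illustration of the approach:
```python
from airflow.models.baseoperator import BaseOperator

_original_deepcopy = BaseOperator.__deepcopy__

def _patched_deepcopy(self, memo):
    # Temporarily no-op set_xcomargs_dependencies so that deepcopy does not
    # call set_upstream() on a half-reconstructed DAG. Sketch only: swapping
    # a class attribute globally like this is not thread-safe.
    original = BaseOperator.set_xcomargs_dependencies
    BaseOperator.set_xcomargs_dependencies = lambda self: None
    try:
        return _original_deepcopy(self, memo)
    finally:
        BaseOperator.set_xcomargs_dependencies = original

BaseOperator.__deepcopy__ = _patched_deepcopy
```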

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-17 03:45:47

Hi everyone, I am facing this null pointer error:
```
ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
	java.base/java.util.concurrent.ConcurrentHashMap.putVal(Unknown Source)
	java.base/java.util.concurrent.ConcurrentHashMap.put(Unknown Source)
	io.openlineage.spark.agent.JobMetricsHolder.addMetrics(JobMetricsHolder.java:40)
	io.openlineage.spark.agent.OpenLineageSparkListener.onTaskEnd(OpenLineageSparkListener.java:179)
	org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
	org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```
Could I get some help on this pls 🙇

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-17 03:56:30

*Thread Reply:* This is the spark-submit command:
```
spark-submit --py-files /usr/local/lib/common_utils.zip,/usr/local/lib/team_utils.zip,/usr/local/lib/project_utils.zip \
  --conf spark.executor.cores=16 \
  --conf spark.hadoop.fs.s3a.connection.maximum=100 \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.speculation=true \
  --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=256MB \
  --conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false \
  --conf spark.memory.fraction=0.7 \
  --conf spark.kubernetes.executor.label.experiment=some_label \
  --conf spark.kubernetes.executor.label.team=team_name \
  --conf spark.driver.memory=26112m \
  --conf spark.kubernetes.executor.label.app.kubernetes.io/managed-by=pipeline_name \
  --conf spark.kubernetes.executor.label.instance-type=4xlarge \
  --conf spark.executor.instances=10 \
  --conf spark.kubernetes.executor.label.env=prd \
  --conf spark.kubernetes.executor.label.job-name=job_name \
  --conf spark.kubernetes.executor.label.owner=owner \
  --conf spark.kubernetes.executor.label.pipeline=pipeline \
  --conf spark.kubernetes.executor.label.platform-name=platform_name \
  --conf spark.speculation.multiplier=10 \
  --conf spark.memory.storageFraction=0.4 \
  --conf spark.driver.maxResultSize=26112m \
  --conf spark.kubernetes.executor.request.cores=15000m \
  --conf spark.speculation.interval=1s \
  --conf spark.executor.memory=104g \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.eventLog.dir=file:///logs/spark-events \
  --conf spark.hadoop.fs.s3a.threads.max=100 \
  --conf spark.speculation.quantile=0.75 \
  job.py
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-17 04:09:57

*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided, and this seems to be a bug.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-17 04:14:41

*Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-17 05:13:54

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784 Opened an issue here

Allison Suarez (asuarezmiranda@lyft.com)
2023-04-17 19:32:23

Hey! Question about Spark column lineage. What is the intended way to write custom code for getting column lineage? I am trying to implement CustomColumnLineageVisitor, but when I try to do so I get:
io.openlineage.spark3.agent.lifecycle.plan.column.CustomColumnLineageVisitor is not public in io.openlineage.spark3.agent.lifecycle.plan.column; cannot be accessed from outside package

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-18 02:25:04

*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl), but it's in the same package. Thanks for bringing this up.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-18 03:07:11

*Thread Reply:* This PR should resolve problem: https://github.com/OpenLineage/OpenLineage/pull/1788

Allison Suarez (asuarezmiranda@lyft.com)
2023-04-18 13:34:43

*Thread Reply:* Thank you so much @Paweł Leszczyński 🙂

Allison Suarez (asuarezmiranda@lyft.com)
2023-04-18 13:35:46

*Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?

Allison Suarez (asuarezmiranda@lyft.com)
2023-04-18 17:34:29

*Thread Reply:* @Maciej Obuchowski ^

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-19 01:49:33

*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before then.

Michael Robinson (michael.robinson@astronomer.io)
2023-04-19 09:08:58

*Thread Reply:* Hi Allison 👋, Anyone can request a release in the #general channel. I encourage you to go this route. You’ll need three +1s (there’s more info about the process here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md), but I don’t know of any reasons why we can’t do a mid-cycle release. 🙂

🙏 Allison Suarez
Allison Suarez (asuarezmiranda@lyft.com)
2023-04-19 16:23:20

*Thread Reply:* seems like we got enough +1s

Michael Robinson (michael.robinson@astronomer.io)
2023-04-19 16:24:33

*Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third

🙌 Allison Suarez
Allison Suarez (asuarezmiranda@lyft.com)
2023-04-19 16:24:55

*Thread Reply:* oooh

Michael Robinson (michael.robinson@astronomer.io)
2023-04-19 16:32:47

*Thread Reply:* Yeah, sorry I forgot to mention that!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-20 05:02:46

*Thread Reply:* we have it now

Michael Robinson (michael.robinson@astronomer.io)
2023-04-19 09:52:02

@channel This month’s TSC meeting is tomorrow, 4/20, at 10 am PT: https://openlineage.slack.com/archives/C01CK9T7HKR/p1681167638153879

Allison Suarez (asuarezmiranda@lyft.com)
2023-04-19 13:40:31

I would like to request a 0.22.1 patch release to get the fix for the issue described in this thread out before the next scheduled release.

➕ Michael Robinson, Paweł Leszczyński, Rohit Menon, Maciej Obuchowski, Julien Le Dem, Jakub Dardziński
Michael Robinson (michael.robinson@astronomer.io)
2023-04-20 09:46:06

*Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).

Michael Robinson (michael.robinson@astronomer.io)
2023-04-19 15:19:38

Here are the details about next week’s OpenLineage Meetup at Astronomer’s NY offices: https://openlineage.io/blog/nyc-meetup. Hope to see you there if you can make it!

👍 Ernie Ostic
Sai (saivenkatesh161@gmail.com)
2023-04-20 07:38:55

Hi Team, I tried integrating OpenLineage with Spark on Databricks and followed the steps as per the documentation. The installation looks good and the listener is enabled, but no event is getting passed to Marquez. I can see the below message in the log4j logs. Am I missing any configuration?

Running few spark commands in databricks notebook to create events.

```
23/04/20 11:10:34 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 11:10:34 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 08:57:45

*Thread Reply:* Hi Sai,

Perhaps you could try printing the OpenLineage events into the logs. This can be achieved with the Spark config parameter spark.openlineage.transport.type set to console.

This can help you determine whether the problem is in generating the OpenLineage events themselves or in emitting them to Marquez.
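A minimal PySpark sketch of that setting (the two `spark.openlineage` keys are the documented ones; the app name and package version are illustrative):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol_console_debug")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Print each OpenLineage event to the driver log instead of sending it anywhere:
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```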

Sai (saivenkatesh161@gmail.com)
2023-04-20 09:18:53

*Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes, like below:
```
23/04/20 10:00:15 INFO ConsoleTransport: {"eventType":"START","eventTime":"2023-04-20T10:00:15.085Z","run":{"runId":"ef4f46d1-d13a-420a-87c3-19fbf6ffa231","facets":{"spark.logicalPlan":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.22.0/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"ignoreIfExists":false},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"workorderid","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-cl
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:19:37

*Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:20:33

*Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:21:10

*Thread Reply:* please note that spark.openlineage.transport.url has to be used, which is different from what you have in the screenshot attached

Sai (saivenkatesh161@gmail.com)
2023-04-20 09:22:40

*Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:23:04

*Thread Reply:* yes, please give it a try

Sai (saivenkatesh161@gmail.com)
2023-04-20 09:23:40

*Thread Reply:* sure will give a try and let you know the outcome

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:23:48

*Thread Reply:* and set spark.openlineage.transport.type to http

Sai (saivenkatesh161@gmail.com)
2023-04-20 09:24:04

*Thread Reply:* okay

Sai (saivenkatesh161@gmail.com)
2023-04-20 09:26:42

*Thread Reply:* Do these configs suffice or do I need to add anything else?

```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.consoleTransport true
spark.openlineage.version v1
spark.openlineage.transport.type http
spark.openlineage.transport.url http://<host>:5000/api/v1/namespaces/sparkintegrationpoc/
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:27:07

*Thread Reply:* spark.openlineage.consoleTransport true - this one can be removed

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-20 09:27:33

*Thread Reply:* otherwise shall be OK

Sai (saivenkatesh161@gmail.com)
2023-04-20 10:01:30

*Thread Reply:* I added these configs and ran again, but still the same issue. Now I am not able to see the events in the log file either.

Sai (saivenkatesh161@gmail.com)
2023-04-20 10:04:27

*Thread Reply:*
```
23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
```

Does this need any changes on the config side?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-04-20 13:02:23

If you are trying to get into the OpenLineage Technical Steering Committee meeting, you have to RSVP to the specific event at https://www.addevent.com/calendar/pP575215 to get the password (in the invitation to add to your calendar)

🙌 Michael Robinson
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-20 13:53:31

Here is a nice article I found online that briefly explains the Spark catalogs, just for some context: https://www.waitingforcode.com/apache-spark-sql/pluggable-catalog-api/read This is in reference to the V2SessionCatalog use case brought up in the meeting just now

🙌 Michael Robinson, Maciej Obuchowski, Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-24 06:49:43

*Thread Reply:* @Anirudh Shrinivason Thanks for linking this, as it contains a clear explanation of Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark code that fails on V2SessionCatalog, and more details on how you are trying to read/write data?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-24 07:14:04

*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before, actually. Unfortunately, I didn't note down which pipeline the issue was occurring in. I'll keep checking from my end to identify the Spark job that ran into this error. In the meantime, I'll also try to see in which cases deltaCatalog makes use of the V2SessionCatalog to understand this better. Thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-26 03:44:15

*Thread Reply:* Hi @Paweł Leszczyński, a Spark SQL statement like this actually triggers the V2SessionCatalog:
```sql
CREATE TABLE IF NOT EXISTS TABLE_NAME (
    SOME COLUMNS
)
USING delta
PARTITIONED BY (col)
LOCATION 's3 location'
```

❤️ Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-26 03:44:48

*Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-26 05:06:05

*Thread Reply:* which spark & delta versions are you using?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-27 02:35:50

*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is same on your side. https://github.com/OpenLineage/OpenLineage/pull/1798

:gratitude_thank_you: Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:36:20

*Thread Reply:* Hi

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:36:45

*Thread Reply:* Hmm actually I am noticing this error on my local

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:37:01

*Thread Reply:* But on the prod job, I am seeing no such error in the logs...

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:37:28

*Thread Reply:* Also, I was using spark 3.1.2

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-27 02:37:39

*Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2

:gratitude_thank_you: Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:37:42

*Thread Reply:* Not too sure which delta version the prod job was using...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-28 03:30:49

*Thread Reply:* I was running the following command on Spark 3.1.2:
```java
spark.sql(
    "CREATE TABLE t_partitioned (a int, b int) USING delta "
        + "PARTITIONED BY (a) LOCATION '/tmp/delta/tbl'");
```
and I got an OpenLineage event emitted with a t_partitioned output dataset.

:gratitude_thank_you: Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 03:31:47

*Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-28 03:33:01

*Thread Reply:* for spark 3.1, we're using delta 1.0.0

👀 Anirudh Shrinivason
Cory Visi (cvisi@amazon.com)
2023-04-20 14:41:23

Hi team! I have two Spark jobs chained together to process incoming data files, and I'm using openlineage-spark-0.22.0 with Marquez to visualize. I'm struggling to figure out the best way to use spark.openlineage.parentRunId and spark.openlineage.parentJobName. Should these values be unique for each Spark job? Should they be unique for each execution of the chain of both spark jobs? Or should they be the same for all runs? I'm setting them to be unique to the execution of the chain and I'm getting strange results (jobs are not showing completed, and not showing at all)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-24 05:38:09

*Thread Reply:* Hi Cory, I think the definition of ParentRunFacet (https://openlineage.io/docs/spec/facets/run-facets/parent_run) contains the answer to that: Commonly, scheduler systems like Apache Airflow will trigger processes on remote systems, such as on Apache Spark or Apache Beam jobs. Those systems might have their own OpenLineage integration and report their own job runs and dataset inputs/outputs. The ParentRunFacet allows those downstream jobs to report which jobs spawned them to preserve job hierarchy. To do that, the scheduler system should have a way to pass its own job and run id to the child job. For example, when Airflow is used to run a Spark job, we want the Spark events to contain some information on what triggered the Spark job, and the parameters you ask about are used to pass that information from the Airflow operator to the Spark job.
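A sketch of how those two parameters might be passed to the Spark session (the parameter names come from the integration; the job name and run id values are hypothetical placeholders):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Identify the scheduler run that spawned this Spark job:
    .config("spark.openlineage.parentJobName", "my_dag.my_task")         # hypothetical
    .config("spark.openlineage.parentRunId", "<run-id-from-scheduler>")  # passed down by the scheduler
    .getOrCreate()
)
```
The takeaway from the rest of this thread is that the pair should identify the scheduling run, so it stays constant across the START and COMPLETE events of the Spark run it spawned.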

Cory Visi (cvisi@amazon.com)
2023-04-26 17:28:39

*Thread Reply:* Thank you for pointing me at this documentation; I did not see it previously. In my setup, the calling system is AWS Step Functions, which have no integration with OpenLineage.

So I've been essentially passing non-existing parent job information to OpenLineage. It has been useful as a data point for searches and reporting though.

Is there any harm in doing what I am doing? Is that what's causing the jobs I see that never complete?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-27 04:59:39

*Thread Reply:* I think parentRunId should be the same for the OpenLineage START and COMPLETE events. Is that the case for you?

Cory Visi (cvisi@amazon.com)
2023-05-03 11:13:58

*Thread Reply:* That makes sense, and based on my configuration, I would think that it would be. However, given that I am seeing incomplete jobs in Marquez, I'm wondering if somehow the parentRunId is changing. I need to investigate

Michael Robinson (michael.robinson@astronomer.io)
2023-04-20 15:44:39

@channel We released OpenLineage 0.23.0, including: Additions: • SQL: parser improvements to support: copy into, create stage, pivot #1742 @pawel-big-lebowski • dbt: add support for snapshots #1787 @JDarDagran Changes: • Spark: change custom column lineage visitors #1788 @pawel-big-lebowski Plus bug fixes, doc changes and more. Thanks to all the contributors! For the bug fixes and details, see: Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.23.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.22.0...0.23.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

🎉 Harel Shein, Maciej Obuchowski, Anirudh Shrinivason, Kengo Seki, Paweł Leszczyński, Perttu Salonen
👍 Cory Visi, Maciej Obuchowski, Anirudh Shrinivason, Kengo Seki
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-21 05:07:30

Just curious, how long before we can see 0.23.0 over here: https://mvnrepository.com/artifact/io.openlineage/openlineage-spark

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-21 09:06:06

*Thread Reply:* I think @Michael Robinson has to manually promote artifacts

Michael Robinson (michael.robinson@astronomer.io)
2023-04-21 09:08:06

*Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple of releases ago, the delay was about 24 hours long

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-21 09:26:09

*Thread Reply:* Ahh I see... Thanks!

Michael Robinson (michael.robinson@astronomer.io)
2023-04-21 10:10:38

*Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-21 10:15:00

*Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...

Michael Robinson (michael.robinson@astronomer.io)
2023-04-21 10:19:38

*Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-22 06:11:10

*Thread Reply:* Yup. I can see it on all maven repos now haha. I think its just the delay.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-22 06:11:18

*Thread Reply:* ~24 hours ig

Michael Robinson (michael.robinson@astronomer.io)
2023-04-24 16:49:15

*Thread Reply:* 👍

John Doe (adarsh.pansari@tigeranalytics.com)
2023-04-21 08:49:54

Hello Everyone, I am facing an issue while trying to integrate OpenLineage with Jupyter Notebook. I am following the docs. My containers are running and I am getting the URL for the Jupyter notebook, but when I try the token in the terminal, I get an invalid credentials error. Can someone please help resolve this? Am I doing something wrong?

John Doe (adarsh.pansari@tigeranalytics.com)
2023-04-21 09:28:18

*Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.

👍 Maciej Obuchowski, Anirudh Shrinivason, Michael Robinson, Paweł Leszczyński
Natalie Zeller (natalie.zeller@naturalint.com)
2023-04-23 23:52:34

Hi team, I have a question regarding the customization of transport types in OpenLineage. At my company, we are using OpenLineage to report lineage from our Spark jobs to OpenMetadata. We have created a custom OpenMetadataTransport to send lineage to the OpenMetadata APIs, conforming to the OpenMetadata format. Currently, we are using a fork of OpenLineage, as we needed to make some changes in the core to identify the new TransportConfig. We believe it would be more optimal for OpenLineage to support custom transport types, which would allow us to use OpenLineage JAR alongside our own JAR containing the custom transport. I noticed some comments in the code suggesting that customizations are possible. However, I couldn't make it work without modifying the TransportFactory and the TransportConfig interface, as the transport types are hardcoded. Am I missing something? 🤔 If custom transport types are not currently supported, we would be more than happy to contribute a PR that enables custom transports. What are your thoughts on this?

❤️ Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-24 02:32:51

*Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory. What other transport type did you have in mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow the ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.

Natalie Zeller (natalie.zeller@naturalint.com)
2023-04-24 02:54:40

*Thread Reply:* Hi @Paweł Leszczyński Yes, I was planning to change TransportFactory to support custom/generic transport types using ServiceLoader pattern. After this change is done, I will be able to use our custom OpenMetadataTransport without changing anything in OpenLineage core. For now I don't have other types in mind, but after we'll add the customization support anyone will be able to create their own transport type and report the lineage to different backends

👍 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-24 03:28:30

*Thread Reply:* Perhaps it's not strictly related to this particular use case, but you may also find our recent PoC on Fluentd & OpenLineage integration interesting. This will bring some cool backend features like copying an event and sending it to multiple backends, sending it to backends supported by Fluentd output plugins, etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935

Natalie Zeller (natalie.zeller@naturalint.com)
2023-04-24 05:36:24

*Thread Reply:* Sounds interesting. Thanks, I will look into it

Michael Robinson (michael.robinson@astronomer.io)
2023-04-24 16:37:33

Are you planning to come to the first New York OpenLineage Meetup this Wednesday at Astronomer’s offices in the Flatiron District? Don’t forget to RSVP so we know how much food and drink to order!

Sudhar Balaji (sudharshan.dataaces@gmail.com)
2023-04-25 03:20:57

Hi, I'm new to OpenLineage and data lineage in general. I'm trying to connect a Snowflake database to Marquez using Airflow, but I'm getting an error in etl_openlineage while running the Airflow DAG on a local Ubuntu environment, and I'm unable to see anything in the Marquez UI even though etl_openlineage completes successfully.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-25 08:07:36

*Thread Reply:* What's the extract_openlineage.py file? Looks like your code?

Sudhar Balaji (sudharshan.dataaces@gmail.com)
2023-04-25 08:43:04

*Thread Reply:*
```python
import json
import os
from pendulum import datetime

from airflow import DAG
from airflow.decorators import task
from openlineage.client import OpenLineageClient
from snowflake.connector import connect

SNOWFLAKE_USER = os.getenv('SNOWFLAKE_USER')
SNOWFLAKE_PASSWORD = os.getenv('SNOWFLAKE_PASSWORD')
SNOWFLAKE_ACCOUNT = os.getenv('SNOWFLAKE_ACCOUNT')
# Referenced in the warehouse statement below; assumed to be read from the
# environment like the variables above.
SNOWFLAKE_WAREHOUSE = os.getenv('SNOWFLAKE_WAREHOUSE')


@task
def send_ol_events():
    client = OpenLineageClient.from_environment()

    with connect(
        user=SNOWFLAKE_USER,
        password=SNOWFLAKE_PASSWORD,
        account=SNOWFLAKE_ACCOUNT,
        database='OPENLINEAGE',
        schema='PUBLIC',
    ) as conn:
        with conn.cursor() as cursor:
            ol_view = 'OPENLINEAGE_ACCESS_HISTORY'
            ol_event_time_tag = 'OL_LATEST_EVENT_TIME'

            var_query = f'''
                use warehouse {SNOWFLAKE_WAREHOUSE};
            '''

            cursor.execute(var_query)

            var_query = f'''
                set current_organization='{SNOWFLAKE_ACCOUNT}';
            '''

            cursor.execute(var_query)

            ol_query = f'''
                SELECT * FROM {ol_view}
                WHERE EVENT:eventTime > system$get_tag('{ol_event_time_tag}', '{ol_view}', 'table')
                ORDER BY EVENT:eventTime ASC;
            '''

            cursor.execute(ol_query)
            ol_events = [json.loads(ol_event[0]) for ol_event in cursor.fetchall()]

            for ol_event in ol_events:
                client.emit(ol_event)

            if len(ol_events) > 0:
                latest_event_time = ol_events[-1]['eventTime']
                cursor.execute(f'''
                    ALTER VIEW {ol_view} SET TAG {ol_event_time_tag} = '{latest_event_time}';
                ''')


with DAG(
    'etl_openlineage',
    start_date=datetime(2022, 4, 12),
    schedule_interval='@hourly',
    catchup=False,
    default_args={
        'owner': 'openlineage',
        'depends_on_past': False,
        'email_on_failure': False,
        'email_on_retry': False,
        'email': ['demo@openlineage.io'],
        'snowflake_conn_id': 'openlineage_snowflake'
    },
    description='Send OL events every minutes.',
    tags=["extract"],
) as dag:
    send_ol_events()
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-25 09:52:33

*Thread Reply:* OpenLineageClient expects RunEvent classes and you're sending it raw JSON. I think at this point your options are either sending the events by constructing your own HTTP client, using something like requests, or using something like https://github.com/python-attrs/cattrs to structure the JSON into a RunEvent
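A hedged sketch of the cattrs route (it assumes the event dicts structure cleanly into the client's attrs classes, which is untested here and may need a configured converter for nested facets):
```python
import cattrs
from openlineage.client.run import RunEvent

# ol_event is one of the dicts fetched from the Snowflake view above
event = cattrs.structure(ol_event, RunEvent)
client.emit(event)
```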

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-25 10:05:57

*Thread Reply:* @Jakub Dardziński suggested that you can change client.emit(ol_event) to client.transport.emit(ol_event) and it should work

👍 Ross Turk, Sudhar Balaji
Ross Turk (ross@rossturk.com)
2023-04-25 12:24:08

*Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py

Ross Turk (ross@rossturk.com)
2023-04-25 12:25:26

*Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn’t use airflow.

👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-26 08:34:02

*Thread Reply:* I think separating the actual getting data from the view and Airflow DAG would make sense

Ross Turk (ross@rossturk.com)
2023-04-26 13:57:34

*Thread Reply:* Yeah - I also think that Airflow confuses the issue. You don’t need Airflow to get lineage from Snowflake Access History, the only reason Airflow is in the example is a) to simulate a pipeline that can be viewed in Marquez; b) to establish a mechanism that regularly pulls and emits lineage…

but most people will already have A, and the simplest example doesn’t need to accomplish B.

Ross Turk (ross@rossturk.com)
2023-04-26 13:58:59

*Thread Reply:* just a few weeks ago 🙂 I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-27 11:13:58

*Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue 🙂

Ross Turk (ross@rossturk.com)
2023-04-27 11:47:20

*Thread Reply:* No, it never became functional before I stopped to take on another task 😕

Sudhar Balaji (sudharshan.dataaces@gmail.com)
2023-04-25 07:47:57

Hi, currently, in the .env file, we are using OPENLINEAGE_URL as <http://marquez-api:5000> and got the error:
requests.exceptions.HTTPError: 422 Client Error: for url: <http://marquez-api:5000/api/v1/lineage>
We have also tried OPENLINEAGE_URL as <http://localhost:5000> and get:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc71edb9590>: Failed to establish a new connection: [Errno 111] Connection refused'))
I'm not sure which value to use for OPENLINEAGE_URL, so please advise which is correct.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-25 09:54:07

*Thread Reply:* Looks like the first URL is proper, but there's something wrong with entity - Marquez logs would help here.

Sudhar Balaji (sudharshan.dataaces@gmail.com)
2023-04-25 09:57:36

*Thread Reply:* This is my log in Airflow; can you please provide more info on it?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-25 10:13:37

*Thread Reply:* Airflow log does not tell us why Marquez rejected the event. Marquez logs would be more helpful

Sudhar Balaji (sudharshan.dataaces@gmail.com)
2023-04-26 05:48:08

*Thread Reply:* We investigated the Marquez container logs and were unable to locate the error. Could you please specify which log file to look at for Marquez when connecting Airflow or Snowflake?

Is it correct that the marquez-web log points to <http://api:5000/>?
[HPM] Proxy created: /api/v1 -> <http://api:5000/>
App listening on port 3000!

👀 Maciej Obuchowski
Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-26 11:26:36

*Thread Reply:* I've the same error at the moment but can provide some additional screenshots. The Event data in Snowflake seems fine and the data is being retrieved correctly by the Airflow DAG. However, there seems to be a warning in the Marquez API logs. Hopefully we can troubleshoot this together!

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-26 11:33:35
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-26 13:06:30

*Thread Reply:* Possibly the Python part in between does something weird, like double-JSON-encoding the data? I can imagine it being wrapped in a second, unnecessary JSON object

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-26 13:08:18

*Thread Reply:* I guess the only way to check is to print one of those events - in the form they are sent in the Python part, not Snowflake - and see what they look like. For example, using ConsoleTransport or setting the DEBUG log level in Airflow
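One hedged way to get that DEBUG output in Airflow (the logger name is an assumption about the client's module path):
```python
import logging

# Assumption: the OpenLineage Python client logs each outgoing event at DEBUG
# level under the "openlineage.client" logger hierarchy.
logging.getLogger("openlineage.client").setLevel(logging.DEBUG)
```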

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-26 14:37:32

*Thread Reply:* Here is a log snippet with DEBUG logging enabled on the Snowflake Python connector:

```
[2023-04-26T17:16:55.166+0000] {cursor.py:593} DEBUG - binding: [set current_organization='[PRIVATE]';] with input=[None], processed=[{}]
[2023-04-26T17:16:55.166+0000] {cursor.py:800} INFO - query: [set current_organization='[PRIVATE]';]
[2023-04-26T17:16:55.166+0000] {connection.py:1363} DEBUG - sequence counter: 2
[2023-04-26T17:16:55.167+0000] {cursor.py:467} DEBUG - Request id: f7bca188-dda0-4fe6-8d5c-a92dc5f9c7ac
[2023-04-26T17:16:55.167+0000] {cursor.py:469} DEBUG - running query [set current_organization='[PRIVATE]';]
[2023-04-26T17:16:55.168+0000] {cursor.py:476} DEBUG - is_file_transfer: True
[2023-04-26T17:16:55.168+0000] {connection.py:1035} DEBUG - _cmd_query
[2023-04-26T17:16:55.168+0000] {connection.py:1062} DEBUG - sql=[set current_organization='[PRIVATE]';], sequence_id=[2], is_file_transfer=[False]
[2023-04-26T17:16:55.168+0000] {network.py:1162} DEBUG - Session status for SessionPool '[PRIVATE]', SessionPool 1/1 active sessions
[2023-04-26T17:16:55.169+0000] {network.py:850} DEBUG - remaining request timeout: None, retry cnt: 1
[2023-04-26T17:16:55.169+0000] {network.py:828} DEBUG - Request guid: 4acea1c3-6a68-4691-9af4-22f184e0f660
[2023-04-26T17:16:55.169+0000] {network.py:1021} DEBUG - socket timeout: 60
[2023-04-26T17:16:55.259+0000] {connectionpool.py:465} DEBUG - [PRIVATE] "POST /queries/v1/query-request?requestId=f7bca188-dda0-4fe6-8d5c-a92dc5f9c7ac&request_guid=4acea1c3-6a68-4691-9af4-22f184e0f660 HTTP/1.1" 200 1118
[2023-04-26T17:16:55.261+0000] {network.py:1047} DEBUG - SUCCESS
[2023-04-26T17:16:55.261+0000] {network.py:1168} DEBUG - Session status for SessionPool '[PRIVATE]', SessionPool 0/1 active sessions
[2023-04-26T17:16:55.261+0000] {network.py:729} DEBUG - ret[code] = None, after post request
[2023-04-26T17:16:55.261+0000] {network.py:751} DEBUG - Query id: 01abe3ac-0603-4df4-0042-c78307975eb2
[2023-04-26T17:16:55.262+0000] {cursor.py:807} DEBUG - sfqid: 01abe3ac-0603-4df4-0042-c78307975eb2
[2023-04-26T17:16:55.262+0000] {cursor.py:813} INFO - query execution done
[2023-04-26T17:16:55.262+0000] {cursor.py:827} DEBUG - SUCCESS
[2023-04-26T17:16:55.262+0000] {cursor.py:846} DEBUG - PUT OR GET: False
[2023-04-26T17:16:55.263+0000] {cursor.py:941} DEBUG - Query result format: json
[2023-04-26T17:16:55.263+0000] {result_batch.py:433} DEBUG - parsing for result batch id: 1
[2023-04-26T17:16:55.263+0000] {cursor.py:956} INFO - Number of results in first chunk: 1
[2023-04-26T17:16:55.263+0000] {cursor.py:735} DEBUG - executing SQL/command
[2023-04-26T17:16:55.263+0000] {cursor.py:593} DEBUG - binding: [SELECT * FROM OPENLINEAGE_ACCESS_HISTORY WHERE EVENT:eventTime > system$get_tag(...] with input=[None], processed=[{}]
[2023-04-26T17:16:55.264+0000] {cursor.py:800} INFO - query: [SELECT * FROM OPENLINEAGE_ACCESS_HISTORY WHERE EVENT:eventTime > system$get_tag(...]
[2023-04-26T17:16:55.264+0000] {connection.py:1363} DEBUG - sequence counter: 3
[2023-04-26T17:16:55.264+0000] {cursor.py:467} DEBUG - Request id: 21e2ab85-4995-4010-865d-df06cf5ee5b5
[2023-04-26T17:16:55.265+0000] {cursor.py:469} DEBUG - running query [SELECT * FROM OPENLINEAGE_ACCESS_HISTORY WHERE EVENT:eventTime > system$get_tag(...]
[2023-04-26T17:16:55.265+0000] {cursor.py:476} DEBUG - is_file_transfer: True
[2023-04-26T17:16:55.265+0000] {connection.py:1035} DEBUG - _cmd_query
[2023-04-26T17:16:55.265+0000] {connection.py:1062} DEBUG - sql=[SELECT * FROM OPENLINEAGE_ACCESS_HISTORY WHERE EVENT:eventTime > system$get_tag(...], sequence_id=[3], is_file_transfer=[False]
[2023-04-26T17:16:55.266+0000] {network.py:1162} DEBUG - Session status for SessionPool '[PRIVATE]', SessionPool 1/1 active sessions
[2023-04-26T17:16:55.267+0000] {network.py:850} DEBUG - remaining request timeout: None, retry cnt: 1
[2023-04-26T17:16:55.268+0000] {network.py:828} DEBUG - Request guid: aba82952-a5c2-4c6b-9c70-a10545b8772c
[2023-04-26T17:16:55.268+0000] {network.py:1021} DEBUG - socket timeout: 60
[2023-04-26T17:17:21.844+0000] {connectionpool.py:465} DEBUG - [PRIVATE] "POST /queries/v1/query-request?requestId=21e2ab85-4995-4010-865d-df06cf5ee5b5&request_guid=aba82952-a5c2-4c6b-9c70-a10545b8772c HTTP/1.1" 200 None
[2023-04-26T17:17:21.879+0000] {network.py:1047} DEBUG - SUCCESS
[2023-04-26T17:17:21.881+0000] {network.py:1168} DEBUG - Session status for SessionPool '[PRIVATE]', SessionPool 0/1 active sessions
[2023-04-26T17:17:21.882+0000] {network.py:729} DEBUG - ret[code] = None, after post request
[2023-04-26T17:17:21.882+0000] {network.py:751} DEBUG - Query id: 01abe3ac-0603-4df4-0042-c78307975eb6
[2023-04-26T17:17:21.882+0000] {cursor.py:807} DEBUG - sfqid: 01abe3ac-0603-4df4-0042-c78307975eb6
[2023-04-26T17:17:21.882+0000] {cursor.py:813} INFO - query execution done
[2023-04-26T17:17:21.883+0000] {cursor.py:827} DEBUG - SUCCESS
[2023-04-26T17:17:21.883+0000] {cursor.py:846} DEBUG - PUT OR GET: False
[2023-04-26T17:17:21.883+0000] {cursor.py:941} DEBUG - Query result format: arrow
[2023-04-26T17:17:21.903+0000] {result_batch.py:102} DEBUG - chunk size=256
[2023-04-26T17:17:21.920+0000] {cursor.py:956} INFO - Number of results in first chunk: 112
[2023-04-26T17:17:21.949+0000] {arrow_iterator.cpython-37m-x86_64-linux-gnu.so:0} DEBUG - Batches read: 1
[2023-04-26T17:17:21.950+0000] {CArrowIterator.cpp:16} DEBUG - Arrow BatchSize: 1
[2023-04-26T17:17:21.950+0000] {CArrowChunkIterator.cpp:50} DEBUG - Arrow chunk info: batchCount 1, columnCount 1, use_numpy: 0
[2023-04-26T17:17:21.950+0000] {result_set.py:232} DEBUG - result batch 1 has id: data_0_0_1
[2023-04-26T17:17:21.951+0000] {result_set.py:232} DEBUG - result batch 2 has id: data_0_0_2
[2023-04-26T17:17:21.951+0000] {result_set.py:232} DEBUG - result batch 3 has id: data_0_0_3
[2023-04-26T17:17:21.951+0000] {result_set.py:232} DEBUG - result batch 4 has id: data_0_1_0
[2023-04-26T17:17:21.951+0000] {result_set.py:232} DEBUG - result batch 5 has id: data_0_1_1
[2023-04-26T17:17:21.952+0000] {result_set.py:232} DEBUG - result batch 6 has id: data_0_1_2
[2023-04-26T17:17:21.952+0000] {result_set.py:232} DEBUG - result batch 7 has id: data_0_1_3
[2023-04-26T17:17:21.952+0000] {result_set.py:232} DEBUG - result batch 8 has id: data_0_2_0
[2023-04-26T17:17:21.952+0000] {result_set.py:232} DEBUG - result batch 9 has id: data_0_2_1
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-04-26 14:45:26

*Thread Reply:* I don't see any standard Airflow logs here, but anyway, I looked at it, and debugging would not work if you're bypassing OpenLineageClient.emit and going directly to the transport - the logging is done at the client level: https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-04-27 11:16:20

*Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-05-04 10:56:34

*Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue 😞

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-05-05 08:58:49

*Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon

♥️ Tom van Eijk, Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-04-26 11:36:23

The first New York OpenLineage Meetup is happening today at 5:30 pm ET at Astronomer’s offices in the Flatiron District! https://openlineage.slack.com/archives/C01CK9T7HKR/p1681931978353159

Julien Le Dem (julien@apache.org)
2023-04-26 11:36:57

*Thread Reply:* I’ll be there! I’m looking forward to see you all.

Julien Le Dem (julien@apache.org)
2023-04-26 11:37:23

*Thread Reply:* We’ll talk about the evolution of the spec.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-27 02:55:00

```python
delta_table = DeltaTable.forPath(spark, path)
delta_table.alias("source") \
    .merge(df.alias("update"), lookup_statement) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
```
If I write based on df operations like this, I notice that OL does not emit any event. May I know whether these or similar cases can be supported too? 🙇

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-28 04:23:24

*Thread Reply:* I've created an integration test based on your example. The OpenLineage event gets sent; however, it does not contain the output dataset. I will look deeper into that.

:gratitude_thank_you: Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:55:43

*Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:55:51

*Thread Reply:* I am seeing that input dataset is empty

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-28 08:56:05

*Thread Reply:* ooh, I see input datasets

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:56:11

*Thread Reply:* Hmm

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:56:12

*Thread Reply:* I see

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-04-28 08:57:07

*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:
```java
@Test
void testDeltaMergeInto() {
  Dataset<Row> dataset =
      spark
          .createDataFrame(
              ImmutableList.of(
                  RowFactory.create(1L, "bat"),
                  RowFactory.create(2L, "mouse"),
                  RowFactory.create(3L, "horse")),
              new StructType(
                  new StructField[] {
                    new StructField("a", LongType$.MODULE$, false, Metadata.empty()),
                    new StructField("b", StringType$.MODULE$, false, Metadata.empty())
                  }))
          .repartition(1);
  dataset.createOrReplaceTempView("temp");

  spark.sql("CREATE TABLE t1 USING delta LOCATION '/tmp/delta/t1' AS SELECT * FROM temp");
  spark.sql("CREATE TABLE t2 USING delta LOCATION '/tmp/delta/t2' AS SELECT * FROM temp");

  DeltaTable.forName("t1")
      .merge(spark.read().table("t2"), "t1.a = t2.a")
      .whenMatched()
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute();

  verifyEvents(mockServer, "pysparkDeltaMergeIntoCompleteEvent.json");
}
```

👍 Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:59:14

*Thread Reply:* Oh yeah my bad. I am seeing output dataset is empty.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-04-28 08:59:21

*Thread Reply:* Checks out with your observation

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-03 23:23:36

*Thread Reply:* Hi @Paweł Leszczyński just curious, has a fix for this been implemented alr?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-04 02:40:11

*Thread Reply:* Hi @Anirudh Shrinivason, I had some days ooo. I will look into this soon.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-04 07:37:52

*Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-04 07:38:38

*Thread Reply:* yeah. this was an amazing extended weekend 😉

🎉 Anirudh Shrinivason
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-05 02:09:10

*Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-05 02:43:24

*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-05 02:49:38

*Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week thanks! 🙇

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-05 09:35:00

*Thread Reply:* Hi @Paweł Leszczyński I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-05 09:40:33

*Thread Reply:* sounds cool. we can surely create a new issue later on.

👍 Anirudh Shrinivason
Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-09 23:34:04

*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in Databricks. I was wondering which Java file I should use for building the jar file? Could you please help me?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-09 00:46:34

*Thread Reply:* .

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-09 02:37:49

*Thread Reply:* Hi, I found that these merge operations have no input datasets/column lineage:
```python
df.write.format(file_format).mode(mode) \
    .option("mergeSchema", merge_schema) \
    .option("overwriteSchema", overwriteSchema) \
    .save(path)

df.write.format(file_format).mode(mode) \
    .option("mergeSchema", merge_schema) \
    .option("overwriteSchema", overwriteSchema) \
    .partitionBy(*partitions) \
    .save(path)

df.write.format(file_format).mode(mode) \
    .option("mergeSchema", merge_schema) \
    .option("overwriteSchema", overwriteSchema) \
    .partitionBy(*partitions) \
    .option("replaceWhere", where_clause) \
    .save(path)
```
I also noticed the same issue when using the MERGE INTO command from Spark SQL. Would it be possible to extend the support to these DataFrame operations too, please? Thanks! CC: @Paweł Leszczyński

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-09 02:41:24

*Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.

👍 Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-09 02:49:56

*Thread Reply:* Sure let me create an issue! Thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-09 02:55:21

*Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919 Thanks! 🙇

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-15 10:39:50

*Thread Reply:* Hi @Paweł Leszczyński I just realised that this PR doesn't actually capture column lineage for the MergeIntoCommand: https://github.com/OpenLineage/OpenLineage/pull/1823/files It looks like there is no column lineage field in the event JSON.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-17 04:21:24

*Thread Reply:* Hi @Paweł Leszczyński Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!

Michael Robinson (michael.robinson@astronomer.io)
2023-04-28 14:11:34

Thanks to everyone who came out to Wednesday night’s meetup in New York! In addition to great pizza from Grimaldi’s (thanks for the tip, @Harel Shein), we enjoyed a spirited discussion of: • the state of observability tooling in the data space today • the history and high-level architecture of the project courtesy of @Julien Le Dem • exciting news of an OpenLineage Scanner being planned at MANTA courtesy of @Ernie Ostic • updates on the project roadmap and some exciting proposals from @Julien Le Dem, @Harel Shein and @Willy Lulciuc • an introduction to and demo of Marquez from project lead @Willy Lulciuc • and more. Be on the lookout for an announcement about the next meetup!

❤️ Harel Shein, Maciej Obuchowski, Peter Hicks, Jakub Dardziński, Atif Tahir
Michael Robinson (michael.robinson@astronomer.io)
2023-04-28 16:02:22

As discussed during the April TSC meeting, comments are sought from the community on a proposal to support RunEvent-less (AKA static) lineage metadata emission. This is currently a WIP. For details and to comment, please see: • https://docs.google.com/document/d/1366bAPkk0OqKkNA4mFFt-41X0cFUQ6sOvhSWmh4Iydo/edit?usp=sharinghttps://docs.google.com/document/d/1gKJw3ITJHArTlE-Iinb4PLkm88moORR0xW7I7hKZIQA/edit?usp=sharing

Ernie Ostic (ernie.ostic@getmanta.com)
2023-04-30 21:35:47

Hi all. Probably I just need to study the spec further, but what is the significance of _producer vs producer in the context of where they are used? (same question also for _schemaURL vs schemaURL)? Thx!

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 12:02:13

*Thread Reply:* “producer” is an element of the run event itself - e.g. what produced the JSON packet you’re studying. There is only one of these per run event. You can think of it as a top-level property.

“_producer” (and “_schemaURL”) are elements of a facet. They are the 2 required elements for any customized facet (though I don’t agree they should be required, or at least I believe they should be able to be compatible with a blank value and a null value).

A packet sent to an API should only have one “producer” element, but can have many _producer elements in sub-objects (though, only one _producer per facet).
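An illustrative event skeleton showing the distinction (the facet name and values are made up, not from the spec):
```json
{
  "eventType": "START",
  "producer": "https://github.com/my-org/my-scheduler",
  "run": {
    "runId": "00000000-0000-0000-0000-000000000000",
    "facets": {
      "myCustomFacet": {
        "_producer": "https://github.com/my-org/my-facet-library",
        "_schemaURL": "https://example.com/schemas/MyCustomFacet.json",
        "someValue": 42
      }
    }
  }
}
```
Here the top-level producer describes what emitted the whole event, while each facet carries its own _producer/_schemaURL pair.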

Ernie Ostic (ernie.ostic@getmanta.com)
2023-05-01 12:06:52

*Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 13:13:28

*Thread Reply:* The facet “BaseFacet” that’s used for customization, has 2 required elements - _producer and _schemaURL. so I don’t believe it’s related to qualification.

👍 Ernie Ostic
Michael Robinson (michael.robinson@astronomer.io)
2023-05-01 11:33:02

I’m opening a vote to release OpenLineage 0.24.0, including: • a new OpenLineage extractor for dbt Cloud • a new interface - TransportBuilder - for creating custom transport types without modifying core components of OpenLineage • a fix to the LogicalPlanSerializer in the Spark integration to make it operational again • a new configuration parameter in the Spark integration for making dataset paths less verbose • a fix to the Flink integration CI • and more. Three +1s from committers will authorize an immediate release.

➕ Jakub Dardziński, Willy Lulciuc, Julien Le Dem
✅ Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2023-05-02 19:43:12

*Thread Reply:* Thanks for voting. The release will commence within 2 days.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 12:03:19

Does the Spark integration for OpenLineage also support ETL that uses the Apache Spark Structured Streaming framework?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-04 02:33:32

*Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py

The test reads and writes data to Kafka and verifies if input/output datasets are collected.
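For reference, a rough PySpark sketch of that shape of job (the topics, server address, and checkpoint path are made up; given a session with the OpenLineage listener configured, the Kafka topics should show up as input/output datasets):
```python
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input_topic")
    .load()
)

(
    df.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output_topic")
    .option("checkpointLocation", "/tmp/checkpoints")
    .start()
)
```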

✅ Sheeri Cabral (Collibra)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 13:14:14

Also, does it work for pyspark jobs? (Forgive me if Spark job = pyspark, I don’t have a lot of depth on how Spark works.)

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-01 22:37:25

*Thread Reply:* From my experience, yeah it works for pyspark

🙌 Paweł Leszczyński, Sheeri Cabral (Collibra)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-01 13:35:41

(And a less generic question: would it work on top of this Spline agent/lineage harvester, or is it a replacement for it?)

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-01 22:39:18

*Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?

✅ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-08 09:46:15

*Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though

👍 Anirudh Shrinivason
✅ Sheeri Cabral (Collibra)
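
A sketch of that configuration (the OpenLineage class name is the real one; the second listener class is a placeholder for whatever else you run):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("multi_listener_demo")
        # both listeners receive the same Spark scheduler events
        .config("spark.extraListeners",
                "io.openlineage.spark.agent.OpenLineageSparkListener,"
                "com.example.MyOtherSparkListener")
        .getOrCreate())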
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-25 11:28:41

*Thread Reply:* (I never said thank you for this, so, thank you!)

Sai (saivenkatesh161@gmail.com)
2023-05-02 04:03:40

Hi Team,

I have configured Open lineage with databricks and it is sending events to Marquez as expected. I have a notebook which joins 3 tables and writes the result data frame to an azure adls location. Each time I run the notebook manually, it creates two start events and two complete events for one run as shown in the screenshot. Is this something expected or am I missing something?

Michael Robinson (michael.robinson@astronomer.io)
2023-05-02 10:45:37

*Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-04 02:44:46

*Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?

Sai (saivenkatesh161@gmail.com)
2023-05-05 01:16:20

*Thread Reply:* Yes @Paweł Leszczyński. Each join query I run on top of delta tables has two start and two complete events. We are using the below jar for openlineage.

openlineage-spark-0.22.0.jar

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-05 02:41:26

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828

Sai (saivenkatesh161@gmail.com)
2023-05-08 04:05:26

*Thread Reply:* Hi @Paweł Leszczyński any updates on this issue?

Also, OL is not giving column level lineage for group by operations on tables. Is this expected?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-08 04:07:04

*Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue

Sai (saivenkatesh161@gmail.com)
2023-05-08 04:08:06

*Thread Reply:* this would be part of next release?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-08 04:08:30

*Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-08 04:09:24

*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release

Sai (saivenkatesh161@gmail.com)
2023-05-08 04:11:01

*Thread Reply:* sure.. thanks @Paweł Leszczyński

Sai (saivenkatesh161@gmail.com)
2023-05-16 03:27:01

*Thread Reply:* @Paweł Leszczyński I have used the latest jar (0.25.0) and still this issue persists. I see two events for the same input/output lineage.

Thomas (xsist10@gmail.com)
2023-05-03 03:55:44

Has anyone used Open Lineage for application lineage? I'm particularly interested in how if/how you handled service boundaries like APIs and Kafka topics and what Dataset Naming (URI) you used.

Thomas (xsist10@gmail.com)
2023-05-03 04:06:37

*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table as something like <mysql://db.foo.com:6543/metrics.orders> For an API (especially one following REST conventions), I was thinking something like method + host + port + path or GET <https://api.service.com:443/v1/users>
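
A tiny helper sketching the convention Thomas proposes here (purely illustrative, not part of the OpenLineage naming spec):

    def rest_dataset_name(method: str, host: str, port: int, path: str) -> str:
        # e.g. rest_dataset_name("GET", "api.service.com", 443, "/v1/users")
        # -> "GET https://api.service.com:443/v1/users"
        return f"{method} https://{host}:{port}{path}"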

Michael Robinson (michael.robinson@astronomer.io)
2023-05-03 10:13:25

*Thread Reply:* Hi Thomas, thanks for asking about this — it sounds cool! I don’t know of others working on this kind of thing, but I’ve been developing a SQLAlchemy integration and have been experimenting with job naming — which I realize isn’t exactly what you’re working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.

Thomas (xsist10@gmail.com)
2023-05-03 10:58:32

*Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.

Michael Robinson (michael.robinson@astronomer.io)
2023-05-03 10:59:21

*Thread Reply:* Looking forward to hearing more!

Thomas (xsist10@gmail.com)
2023-05-03 11:05:47

*Thread Reply:* On a tangential note, does OpenLineage's column level lineage have support for (I see it can be extended but want to know if someone had to map this before):
• Properties as a path in a structure (like a JSON structure, Avro schema, protobuf, etc) maybe using something like JSON Path or XPath notation.
• Fragments (when a column is a JSON blob, there is an entire sub-structure that needs to be described)
• Transformation description (how an input affects an output. Is it a direct copy of the value or is it part of a formula)

Michael Robinson (michael.robinson@astronomer.io)
2023-05-03 11:22:21

*Thread Reply:* I don’t know, but I’ll ping some folks who might.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-04 03:24:01

*Thread Reply:* Hi @Thomas. Column-lineage support currently does not include json fields. We have included in the specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type like IDENTITY|MASKED. However, those fields aren't filled within the Spark integration at the moment.

🙌 Thomas, Michael Robinson
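
For reference, a sketch of the shape those fields take in the column lineage facet, written as a Python dict (dataset and field names are made up; as noted above, the Spark integration currently leaves both transformation fields unset):

    column_lineage_facet = {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet",
        "fields": {
            "masked_email": {
                "inputFields": [
                    {"namespace": "dbfs", "name": "/mnt/raw/users", "field": "email"}
                ],
                # free-form description of how the inputs map to the output
                "transformationDescription": "sha2(email, 256)",
                "transformationType": "MASKED",
            },
        },
    }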
Michael Robinson (michael.robinson@astronomer.io)
2023-05-03 09:54:57

@channel We released OpenLineage 0.24.0, including:
Additions:
• Support custom transport types #1795 @nataliezeller1
• Airflow: dbt Cloud integration #1418 @howardyoo
• Spark: support dataset name modification using regex #1796 @pawel-big-lebowski
Plus bug fixes and more. Thanks to all the contributors!
For the bug fixes and details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.24.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.23.0...0.24.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🎉 Harel Shein, tati
GreetBot
2023-05-03 10:45:32

@GreetBot has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2023-05-04 11:25:23

@channel This month’s TSC meeting is next Thursday, May 11th, at 10:00 am PT. The tentative agenda will be on the wiki. More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-05 12:11:37

Hello all, noticed that openlineage is not able to give column level lineage if there is a groupby operation on a spark dataframe. Has anyone else faced this issue and have any fixes or workarounds? Apache Spark 3.0.1 and Openlineage version 1 are being used. Also tried on Spark version 3.3.0

Log4j error details follow:

23/05/05 18:09:11 ERROR ColumnLevelLineageUtils: Error when invoking static method 'buildColumnLineageDatasetFacet' for Spark3 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at io.openlineage.spark.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:35) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildOutputDatasets$21(OpenLineageRunEventBuilder.java:424) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:437) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:296) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:279) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:222) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:70) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91) at java.util.Optional.ifPresent(Optional.java:159) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:91) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:82) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:107) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:107) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:98) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1639) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:98) Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.resultId()Lorg/apache/spark/sql/catalyst/expressions/ExprId; at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.traverseExpression(ExpressionDependencyCollector.java:79) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$traverseExpression$4(ExpressionDependencyCollector.java:74) at java.util.Iterator.forEachRemaining(Iterator.java:116) at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.traverseExpression(ExpressionDependencyCollector.java:74) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$null$2(ExpressionDependencyCollector.java:60) at java.util.LinkedList$LLSpliterator.forEachRemaining(LinkedList.java:1235) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.lambda$collect$3(ExpressionDependencyCollector.java:60) at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286) at io.openlineage.spark3.agent.lifecycle.plan.column.ExpressionDependencyCollector.collect(ExpressionDependencyCollector.java:38) at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:70) at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:40) ... 36 more

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-08 07:38:19

*Thread Reply:* Hi @Harshini Devathi, I think this the same as issue: https://github.com/OpenLineage/OpenLineage/issues/1821

Assignees
<a href="https://github.com/pawel-big-lebowski">@pawel-big-lebowski</a>
Labels
integration/spark, integration/databricks
Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-08 19:44:26

*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with databricks? The issue thread says that it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-05 12:22:06

Hello all, is there a way to get lineage in azure synapse analytics with openlineage

Julien Le Dem (julien@apache.org)
2023-05-09 20:17:38

*Thread Reply:* maybe @Will Johnson knows?

Sai (saivenkatesh161@gmail.com)
2023-05-08 07:06:37

Hi Team,

I have a usecase where we are connecting to Azure sql database from databricks to extract, transform and load data to delta tables. I could see the lineage is getting built, but there is no column level lineage though it's a 1:1 mapping from source. Could you please check and update on this.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-09 10:06:02

*Thread Reply:* There are a few possible issues:

  1. The column-level lineage is not implemented for a particular part of the Spark LogicalPlan
  2. Azure SQL or Databricks have their own implementations of some Spark class, which does not exactly match our extractor. We've seen that happen
  3. You're using a SQL JDBC connection with SELECT * - in which case we can't do anything for now, since we don't know the input columns.
  4. Possibly something else 🙂 @Paweł Leszczyński might have an idea

To fully understand the issue, we'd have to see logs, the LogicalPlan of the Spark job, or the job code itself
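
A hedged sketch of case 3 (connection details and table names are made up): naming the columns in the JDBC query gives the integration input fields to map, while SELECT * leaves them unknown:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # column-level lineage has no input columns to work with here
    opaque = (spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=sales")
        .option("query", "SELECT * FROM dbo.orders")
        .load())

    # naming the columns lets the integration map outputs back to inputs
    explicit = (spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=sales")
        .option("query", "SELECT order_id, customer_id, total FROM dbo.orders")
        .load())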
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-10 02:35:32

*Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.

Sai (saivenkatesh161@gmail.com)
2023-05-10 02:59:24

*Thread Reply:* sure Pawel Will share the code I used in sometime

Sai (saivenkatesh161@gmail.com)
2023-05-10 03:37:54

*Thread Reply:* Here is the code we use.

Sai (saivenkatesh161@gmail.com)
2023-05-16 03:23:13

*Thread Reply:* Hi Team, Any updates on this?

Sai (saivenkatesh161@gmail.com)
2023-05-16 03:23:37

*Thread Reply:* I tried with putting a sql query having column names in it, still the lineage didn't show up..

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-09 10:00:39

2023-05-09T13:37:48.526698281Z java.lang.ClassCastException: class org.apache.spark.scheduler.ShuffleMapStage cannot be cast to class java.lang.Boolean (org.apache.spark.scheduler.ShuffleMapStage is in unnamed module of loader 'app'; java.lang.Boolean is in module java.base of loader 'bootstrap')
2023-05-09T13:37:48.526703550Z at scala.runtime.BoxesRunTime.unboxToBoolean(BoxesRunTime.java:87)
2023-05-09T13:37:48.526707874Z at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
2023-05-09T13:37:48.526712381Z at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
2023-05-09T13:37:48.526716848Z at scala.collection.immutable.List.forall(List.scala:91)
2023-05-09T13:37:48.526723183Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.registerJob(OpenLineageRunEventBuilder.java:181)
2023-05-09T13:37:48.526727604Z at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.setActiveJob(SparkSQLExecutionContext.java:152)
2023-05-09T13:37:48.526732292Z at java.base/java.util.Optional.ifPresent(Unknown Source)
2023-05-09T13:37:48.526736352Z at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$10(OpenLineageSparkListener.java:150)
2023-05-09T13:37:48.526740471Z at java.base/java.util.Optional.ifPresent(Unknown Source)
2023-05-09T13:37:48.526744887Z at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:147)
2023-05-09T13:37:48.526750258Z at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
2023-05-09T13:37:48.526753454Z at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
2023-05-09T13:37:48.526756235Z at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
2023-05-09T13:37:48.526759315Z at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
2023-05-09T13:37:48.526762133Z at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
2023-05-09T13:37:48.526764941Z at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
2023-05-09T13:37:48.526767739Z at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
2023-05-09T13:37:48.526776059Z at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
2023-05-09T13:37:48.526778937Z at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
2023-05-09T13:37:48.526781728Z at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
2023-05-09T13:37:48.526786986Z at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
2023-05-09T13:37:48.526789893Z at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
2023-05-09T13:37:48.526792722Z at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
2023-05-09T13:37:48.526795463Z at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Hi, noticing this error message from OL... anyone know why it's happening?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-09 10:02:25

*Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-09 10:03:29

*Thread Reply:* Some example job would also help, or logs/LogicalPlan 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-09 10:05:54

*Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1

👍 Maciej Obuchowski
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-09 11:00:22

*Thread Reply:* Hmm actually, it seems like the error is intermittent actually. I ran the same job again, but did not notice any errors this time...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-10 02:27:19

*Thread Reply:* This is interesting and it happens within a line: job.finalStage().parents().forall(toScalaFn(stage -> stageMap.put(stage.id(), stage))); The result of stageMap.put is Stage and for some reason which I don't understand it tries doing unboxToBoolean. We could rewrite that to: job.finalStage().parents().forall(toScalaFn(stage -> { stageMap.put(stage.id(), stage); return true; })); but it is so weird that it is intermittent and I don't get why it is happening.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-11 02:22:25

*Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-11 03:11:13

*Thread Reply:* Hi @Paweł Leszczyński, sorry for the late reply. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-11 04:12:46

*Thread Reply:* Opened an issue and PR. Do help check if its okay thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-11 04:29:33

*Thread Reply:* please run ./gradlew spotlessApply with Java 8

✅ Anirudh Shrinivason
Pietro Brunetti (pietrobrunetti89@gmail.com)
2023-05-10 05:49:00

Hi all, I’m new to openlineage (and marquez) so I’m trying to figure out if it could be the right option for a client usecase in which:
• a legacy custom data catalog (mongo backend + Java API backend for a frontend in angular)
• AS-IS component lineage relations are retrieved in a custom way from each component’s APIs
• the customer would like to bring in a basic data lineage feature based on already published metadata that represents custom workload types (batch, streaming, interactive ones) + data access patterns (no direct relation with the datasources right now but only an abstraction layer upon them)
I’d like to exploit Marquez directly as the metastore to publish metadata about datasources and workloads (the workload is the declaration + business logic code deployed into the customer platform) once the component is deployed (e.g. the service that exposes the specific access pattern, or the workload custom declaration), but I saw the openlineage spec is based on strict coupling between run, job and datasource; I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also

Am I in the right place? Thanks anyway :)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-10 07:36:33

*Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also This is not something that we support yet - there are definitely a lot of plans and preliminary work for that.

Pietro Brunetti (pietrobrunetti89@gmail.com)
2023-05-10 07:57:44

*Thread Reply:* Thanks for the response, btw I already took a look at the current capabilities provided by openlineage, so my “hidden” question is how do I achieve what the customer wants in order to be integrated in some way with openlineage+marquez? should I choose between make or buy (between already supported platforms) and then try to align “static” (aka declarative) lineage metadata within the openlineage conceptual model?

Michael Robinson (michael.robinson@astronomer.io)
2023-05-10 11:04:20

@channel This month’s TSC meeting is tomorrow at 10am PT. All are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1683213923529529

John Lukenoff (john@jlukenoff.com)
2023-05-11 12:59:42

Does anyone here have experience with vendors in this space like Atlan or Manta? I’m advocating pretty heavily for OpenLineage at my company and have a strong suspicion that the LoE of enabling an equivalent solution from a vendor is equal or greater than that of OL/Marquez. Curious if anyone has first-hand experience with these tools they might be willing to share?

👋 Eric Veleker
👀 Pietro Brunetti
Ernie Ostic (ernie.ostic@getmanta.com)
2023-05-11 13:58:28

*Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta 🙂 ]. I'll let others answer as to their experience with ourselves or many other vendors that provide lineage, but want to mention that a variety of our customers are finding it beneficial to bring code based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds, for analyzing the lineage of their existing systems, where rich parsers already exist (for everything from legacy ETL tools, reporting tools, rdbms, etc.), to newer or home-grown technologies where applying OpenLineage is a viable alternative.

👍 John Lukenoff
Brad Paskewitz (bradford.paskewitz@fivetran.com)
2023-05-11 14:12:04

*Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?

👋 Eric Veleker
John Lukenoff (john@jlukenoff.com)
2023-05-11 14:16:34

*Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.

Harel Shein (harel.shein@gmail.com)
2023-05-11 14:52:40

*Thread Reply:* @Brad Paskewitz that’s 100% our plan to extend the OL spec to support “run-less” events. We want to collect that static metadata for Datasets and Jobs outside of the context of a run through OpenLineage. happy to get your feedback here as well: https://github.com/OpenLineage/OpenLineage/pull/1839

:gratitude_thank_you: Brad Paskewitz
Eric Veleker (eric@atlan.com)
2023-05-11 14:57:46

*Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.

John Lukenoff (john@jlukenoff.com)
2023-05-12 11:26:32

*Thread Reply:* Thanks all! These comments are really informative, it’s exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use-cases.

Ernie Ostic (ernie.ostic@getmanta.com)
2023-05-12 12:34:32

*Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-12 11:15:10

Hello all, I would like to have a new release of Openlineage as the new code base seems to have some issues fixed. I need these fixes for my project.

➕ Michael Robinson, Maciej Obuchowski, Julien Le Dem, Jakub Dardziński, Anirudh Shrinivason, Harshini Devathi, Paweł Leszczyński, pankaj koti
Michael Robinson (michael.robinson@astronomer.io)
2023-05-12 11:19:02

*Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-12 13:12:43

*Thread Reply:* A release on Monday is totally fine @Michael Robinson.

Michael Robinson (michael.robinson@astronomer.io)
2023-05-15 08:37:39

*Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi

👍 Anirudh Shrinivason, Harshini Devathi
Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-16 20:16:07

*Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response

Michael Robinson (michael.robinson@astronomer.io)
2023-05-15 12:09:24

@channel We released OpenLineage 0.25.0, including:
Additions:
• Spark: merge into query support #1823 @pawel-big-lebowski
Fixes:
• Spark: fix JDBC query handling #1808 @nataliezeller1
• Spark: filter Delta adaptive plan events #1830 @pawel-big-lebowski
• Spark: fix Java class cast exception #1844 @Anirudh181001
• Flink: include missing fields of Openlineage events #1840 @pawel-big-lebowski
Plus doc changes and more. Thanks to all the contributors!
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.25.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.24.0...0.25.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🙌 Jakub Dardziński, Sai, pankaj koti, Paweł Leszczyński, Perttu Salonen, Maciej Obuchowski, Fraser Marlow, Ross Turk, Harshini Devathi, tati
Michael Robinson (michael.robinson@astronomer.io)
2023-05-16 14:03:01

@channel If you’re planning on being in San Francisco at the end of June — perhaps for this year’s Data+AI Summit — please stop by Astronomer’s offices on California Street on 6/27 for the first SF OpenLineage Meetup. We’ll be discussing spec changes planned for OpenLineage v1.0.0, progress on Airflow AIP 53, and more. Plus, dinner will be provided! For more info and to sign up, check out the OL blog. Join us!

🙌 alexandre bergere, Anirudh Shrinivason, Harel Shein, Willy Lulciuc, Jarek Potiuk, Ross Turk, John Lukenoff, tati
Willy Lulciuc (willy@datakin.com)
2023-05-16 14:13:16

*Thread Reply:* Can’t wait! 💯

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-17 00:09:23

Hi, I've been noticing this error that is intermittently popping up in some of the spark jobs: AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. Increasing the spark.scheduler.listenerbus.eventqueue.size spark config did not help either. Any ideas on how to mitigate this issue? Seeing this in spark 3.1.2 btw
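
One possible avenue, as a sketch: in Spark 3.x the queue length is governed by spark.scheduler.listenerbus.eventqueue.capacity (the .size name is the older Spark 2.x property), and it can also be raised for a single queue such as appStatus:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        # default capacity for all listener queues (Spark's default is 10000)
        .config("spark.scheduler.listenerbus.eventqueue.capacity", "30000")
        # the dropped events above came from the appStatus queue specifically
        .config("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "30000")
        .getOrCreate())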

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-17 01:58:28

*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to console? This would let us confirm if the issue is related to event generation or to emitting it and waiting for the backend to respond.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-17 01:59:03

*Thread Reply:* Ahh okay sure. Let me see if I can do that
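
A sketch of what Paweł is suggesting, assuming the console transport available in recent openlineage-spark versions (the property name may differ on older releases):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
        # print OpenLineage events to the driver log instead of POSTing to a backend
        .config("spark.openlineage.transport.type", "console")
        .getOrCreate())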

Sai (saivenkatesh161@gmail.com)
2023-05-17 01:52:15

Hi Team

We are seeing an issue with an OL-configured cluster where delta table merge is failing with the below error. It is running fine when we run with other clusters where OL is not configured. I ran it multiple times assuming it's an intermittent issue with memory, but it keeps on failing with the same error. Attached the code for reference. We are using the latest release (0.25.0)

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError

@Paweł Leszczyński @Michael Robinson

👀 Paweł Leszczyński
Sai (saivenkatesh161@gmail.com)
2023-05-19 03:55:51

*Thread Reply:* Hi @Paweł Leszczyński

Thanks for fixing the issue; with the new release, merge is working. But I could not see any input and output datasets for this. Let me know if you need any further details to look into this.

},
"job": {
    "namespace": "openlineage_poc",
    "name": "spark_ol_integration_execute_merge_into_command_edge",
    "facets": {}
},
"inputs": [],
"outputs": [],
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-19 04:00:01

*Thread Reply:* Oh man, it's just that vanilla spark differs from the one available in databricks platform. our integration tests do verify behaviour on vanilla spark which still leaves a possibility for inconsistency. will need to get back to it then at some time.

Sai (saivenkatesh161@gmail.com)
2023-06-02 02:11:28

*Thread Reply:* Hi @Paweł Leszczyński

Did you get chance to look into this issue?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:13:18

*Thread Reply:* Hi Sai, I am going back to spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal delta operations that unnecessarily trigger the events

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:13:28

*Thread Reply:* this may be related to the issue you created

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:14:13

*Thread Reply:* I do have planned creating integration test for databricks which will be helpful to tackle the issues you raised

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:14:27

*Thread Reply:* so yes, I am looking at the Spark

Sai (saivenkatesh161@gmail.com)
2023-06-02 02:20:06

*Thread Reply:* thanks much Pawel.. I am looking more into the merge part as first priority as we use it frequently.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:21:01

*Thread Reply:* I know, this is important.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:21:14

*Thread Reply:* It just still needs some time.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 02:21:46

*Thread Reply:* thank you for your patience and being so proactive on those issues.

Sai (saivenkatesh161@gmail.com)
2023-06-02 02:22:12

*Thread Reply:* no problem.. Please do keep us posted with updates..

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-17 10:47:27

Our recent Openlineage release (0.25.0) proved there are many users that use Openlineage on databricks, which is incredible. I am super happy to know that, although we realised that as a side effect of a bug. Sorry for that.

I would like to opt for a new release which contains PR #1858 and should unblock databricks users.

➕ Paweł Leszczyński, Maciej Obuchowski, Harshini Devathi, Jakub Dardziński, Sai, Anirudh Shrinivason, Anbarasi
👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-05-18 10:26:48

*Thread Reply:* The release request has been approved and will be initiated shortly.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-17 22:49:41

Actually, I noticed a few other stack overflow errors on 0.25.0. Let me raise an issue. Could we cut a release once these bugs are fixed too please?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-18 02:29:55

*Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on the top of main?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-05-18 06:21:05

*Thread Reply:* Hi I haven't managed to try with the main branch. But if its the same error then all's good! If the error resurfaces then we can look into it.

Lovenish Goyal (lovenishgoyal@gmail.com)
2023-05-18 02:21:13

Hi All,

We are in the POC phase of OpenLineage integration with our dbt Core; can anyone help me with a document to start with?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-18 02:28:31

*Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt

Lovenish Goyal (lovenishgoyal@gmail.com)
2023-05-18 02:41:39

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the reply, I tried the same but am facing the below issue

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url:

Looks like I need to start the service

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-05-18 02:44:09

*Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez. More about it: https://marquezproject.ai/quickstart
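
Once Marquez is up, the dbt side is roughly (URL and namespace values are examples; dbt-ol wraps the regular dbt command):

    pip install openlineage-dbt

    OPENLINEAGE_URL=http://localhost:5000 \
    OPENLINEAGE_NAMESPACE=dbt_poc \
    dbt-ol run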

Harel Shein (harel.shein@gmail.com)
2023-05-18 10:27:52

*Thread Reply:* @Lovenish Goyal how are you running dbt core currently?

Lovenish Goyal (lovenishgoyal@gmail.com)
2023-05-19 01:55:20

*Thread Reply:* Trying, but facing an issue while running Marquez @Jakub Dardziński

Lovenish Goyal (lovenishgoyal@gmail.com)
2023-05-19 01:56:03

*Thread Reply:* @Harel Shein we have created custom docker image of DBT + Airflow and running it on an EC2

Harel Shein (harel.shein@gmail.com)
2023-05-19 09:05:31

*Thread Reply:* for running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There’s also built in support for collecting lineage if you have the airflow-openlineage provider installed. https://astronomer.github.io/astronomer-cosmos/#quickstart

Harel Shein (harel.shein@gmail.com)
2023-05-19 09:06:30

*Thread Reply:* RE issues running Marquez, can you share what those are? I’m guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-05-19 09:06:53

*Thread Reply:* @Harel Shein I've already helped with running Marquez 🙂

:first_place_medal: Harel Shein, Paweł Leszczyński, Michael Robinson
Anbarasi (anbujothi@gmail.com)
2023-05-18 02:29:53

@Paweł Leszczyński We are facing the following issue with Azure databricks. When we use aggregate functions in databricks notebooks, Open lineage is not able to provide column level lineage. I understand it's an existing issue. Can you please let me know in which release this issue will be fixed? It is one of the most needed features for us to implement openlineage in our current project. Kindly let me know

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-18 02:34:35

*Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-18 02:41:11

*Thread Reply:* Please also be aware, that it is extremely helpful to investigate the issues on your own before creating them.

Our integration traverses spark's logical plans and extracts lineage events from plan nodes that it understands. Some plan nodes are not supported yet and, from my experience, when working on an issue, 80% of time is spent on reproducing the scenario.

So, if you are able to produce a minimal amount of spark code that reproduces an issue, this can be extremely helpful and significantly speed up resolution time.

Anbarasi (anbujothi@gmail.com)
2023-05-18 03:52:30

*Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.

Provided sample codes with and without using aggregate functions and its respective lineage events for reference.

  1. Please find the code without using aggregate function:

     final_df=spark.sql("""
     select productid
     ,OrderQty as TotalOrderQty
     ,ReceivedQty as TotalReceivedQty
     ,StockedQty as TotalStockedQty
     ,RejectedQty as TotalRejectedQty
     from openlineage_poc.purchaseorder
     --group by productid
     order by productid""")

     final_df.write.mode("overwrite").saveAsTable("openlineage_poc.productordertest1")

Please find the Openlineage Events for the Input, Output datasets. We could find the column lineage in this.

"inputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "PurchaseOrderID", "type": "integer" }, { "name": "PurchaseOrderDetailID", "type": "integer" }, { "name": "DueDate", "type": "timestamp" }, { "name": "OrderQty", "type": "short" }, { "name": "ProductID", "type": "integer" }, { "name": "UnitPrice", "type": "decimal(19,4)" }, { "name": "LineTotal", "type": "decimal(19,4)" }, { "name": "ReceivedQty", "type": "decimal(8,2)" }, { "name": "RejectedQty", "type": "decimal(8,2)" }, { "name": "StockedQty", "type": "decimal(9,2)" }, { "name": "RevisionNumber", "type": "integer" }, { "name": "Status", "type": "integer" }, { "name": "EmployeeID", "type": "integer" }, { "name": "NationalIDNumber", "type": "string" }, { "name": "JobTitle", "type": "string" }, { "name": "Gender", "type": "string" }, { "name": "MaritalStatus", "type": "string" }, { "name": "VendorID", "type": "integer" }, { "name": "ShipMethodID", "type": "integer" }, { "name": "ShipMethodName", "type": "string" }, { "name": "ShipMethodrowguid", "type": "string" }, { "name": "OrderDate", "type": "timestamp" }, { "name": "ShipDate", "type": "timestamp" }, { "name": "SubTotal", "type": "decimal(19,4)" }, { "name": "TaxAmt", "type": "decimal(19,4)" }, { "name": "Freight", "type": "decimal(19,4)" }, { "name": "TotalDue", "type": "decimal(19,4)" } ] }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc/gold", "name": "openlineagepoc.purchaseorder", "type": "TABLE" } ] } }, "inputFacets": {} } ], "outputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/productordertest1", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "productid", "type": "integer" }, { "name": "TotalOrderQty", "type": "short" }, { "name": "TotalReceivedQty", "type": "decimal(8,2)" }, { "name": "TotalStockedQty", "type": "decimal(9,2)" }, { "name": "TotalRejectedQty", "type": "decimal(8,2)" } ] }, "storage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet", "storageLayer": "unity", "fileFormat": "parquet" }, "columnLineage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": 
"https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet", "fields": { "productid": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "ProductID" } ] }, "TotalOrderQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "OrderQty" } ] }, "TotalReceivedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "ReceivedQty" } ] }, "TotalStockedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "StockedQty" } ] }, "TotalRejectedQty": { "inputFields": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "field": "RejectedQty" } ] } } }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc", "name": "openlineagepoc.productordertest1", "type": "TABLE" } ] }, "lifecycleStateChange": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet", "lifecycleStateChange": "OVERWRITE" } }, "outputFacets": {} } ]

Anbarasi (anbujothi@gmail.com)
2023-05-18 03:55:04

*Thread Reply:* 2. Please find the code using aggregate function:

    final_df=spark.sql("""
    select productid
    ,sum(OrderQty) as TotalOrderQty
    ,sum(ReceivedQty) as TotalReceivedQty
    ,sum(StockedQty) as TotalStockedQty
    ,sum(RejectedQty) as TotalRejectedQty
    from openlineage_poc.purchaseorder
    group by productid
    order by productid""")

    final_df.write.mode("overwrite").saveAsTable("openlineage_poc.productordertest2")

Please find the Openlineage Events for the Input, Output datasets. We couldn't find the column lineage in the output section. Please find the sample

"inputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/gold/purchaseorder", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "PurchaseOrderID", "type": "integer" }, { "name": "PurchaseOrderDetailID", "type": "integer" }, { "name": "DueDate", "type": "timestamp" }, { "name": "OrderQty", "type": "short" }, { "name": "ProductID", "type": "integer" }, { "name": "UnitPrice", "type": "decimal(19,4)" }, { "name": "LineTotal", "type": "decimal(19,4)" }, { "name": "ReceivedQty", "type": "decimal(8,2)" }, { "name": "RejectedQty", "type": "decimal(8,2)" }, { "name": "StockedQty", "type": "decimal(9,2)" }, { "name": "RevisionNumber", "type": "integer" }, { "name": "Status", "type": "integer" }, { "name": "EmployeeID", "type": "integer" }, { "name": "NationalIDNumber", "type": "string" }, { "name": "JobTitle", "type": "string" }, { "name": "Gender", "type": "string" }, { "name": "MaritalStatus", "type": "string" }, { "name": "VendorID", "type": "integer" }, { "name": "ShipMethodID", "type": "integer" }, { "name": "ShipMethodName", "type": "string" }, { "name": "ShipMethodrowguid", "type": "string" }, { "name": "OrderDate", "type": "timestamp" }, { "name": "ShipDate", "type": "timestamp" }, { "name": "SubTotal", "type": "decimal(19,4)" }, { "name": "TaxAmt", "type": "decimal(19,4)" }, { "name": "Freight", "type": "decimal(19,4)" }, { "name": "TotalDue", "type": "decimal(19,4)" } ] }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc/gold", "name": "openlineagepoc.purchaseorder", "type": "TABLE" } ] } }, "inputFacets": {} } ], "outputs": [ { "namespace": "dbfs", "name": "/mnt/dlzones/warehouse/openlineagepoc/productordertest2", "facets": { "dataSource": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "dbfs", "uri": "dbfs" }, "schema": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "productid", "type": "integer" }, { "name": "TotalOrderQty", "type": "long" }, { "name": "TotalReceivedQty", "type": "decimal(18,2)" }, { "name": "TotalStockedQty", "type": "decimal(19,2)" }, { "name": "TotalRejectedQty", "type": "decimal(18,2)" } ] }, "storage": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet", "storageLayer": "unity", "fileFormat": "parquet" }, "symlinks": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": 
"https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers": [ { "namespace": "/mnt/dlzones/warehouse/openlineagepoc", "name": "openlineagepoc.productordertest2", "type": "TABLE" } ] }, "lifecycleStateChange": { "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.25.0/integration/spark", "schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet", "lifecycleStateChange": "OVERWRITE" } }, "outputFacets": {} } ]

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-18 04:09:17

*Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861

Anbarasi (anbujothi@gmail.com)
2023-05-18 04:11:56

*Thread Reply:* Thanks for considering the request and looking into it

Michael Robinson (michael.robinson@astronomer.io)
2023-05-18 13:12:35

@channel We released OpenLineage 0.26.0, including:
Additions:
• Proxy: Fluentd proxy support (experimental) #1757 @pawel-big-lebowski
Changes:
• Python client: use Hatchling over setuptools to orchestrate Python env setup #1856 @gaborbernat
Fixes:
• Spark: fix logicalPlan serialization issue on Databricks #1858 @pawel-big-lebowski
Plus an additional fix, doc changes and more. Thanks to all the contributors, including new contributor @gaborbernat!
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.26.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.25.0...0.26.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

❤️ Paweł Leszczyński, Maciej Obuchowski, Anirudh Shrinivason, Peter Hicks, pankaj koti
Bramha Aelem (bramhaaelem@gmail.com)
2023-05-18 14:42:49

Hi Team , can someone please address https://github.com/OpenLineage/OpenLineage/issues/1866.

Julien Le Dem (julien@apache.org)
2023-05-18 20:13:09

*Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.

Bramha Aelem (bramhaaelem@gmail.com)
2023-05-18 21:15:30

*Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-19 02:13:57

*Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.

Bramha Aelem (bramhaaelem@gmail.com)
2023-05-22 11:08:58

*Thread Reply:* Hi @Paweł Leszczyński - I replied with necessary details in the ticket. Please let me know if you need any more details.

Bramha Aelem (bramhaaelem@gmail.com)
2023-05-25 15:22:42

*Thread Reply:* Hi @Paweł Leszczyński - any further updates on issue?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-26 01:56:47

*Thread Reply:* hi @Bramha Aelem, i was out of office for a few days. will get back into this soon. thanks for update.

Bramha Aelem (bramhaaelem@gmail.com)
2023-05-27 18:46:27

*Thread Reply:* Hi @Paweł Leszczyński -Thanks for your reply. will wait for your response to proceed further on the issue.

Bramha Aelem (bramhaaelem@gmail.com)
2023-06-02 19:29:08

*Thread Reply:* Hi @Paweł Leszczyński -Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

Bramha Aelem (bramhaaelem@gmail.com)
2023-06-09 12:43:54

*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

👀 Paweł Leszczyński
Bramha Aelem (bramhaaelem@gmail.com)
2023-07-06 10:29:01

*Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into query which I have posted. can you please provide any thoughts on my observation/query.

John Doe (adarsh.pansari@tigeranalytics.com)
2023-05-19 03:42:21

Hello Everyone, I was trying to integrate openlineage with Jupyter Notebooks. I followed the docs, but when I run the sample notebook I am getting an error: 23/05/19 07:39:08 ERROR EventEmitter: Could not emit lineage w/ exception. Can someone please help me understand why I am getting this error and the resolution?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-19 03:49:27

*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting Openlineage events.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-19 03:49:41

*Thread Reply:* what do you want to do with the events?

John Doe (adarsh.pansari@tigeranalytics.com)
2023-05-19 04:10:24

*Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it build Lineages.

As for the transport config I am using the codes from the documentation to setup OL. https://openlineage.io/docs/integrations/spark/quickstart_local

Apart from these I don't have anything else in my nb

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-19 04:38:58

*Thread Reply:* ok, I am wondering if what you experience isn't similar to issue #1860. Could you try openlineage 0.23.0 to see if you get the same error?

<https://github.com/OpenLineage/OpenLineage/issues/1860>

John Doe (adarsh.pansari@tigeranalytics.com)
2023-05-19 10:05:59

*Thread Reply:* I tried with 0.23.0 still getting the same error

John Doe (adarsh.pansari@tigeranalytics.com)
2023-05-23 02:34:52

*Thread Reply:* @Paweł Leszczyński any other way I can try to setup. The issue still persists

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-29 03:53:04

*Thread Reply:* hmm, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.

Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-05-22 09:41:55

Hello Team!!! A part of my master thesis's case study was about data lineage in data mesh and how open-source initiatives such as OpenLineage and Marquez can realize this. Can you recommend me some material that can support the writing part of my thesis (more context: I tried to extract lineage events from Snowflake through Airflow and used Docker Compose on EC2 to connect Airflow and the Marquez webserver)? We will divide the thesis into a few academic papers to make the content more digestible and publish one of them soon hopefully!

👍 Ernie Ostic, Maciej Obuchowski, Ross Turk, Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-05-22 16:34:00

*Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you’ve already started checking out the docs, but there are many other resources on the website, as well (on the blog and resources pages in particular). And don’t skip the YouTube channel, where we’ve recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!

👀 Tom van Eijk
Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-05-22 16:48:06

*Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. Keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is 😉

Michael Robinson (michael.robinson@astronomer.io)
2023-05-22 16:52:18

*Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!

Ernie Ostic (ernie.ostic@getmanta.com)
2023-05-22 09:57:50

Hi Tom. Good luck. Sounds like a great case study. You might want to compare and contrast various kinds of lineage solutions....all of which complement each other, as well as having their own pros and cons. (...code based lineage via parsing, data similarity lineage, run-time lineage reporting, etc.) ...and then focus on open source and OpenLineage with Marquez in particular.

🙏 Tom van Eijk
Tom van Eijk (t.m.h.vaneijk@tilburguniversity.edu)
2023-05-22 10:04:44

*Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!

Michael Robinson (michael.robinson@astronomer.io)
2023-05-22 16:37:44

@channel For an easily digestible recap of recent events, communications and releases in the community, please sign up for our new monthly newsletter! Look for it in your inbox soon.

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-22 23:32:16

looking here https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json#L64 it shows that the schemaURL must be set, but then the examples in https://openlineage.io/getting-started#step-1-start-a-run do not contain it, is this a bug, or expected? 😄

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-23 07:24:09

*Thread Reply:* yeah, it's a bug

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-23 12:00:48

*Thread Reply:* so it's optional then? 😄 or bug in the example?
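For reference, a minimal START event carrying the top-level schemaURL the spec requires might look like this sketch (runId, names, and the spec version in the URL are illustrative):

{
  "eventType": "START",
  "eventTime": "2023-05-23T12:00:00Z",
  "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
  "job": {"namespace": "my-namespace", "name": "my-job"},
  "inputs": [],
  "outputs": [],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/main/client/python",
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"
}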

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-23 12:02:09

I noticed that DataQualityAssertionsDatasetFacet inherits from InputDatasetFacet (https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json), though I think it should inherit from DatasetFacet like all the others 🤔

Michael Robinson (michael.robinson@astronomer.io)
2023-05-23 14:20:09

@channel Two years ago last Saturday, we released the first version of OpenLineage, a test release of the Python client. So it seemed like an appropriate time to share our first annual ecosystem survey, which is both a milestone in the project’s growth and an important effort to set our course. This survey has been designed to help us learn more about who is using OpenLineage, what your lineage needs are, and what new tools you hope the project will support. Thank you in advance for taking the time to share your opinions and vision for the project! (Please note: the survey might seem longer than it actually is due to the large number of optional questions. Not all questions apply to all use cases.)

🙌 Harel Shein, Maciej Obuchowski, Atif Tahir, Peter Hicks, Tamara Fingerlin, Paweł Leszczyński
Sharanya Santhanam (santhanamsharanya@gmail.com)
2023-05-23 18:59:46

OpenLineage Spark integration: our Spark workloads on Spark 2.4 are correctly setting .config("spark.sql.catalogImplementation", "hive"), however SQL queries for CREATE/INSERT INTO don't recognize the datasets as "Hive". As per https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/supported-commands.md, USING HIVE is needed for appropriate parsing. Why is that the case? Why can't the HQL format for CREATE/INSERT be supported?

Sharanya Santhanam (santhanamsharanya@gmail.com)
2023-05-23 19:01:43

*Thread Reply:* @Michael Collado wondering if you could shed some light here

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-24 05:39:01

*Thread Reply:* can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable

Sharanya Santhanam (santhanamsharanya@gmail.com)
2023-05-24 19:37:02

*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. Example:

== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand <s3a://uchmsdev03/default/sharanyaOutputTable>, false, [id#89], Parquet, [serialization.format=1, mergeSchema=false, partitionOverwriteMode=dynamic], Append, CatalogTable(
Database: default
Table: sharanyaoutputtable
Owner: 2700940971
Created Time: Thu Jun 09 11:13:35 PDT 2022
Last Access: UNKNOWN
Created By: Spark 3.2.0
Type: EXTERNAL
Provider: hive
Table Properties: [transient_lastDdlTime=1654798415]
Location: <s3a://uchmsdev03/default/sharanyaOutputTable>
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Partition Columns: [`id`]
Schema: root
 |-- displayName: string (nullable = true)
 |-- serialnum: string (nullable = true)
 |-- osversion: string (nullable = true)
 |-- productfamily: string (nullable = true)
 |-- productmodel: string (nullable = true)
 |-- id: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@5fe23214, [displayName, serialnum, osversion, productfamily, productmodel, id]
+- Union false, false
   :- Relation default.tablea[displayName#84,serialnum#85,osversion#86,productfamily#87,productmodel#88,id#89] parquet
   +- Relation default.tableb[displayName#90,serialnum#91,osversion#92,productfamily#93,productmodel#94,id#95] parquet

Sharanya Santhanam (santhanamsharanya@gmail.com)
2023-05-24 19:39:54

*Thread Reply:* using Spark 3.2, and this is the query:

spark.sql(s"INSERT INTO default.sharanyaOutput select * from (SELECT * from default.tableA union all " +
  s"select * from default.tableB)")

Rakesh Jain (rakeshj@us.ibm.com)
2023-05-24 01:09:58

Is there any example of how sourceCodeLocation / git info can be used from a spark job? What do we need to set to be able to see that as part of metadata?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-24 05:37:06

*Thread Reply:* I think we can't really get it from the Spark context, as Spark jobs are submitted in compiled jar form, instead of plain text like, for example, Airflow DAGs.

Rakesh Jain (rakeshj@us.ibm.com)
2023-05-25 02:15:35

*Thread Reply:* How about Jupyter Notebook based spark job?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-25 08:44:18

*Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more

Michael Robinson (michael.robinson@astronomer.io)
2023-05-25 11:24:21

@channel Deprecation notice: support for Airflow 2.1 will end in about two weeks, when it will be removed from testing. The exact date will be announced as we get closer to it — this is just a heads up. After that date, use 2.1 at your own risk! (Note: the next release, 0.27.0, will still support 2.1.)

👍 Maciej Obuchowski
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-25 11:27:39

For the OpenLineageSparkListener, is there a way to configure it to send packets locally, e.g. save to a file? (instead of pushing to a URL destination)

alexandre bergere (alexandre.pro.bergere@gmail.com)
2023-05-25 12:00:04

*Thread Reply:* We developed a FileTransport class to save our metrics locally in a JSON file, if you're interested

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-25 12:00:37

*Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?

alexandre bergere (alexandre.pro.bergere@gmail.com)
2023-05-25 12:02:07

*Thread Reply:* yes, it saves all the JSON information, inputs/outputs included

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-05-25 12:03:03

*Thread Reply:* Yes! then I am very interested. Is there guidance on how to use the FileTransport class?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-25 13:06:22

*Thread Reply:* @alexandre bergere it would be a pretty useful contribution if you can submit it 🙂

🙌 alexandre bergere
alexandre bergere (alexandre.pro.bergere@gmail.com)
2023-05-25 13:08:28

*Thread Reply:* We are using it on a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)

👍 Julien Le Dem, Anirudh Shrinivason
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-05-25 13:56:48

*Thread Reply:* would be great to have. I had it in mind to implement as an enabler for databricks integration tests. great to hear that!

alexandre bergere (alexandre.pro.bergere@gmail.com)
2023-05-29 08:19:46

*Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 🙂 @Maciej Obuchowski could you tell me how to update the documentation once approved please?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-29 08:36:21

*Thread Reply:* @alexandre bergere we have separate repo for website + docs: https://github.com/OpenLineage/docs

🙏 alexandre bergere
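Before that PR landed, a do-it-yourself version of the same idea was already possible with the Python client: subclass Transport and append each event as one JSON line. A minimal sketch (class name, kind, and path are illustrative, not the merged implementation):

from openlineage.client import OpenLineageClient
from openlineage.client.serde import Serde
from openlineage.client.transport import Transport

class JsonFileTransport(Transport):
    """Appends every emitted event, inputs/outputs included, to a local ndjson file."""
    kind = "file"  # illustrative

    def __init__(self, path: str) -> None:
        self.path = path

    def emit(self, event) -> None:
        # Serde.to_json serializes the full event to JSON.
        with open(self.path, "a") as f:
            f.write(Serde.to_json(event) + "\n")

client = OpenLineageClient(transport=JsonFileTransport("/tmp/lineage.ndjson"))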
Bramha Aelem (bramhaaelem@gmail.com)
2023-05-25 16:40:26

Hi Team- When we run a Databricks job, a lot of DBFS path namespaces are getting created. Can someone please let us know how to overwrite the symlink namespaces and link them with the Spark app name or OpenLineage namespace in the Marquez UI.

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-05-26 09:09:09

Hello,

I am looking to connect the common data model in postgres marquez database and Azure Purview (which uses Apache Atlas API's) lineage endpoint. Does anyone have any how-to on this or can point me to some useful links for this?

Thanks in advance.

Michael Robinson (michael.robinson@astronomer.io)
2023-05-26 13:08:56

*Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview

Michael Robinson (michael.robinson@astronomer.io)
2023-05-26 16:13:38

*Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-06-01 23:23:49

*Thread Reply:* Thanks @Michael Robinson

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-26 12:44:09

Are there any constraints on facets? Such as, is it reasonable to expect that a single job will have a single parent? The schema hints at this by making the parent a single entry; but then one can send different parents for the START and COMPLETE events? 🤔

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-05-29 05:04:32

*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-30 10:32:09

*Thread Reply:* Any reason for that?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-01 10:25:33

*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.

This has advantages. For example, a job that modifies a dataset it's also reading from can get the particular version of the dataset it's reading and attach it on start; that would not work on complete, as the dataset would have changed by then.

Similarly, if the job is creating a dataset, we can't get additional metadata on it up front, so we can attach that information only on complete.

There are also cases where we want facets to be cumulative. The reason for this is streaming jobs. For example, with Apache Flink, we could emit metadata on each checkpoint (or every N checkpoints) that shows how the job is progressing.

Generally, consumers should be agnostic to that, but we don't want to overspecify what consumers should do - as people might want to use OL data in different ways, or even ignore some data we're sending.
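To make the cumulative model concrete, here is a trimmed sketch of two events for the same run (facet bodies shortened; _producer and _schemaURL omitted): the START event pins the version of the input as it was read, and the COMPLETE event adds output metadata that only exists once the job finishes.

{"eventType": "START",
 "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
 "inputs": [{"namespace": "db", "name": "src_table",
             "facets": {"version": {"datasetVersion": "1234"}}}]}

{"eventType": "COMPLETE",
 "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
 "outputs": [{"namespace": "db", "name": "dst_table",
              "outputFacets": {"outputStatistics": {"rowCount": 100, "size": 2048}}}]}

A consumer is expected to merge facets across the run's events rather than treat any single event as complete.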

Bernat Gabor (gaborjbernat@gmail.com)
2023-05-30 17:49:54

Any reason why the lifecycle state change facet is not just on the output, but is also allowed on the inputs? 🤔 https://openlineage.io/docs/spec/facets/dataset-facets/lifecycle_state_change I can't see how it would be interpreted for an input 🤔

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-01 10:18:48

*Thread Reply:* I think it should be output-only, yes.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-01 10:19:14

*Thread Reply:* @Paweł Leszczyński what do you think?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-02 08:35:13

*Thread Reply:* yes, should be output only I think

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-05 13:39:07

*Thread Reply:* should we move it over then? 😄

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-05 13:39:31

*Thread Reply:* under Output Dataset Facets that is

Michael Robinson (michael.robinson@astronomer.io)
2023-06-01 12:30:00

@channel The first issue of OpenLineage News is now available. To get it directly in your inbox when it’s published, become a subscriber.

🚀 Willy Lulciuc, Jakub Dardziński, Maciej Obuchowski, Bernat Gabor, Harel Shein, Laurent Paris, Tamara Fingerlin, Perttu Salonen
🔥 Willy Lulciuc, Natalie Zeller, Ernie Ostic, Laurent Paris
💯 Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2023-06-01 14:23:17

*Thread Reply:* Correction: Julien and Willy’s talk at Data+AI Summit will take place on June 28

Michael Robinson (michael.robinson@astronomer.io)
2023-06-01 13:50:23

Hello all, I’m opening a vote to release 0.27.0, featuring:
• Spark: fixed column lineage from databricks in the case of aggregate queries
• Python client: configurable job-name filtering
• Airflow: fixed urllib.parse.urlparse in case of [] values
Three +1s from committers will authorize an immediate release.

➕ Jakub Dardziński, Maciej Obuchowski, Willy Lulciuc, Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2023-06-02 10:30:39

*Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-02 13:13:18

@channel This month’s TSC meeting is next Thursday, June 8th, at 10:00 am PT. On the tentative agenda: announcements, meetup updates, recent releases, static lineage progress, and open discussion. More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.

🙌 Sheeri Cabral (Collibra), Maciej Obuchowski, Harel Shein, alexandre bergere, Paweł Leszczyński, Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2023-06-05 12:34:29

@channel We released OpenLineage 0.27.1, including:
Additions:
• Python client: add emission filtering mechanism and exact, regex filters #1878 @mobuchowski
Fixes:
• Spark: fix column lineage for aggregate queries on databricks #1867 @pawel-big-lebowski
• Airflow: fix unquoted [ and ] in Snowflake URIs #1883 @JDarDagran
Plus a CI fix and a proposal. For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.27.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.26.0...0.27.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🙌 Sheeri Cabral (Collibra)
Bernat Gabor (gaborjbernat@gmail.com)
2023-06-05 13:01:06

Looking for a reviewer under: https://github.com/OpenLineage/OpenLineage/pull/1892 🙂

🙌 Sheeri Cabral (Collibra), Paweł Leszczyński, Michael Robinson
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-06-05 15:47:08

*Thread Reply:* @Bernat Gabor thanks for the PR!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-06 08:17:47

Hey, I request release 0.27.2 to fix potential breaking change in Python client in 0.27.1: https://github.com/OpenLineage/OpenLineage/pull/1908

➕ Jakub Dardziński, Paweł Leszczyński, Michael Robinson, Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2023-06-06 10:58:23

*Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-06 12:33:55

@channel We released OpenLineage 0.27.2, including:
Fixes:
• Python client: deprecate client.from_environment, do not skip loading config #1908 @Maciej Obuchowski
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.27.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.27.1...0.27.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👍 Maciej Obuchowski
Bernat Gabor (gaborjbernat@gmail.com)
2023-06-06 14:22:18

Found a major bug in the python client - https://github.com/OpenLineage/OpenLineage/pull/1917, if someone can review

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-06 14:54:47

And also https://github.com/OpenLineage/OpenLineage/pull/1913 🙂 that fixes the type information not being packaged

Michael Robinson (michael.robinson@astronomer.io)
2023-06-07 09:48:58

@channel This month’s TSC meeting is tomorrow, and all are welcome! https://openlineage.slack.com/archives/C01CK9T7HKR/p1685725998982879

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 11:11:31

Hi team,

I wanted a lineage of my data for my tables and column level. I am using jupyter notebook and spark code.

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.12.0')
    .config('spark.openlineage.host', 'http://marquez-api:5000')
    .config('spark.openlineage.namespace', 'spark_integration')
    .getOrCreate())

I used this and then opened the localhost:3000 for marquez

I can see my job there, but when I click on the job, when it's supposed to show lineage, it's just an empty screen

John Lukenoff (john@jlukenoff.com)
2023-06-08 12:39:20

*Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it’s related to this issue: https://github.com/MarquezProject/marquez/issues/2410

John Lukenoff (john@jlukenoff.com)
2023-06-08 12:40:01

*Thread Reply:* Seems like more of a Marquez client-side issue than something with OL

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 12:43:02

*Thread Reply:* ohh but if I try using the console output, it throws ClientProtocolError

John Lukenoff (john@jlukenoff.com)
2023-06-08 12:43:41

*Thread Reply:* Sorry I mean in the dev console of your web browser

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 12:44:43

*Thread Reply:* this is the dev console in browser

John Lukenoff (john@jlukenoff.com)
2023-06-08 12:47:59

*Thread Reply:* Seems like it’s coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL so maybe the schema is incompatible with the version Marquez is expecting

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 12:51:21

*Thread Reply:* from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.12.0')
    .config('spark.openlineage.host', 'http://marquez-api:5000')
    .config('spark.openlineage.namespace', 'spark_integration')
    .getOrCreate())

spark.sparkContext.setLogLevel("INFO")

spark.createDataFrame([
    {'a': 1, 'b': 2},
    {'a': 3, 'b': 4}
]).write.mode("overwrite").saveAsTable("temp_table8")

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 12:51:49

*Thread Reply:* This is my only code, I havent done anything apart from this

John Lukenoff (john@jlukenoff.com)
2023-06-08 12:52:30

*Thread Reply:* I would try a more recent version of OL. Looks like you’re using 0.12.0 and I think the project is on 0.27.x currently

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 12:55:07

*Thread Reply:* so i should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?

👍 John Lukenoff, Julien Le Dem
Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:10:03

*Thread Reply:* it executed well, unable to see it in marquez

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:18:16

*Thread Reply:* marquez didnt get updated

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:20:44

*Thread Reply:* I am actually doing a POC on OpenLineage to find table and column level lineage for my team at Amazon. If this goes through, the team could use openlineage to track data lineage on a larger scale..

John Lukenoff (john@jlukenoff.com)
2023-06-08 13:24:49

*Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:25:10

*Thread Reply:* yes i did that as well

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:25:49

*Thread Reply:* the error was present only once you clicked on any of the jobs in marquez; since my job isn't showing up, I can't check for the error itself

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:26:29

*Thread Reply:* docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

used this to rebuild marquez

John Lukenoff (john@jlukenoff.com)
2023-06-08 13:26:54

*Thread Reply:* That’s odd, sorry, that’s probably the most I can help, I’m kinda new to OL/Marquez as well 😅

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:27:41

*Thread Reply:* no problem, can you refer me to someone who would know, so that i can ask them?

John Lukenoff (john@jlukenoff.com)
2023-06-08 13:29:25

*Thread Reply:* Actually looking at in now I think you’re using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0. that’s what I’m using

John Lukenoff (john@jlukenoff.com)
2023-06-08 13:30:10

*Thread Reply:* Other than that I would ask in the marquez slack channel or raise an issue in github on that project. Seems like more of an issue with Marquez, since at least some data is rendering in the UI initially

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:32:58

*Thread Reply:* nope that version also didnt help

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:33:19

*Thread Reply:* can you share their slack link?

John Lukenoff (john@jlukenoff.com)
2023-06-08 13:34:52

*Thread Reply:* http://bit.ly/MarquezSlack

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 13:35:08

*Thread Reply:* that link is no longer active

Julien Le Dem (julien@apache.org)
2023-06-09 18:44:25

*Thread Reply:* Hello @Rachana Gandhi could you point to the doc where you found the example .config(‘spark.jars.packages’, ‘io.openlineage:openlineage_spark:0.12.0’) ? We should update it to have the latest version instead.

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-09 18:54:49

*Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-09 18:59:17

*Thread Reply:* https://openlineage.io/docs/guides/spark

also the docker compose here has an earlier version of marquez

Harshit Soni (harshit.soni@angelbroking.com)
2023-07-13 17:00:54

*Thread Reply:* Facing same issue with my initial POC. Did we get any solution for this?

Rachana Gandhi (rachana.gandhi410@gmail.com)
2023-06-08 11:11:46
Bernat Gabor (gaborjbernat@gmail.com)
2023-06-08 14:36:38

Approve a new release 🙂

➕ Michael Robinson, Willy Lulciuc, Maciej Obuchowski, Jakub Dardziński
Michael Robinson (michael.robinson@astronomer.io)
2023-06-08 14:43:55

*Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-08 14:44:14

*Thread Reply:* Yeah, that one 😊

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-08 14:44:44

*Thread Reply:* Because the python client is broken as is today without a new release

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-06-08 18:45:04

*Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.

Bernat Gabor (gaborjbernat@gmail.com)
2023-06-08 19:06:34

*Thread Reply:* cool

Michael Robinson (michael.robinson@astronomer.io)
2023-06-12 13:15:26

@channel We released OpenLineage 0.28.0, including:
Added
• dbt: add Databricks compatibility #1829 @Ines70
Fixed
• Fix type-checked marker and packaging #1913 @gaborbernat
• Python client: add schemaURL to run event #1917 @gaborbernat
For the details, see:
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.28.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.27.2...0.28.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🚀 Maciej Obuchowski, Willy Lulciuc, Francis McGregor-Macdonald
👍 Ines DAHOUMANE -COWORKING PARIS-
Michael Robinson (michael.robinson@astronomer.io)
2023-06-12 14:35:56

@channel Meetup announcement: there’s another meetup happening soon! This one will be an evening event on 6/22 in New York at Collibra’s HQ. For details and to sign up, please join the meetup group: https://www.meetup.com/data-lineage-meetup/events/294065396/. Thanks to @Sheeri Cabral (Collibra) for cohosting and providing a space.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-12 23:27:16

Hi, just curious, does openlineage have a log4j integration?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-13 04:44:28

*Thread Reply:* Do you mean to just log events to logging backend?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-13 04:54:30

*Thread Reply:* Hmm more like have a separate logging config for sending all the logs to a backend

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-13 04:54:38

*Thread Reply:* Not the events itself

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-13 05:01:10

*Thread Reply:* @Anirudh Shrinivason with Spark integration?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-13 05:01:59

*Thread Reply:* It uses slf4j so you should be able to set up your log4j logger

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-13 05:10:55

*Thread Reply:* Yeah with the spark integration. Ahh I see. Okay sure thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-21 23:21:14

*Thread Reply:* ~Hi @Maciej Obuchowski May I know what the class path I should be using for setting up the log4j if I want to set it up for OL related logs? Is there some guide or runbook to setting up the log4j with OL? Thanks!~ Nvm lol found it! 🙂
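For anyone searching later: the integration logs through slf4j, so the usual Spark logging configuration applies. A sketch for Spark's conf/log4j2.properties (Spark 3.3+; the "openlineage" property id is arbitrary), assuming you want DEBUG only for OpenLineage classes:

logger.openlineage.name = io.openlineage
logger.openlineage.level = debug

On older Spark images that still ship log4j 1.x, the equivalent line would be log4j.logger.io.openlineage=DEBUG.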

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-13 12:19:01

Hello all, we are just starting to use Marquez as part of our POC. We are following the getting started guide at https://openlineage.io/getting-started/ to set up the environment on an AWS EC2 instance. When we run ./docker/up.sh, it is not bringing up the marquez-web container. Also, we are not able to access the Admin UI at ports 5000 and 5001.

Docker version: 24.0.2 Docker compose version: 2.18.1 OS: Ubuntu_20.04

Can someone please let me know what I am missing? Note: I had to modify the docker-compose command in up.sh as per Docker Compose V2.

Also, we are seeing the following log when our load balancer is checking for health:

WARN [2023-06-13 15:35:31,040] marquez.logging.LoggingMdcFilter: status: 404
172.30.1.206 - - [13/Jun/2023:15:35:42 +0000] "GET / HTTP/1.1" 200 535 "-" "ELB-HealthChecker/2.0" 1
172.30.1.206 - - [13/Jun/2023:15:35:42 +0000] "GET / HTTP/1.1" 404 43 "-" "ELB-HealthChecker/2.0" 2
WARN [2023-06-13 15:35:42,866] marquez.logging.LoggingMdcFilter: status: 404
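A note for load-balancer health checks: Marquez is a Dropwizard app, so a 404 for / on the API port is expected; the admin port serves a dedicated healthcheck endpoint. Assuming the guide's default ports, pointing the target-group health check there avoids the 404s above:

curl http://localhost:5001/healthcheck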

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-06-14 10:42:41

*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/open-lineage-spark using the docker image available to help Vamshi and me, or provide any pointers? Thank you

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-15 03:38:38

*Thread Reply:* if you're working on a mac, you can have an issue related to port 5000. The instructions here https://github.com/MarquezProject/marquez#quickstart provide a workaround for that: ./docker/up.sh --api-port 9000

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-06-15 08:43:33

*Thread Reply:* @Paweł Leszczyński, thank you. We were using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or the admin interface. We have run out of ideas of what else to try differently to get this setup up and running

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-22 14:47:00

*Thread Reply:* @Michael Robinson Can you please help us here?

Michael Robinson (michael.robinson@astronomer.io)
2023-06-22 14:58:57

*Thread Reply:* @Vamshi krishna I’m sorry you’re still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-06-22 16:35:00

*Thread Reply:* @Michael Robinson, thank you, vamshi and i will share the errors that we are running into shortly

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 09:48:16

*Thread Reply:* We are following https://openlineage.io/getting-started/ guide and trying to set up Marquez on a ubuntu ec2 instance. Following are versions of docker, docker compose and ubuntu

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 09:49:51

*Thread Reply:* @Michael Robinson When we follow the documentation without changing anything and run sudo ./docker/up.sh we are seeing following errors:

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:00:38

*Thread Reply:* So, I edited up.sh file and modified docker compose command by removing --log-level flag and ran sudo ./docker/up.sh and found following errors:

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:02:29

*Thread Reply:* Then I copied .env.example to .env since compose needs .env file

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:05:04

*Thread Reply:* I got this error:

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:09:24

*Thread Reply:* since I am getting timeouts, I thought it might be an issue with proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker and added my outbound proxy and tried

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:23:46

*Thread Reply:* @Michael Robinson Then it kind of worked but seeing following errors:

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:24:31

*Thread Reply:*

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:25:29

*Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please see above steps and let us know what are we missing/doing wrong? I appreciate your help and time.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-23 10:45:39

*Thread Reply:* The latest errors look to me like they’re being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:46:55

*Thread Reply:* Yes, we are using default ports:
API_PORT=5000
API_ADMIN_PORT=5001
WEB_PORT=3000
TAG=0.35.0

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:47:40

*Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web

Michael Robinson (michael.robinson@astronomer.io)
2023-06-23 10:52:38

*Thread Reply:* I would try running ./docker/up.sh --api-port 9000 (see Pawel’s message above for more context.)

👍 Vamshi krishna
Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:54:18

*Thread Reply:* Still no luck. Seeing same errors.

marquez-db | 2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 10:54:43

*Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! java.net.UnknownHostException: postgres
marquez-api | ! at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
marquez-api | ! at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
marquez-api | ! at java.base/java.net.Socket.connect(Socket.java:633)
marquez-api | ! at org.postgresql.core.PGStream.createSocket(PGStream.java:243)
marquez-api | ! at org.postgresql.core.PGStream.<init>(PGStream.java:98)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
marquez-api | ! ... 26 common frames omitted
marquez-api | ! Causing: org.postgresql.util.PSQLException: The connection attempt failed.
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
marquez-api | ! at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:253)
marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:434)
marquez-api | ! at org.postgresql.Driver.connect(Driver.java:291)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:768)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:696)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:495)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:153)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.<init>(JdbcConnectionFactory.java:75)
marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147)
marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190)
marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:73)
marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:27)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:105)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:48)
marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67)
marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78)
marquez-api | ! at io.dropwizard.Application.run(Application.java:94)
marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:60)
marquez-api | INFO [2023-06-23 14:53:42,274] marquez.MarquezApp: Stopping app...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-23 11:06:32

*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error, java.net.UnknownHostException: postgres, may just be a result of the container being down. Could you verify if all the containers are up and running, and if not, what's the error? Are you able to test this docker/up.sh on your laptop or in another environment?

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 11:08:34

*Thread Reply:* Docker commands require sudo and cannot be run as another user. The Postgres container is not coming up. It is failing with the following errors:

marquez-db | 2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-23 11:10:19

*Thread Reply:* and what does docker ps -a say about postgres container? why did it fail?

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 11:11:36

*Thread Reply:*

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-23 11:25:17

*Thread Reply:* hmyy, no changes on our side have been made to postgresql.conf since August 2022. Did you apply any changes, or do you have a clean clone of the repo?

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 11:29:46

*Thread Reply:* No we didn't make any changes

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-23 11:32:21

*Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 11:34:54

*Thread Reply:* Yes, all I did was modify this line: docker-compose --log-level ERROR $compose_files up $ARGS to docker compose $compose_files up $ARGS, since docker compose v2 doesn't support the --log-level flag

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 11:37:03

*Thread Reply:* Let me pull an older version and try

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 12:09:43

*Thread Reply:* Still no luck same exact errors. Tried on a different ubuntu instance. Still seeing same errors with postgres

Vamshi krishna (vnallamothu@cardinalcommerce.com)
2023-06-23 15:06:32

*Thread Reply:* @Jeremy W

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-15 10:40:47

Hi all, a general doubt. Would the column lineage associated with a job be present in both the start events and the complete events? Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-15 10:49:42

*Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other? Yes. Generally events regarding single run are cumulative

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-15 11:07:03

*Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-15 22:50:51

*Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message. 🙇

👀 Paweł Leszczyński
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-16 04:48:57

*Thread Reply:* Actually, in this case this definitely should not happen. @Paweł Leszczyński am I right?

:gratitude_thank_you: Anirudh Shrinivason
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-16 04:50:16

*Thread Reply:* @Maciej Obuchowski yes, you're right

:gratitude_thank_you: Anirudh Shrinivason
nivethika R (nivethikar8@gmail.com)
2023-06-15 11:14:33

Hi All.. Is JDBC supported for OpenLineage and Marquez for column lineage? I did some POC using tables in a Postgres DB and I am able to see all events, but for columnLineage I am getting NULL. Not sure what I am missing.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-16 02:14:19

*Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-16 05:02:26

*Thread Reply:* @nivethika R, I am sorry for the misleading response, we've merged a PR for that: https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select *, but besides that, it should be operational.

Could you please try a query from our integration tests to verify if this is working for you or not: https://github.com/OpenLineage/OpenLineage/pull/1636/files#diff-137aa17091138b69681510e13e3b7d66aa9c9c7c81fe8fe13f09f0de76448dd5R46 ?
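For a quick self-check, a job of the shape used in those tests, reading over JDBC with explicit column projections rather than select *, should produce the columnLineage facet. A sketch (connection details and table names are illustrative):

jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/db")
    .option("user", "user")
    .option("password", "pass")
    .option("query", "select id, name from source_table")  # explicit columns, not select *
    .load())

jdbc_df.write.mode("overwrite").saveAsTable("target_table")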

Nagendra Kolisetty (nkolisetty@geico.com)
2023-06-16 12:12:00

Hi There,

Nagendra Kolisetty (nkolisetty@geico.com)
2023-06-16 12:12:43

We are trying to install the image on a private AKS cluster and we ended up with the below error

kubectl : pod marquez/pgsql-postgresql-client terminated (StartError)
At line:1 char:1
+ kubectl run pgsql-postgresql-client --rm --tty -i --restart='Never' `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (pod marquez/pgs...ed (StartError):String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

failed to create containerd task: failed to create shim task: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "PGPASSWORD=macondo": executable file not found in $PATH: unknown

Nagendra Kolisetty (nkolisetty@geico.com)
2023-06-16 12:13:13

We followed the below article to install Marquez in AKS (Azure). By the way, we pulled the images from Docker Hub, pushed them to our ACR, and tried installing PostgreSQL via ACR; it failed with the error above.

https://github.com/MarquezProject/marquez/blob/main/docs/running-on-aws.md

Michael Robinson (michael.robinson@astronomer.io)
2023-06-21 11:07:04

*Thread Reply:* Hi Nagendra, sorry you’re running into this error. We’re looking into it!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-18 09:53:19

Hi, I found this error in a couple of the Spark jobs: https://github.com/OpenLineage/OpenLineage/issues/1930. Would request your help to kindly patch it, thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-19 09:37:20

*Thread Reply:* Hey @Anirudh Shrinivason, me and Paweł are at Berlin Buzzwords right now. Will definitely look at it later

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-19 10:47:06

*Thread Reply:* Oh nice! Thanks!

ayush mittal (ayushmittal057@gmail.com)
2023-06-20 03:14:02

Hi Team, we are not able to generate lineage for aggregate functions while joining two tables. Below is the query:

df2 = spark.sql("select th.ProductID as Pid, pd.Name as N, sum(th.quantity) as TotalQuantity, sum(th.ActualCost) as TotalCost from silveradventureworks.transactionhistory as th join productdescription_dim as pd on th.ProductID = pd.ProductID group by th.ProductID, pd.Name ")

Rahul (rahul812ry@gmail.com)
2023-06-20 03:47:50

*Thread Reply:* This is the event generated for above query.

ayush mittal (ayushmittal057@gmail.com)
2023-06-20 03:18:22

And one more issue: we are not able to generate OpenLineage events on top of a view being created by joining multiple tables. I have attached log events for your reference.

ayush mittal (ayushmittal057@gmail.com)
2023-06-20 03:31:11

this is the event for the view for which no lineage is being generated

John Lukenoff (john@jlukenoff.com)
2023-06-20 13:59:00

Has anyone here successfully implemented the Amundsen OpenLineage extractor? I’m a little confused on the best way to output my lineage events to ndjson files in a scalable way as the docs seem to suggest. Currently I’m pushing all my lineage events to Marquez via REST API. I suppose I could change my transports to Kinesis and write the events to s3 but that comes with the cost of having to build some new way of getting the events to Marquez.

In any case, this seems like a problem someone must have solved before?

Edit: looking at the source code for this Amundsen extractor, it seems like it should be pretty straightforward to just implement our own extractor that can pull these records from the Marquez backend. Will give that a shot and see about getting that merged into Amundsen later.

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-06-20 17:34:08

*Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn 🙂

Michael Robinson (michael.robinson@astronomer.io)
2023-06-20 14:21:03

Our New York meetup with Collibra is happening in just two days! https://openlineage.slack.com/archives/C01CK9T7HKR/p1686594956030059

👍 Maciej Obuchowski
Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-06-20 14:31:56

Hello all, do you know if we have the possibility of persisting column order while creating lineage, as it may be available in the table or dataset from which it originates? Or, is there some way in which we can get the column order (an id or something)?

For example, if a dataset has columns xyz, abc, fgh, dec, I would like to know which column shows first in the dataset in the common data model. Please let me know.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-20 17:33:36

*Thread Reply:* Hi Harshini, I’ve alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-06-20 19:39:46

*Thread Reply:* Thank you Michael, looking forward to it

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-21 02:58:41

*Thread Reply:* Hello @Harshini Devathi. An interesting topic which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if a Spark plan contains columns a, b, c, we trust that's the order of columns for a dataset and don't want to check it on our own.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-21 02:59:45

*Thread Reply:* btw, please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?

Harshini Devathi (harshini.devathi@tigeranalytics.com)
2023-06-23 14:40:31

*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number, which would help represent the lineage in the same order?

We are capturing the lineage using the Spark OpenLineage connector, by posting custom lineage to Marquez through API calls, and we are also in the process of leveraging the SQL connector feature using Airflow.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-26 04:35:43

*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and field name. You can, on your side, generate a column id based on that and order columns based on the id, but still I think I am missing some arguments behind doing so.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-21 17:41:48

Attention all Bay-area data friends and Data+AI Summit attendees: our first San Francisco meetup is next Tuesday! https://www.meetup.com/meetup-group-bnfqymxe/events/293448130/

🙌 alexandre bergere
Michael Robinson (michael.robinson@astronomer.io)
2023-06-23 16:41:29

Last night in New York we held a meetup with Collibra at their lovely HQ in the Financial District! Many thanks to @Sheeri Cabral (Collibra) for inviting us. Over a bunch of tasty snacks (thanks for the idea @Harel Shein), we discussed:
• the history and evolution of the spec, and trends in adoption
• progress on the OpenLineage Provider in Airflow (AIP 53)
• progress on “static” AKA design lineage support (expected soon in OpenLineage 1.0.0)
• progress in the LFAI program
• a proposal to add “jobless run” support for auditing use cases and similar edge cases
• an idea to throw a hackathon for creating validation tests and example payloads (would you be interested in participating? let us know!)
• and more.
Many thanks to:
• @Julien Le Dem for making the trip
• Sheeri & Collibra for hosting
• everyone for coming, including second-timer @Ernie Ostic and new member @Shirley Lu
It was great meeting/catching up with everyone. Hope to see you and more new faces at the next one!

🎉 Harel Shein, Peter Hanssens, Ernie Ostic, Paweł Leszczyński, Maciej Obuchowski, Shirley Lu
Michael Robinson (michael.robinson@astronomer.io)
2023-06-26 10:59:08

Our first San Francisco meetup is tomorrow at 5:30 PM at Astronomer’s offices in the Financial District. https://openlineage.slack.com/archives/C01CK9T7HKR/p1687383708927189

🚀 alexandre bergere
Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 03:43:10

I can’t seem to get OL logging working with Spark. Any guidance please?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 03:45:31

*Thread Reply:* Is it because the logLevel is set to WARN or ERROR?

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:07:12

*Thread Reply:* No, I set it to INFO, may be I need to add some jars?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 12:30:02

*Thread Reply:* Hmm have you set the relevant spark configs?

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:32:50

*Thread Reply:* yep, I have http working. But not the console:
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=console

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 12:35:27

*Thread Reply:* Oh wait http works but not console...

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 12:37:02

*Thread Reply:* If you want to see the console events which are emitted, then you need to set the logLevel to DEBUG

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:37:44

*Thread Reply:* tried that too, still nothing

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 12:38:54

*Thread Reply:* Is the openlienage jar installed and added to config?

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:39:09

*Thread Reply:* yep, that’s why http works

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:39:26

*Thread Reply:* the only thing I see in the logs is this: 23/06/27 07:39:11 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerJobEnd

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-27 12:40:59

*Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-27 12:42:37

*Thread Reply:* sure, thanks for trying @Anirudh Shrinivason

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-06-28 05:23:36

*Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-28 12:16:52

*Thread Reply:* Hi @Maciej Obuchowski, actually I also noticed a similar issue... For some Spark pipelines, the log level is set to DEBUG, but I'm not seeing any events being logged. I am, however, receiving these events in the backend. Has any of the logging been removed from some places?

Rakesh Jain (rakeshj@us.ibm.com)
2023-06-28 20:57:45

*Thread Reply:* yep, exactly same thing here also @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.

John Lukenoff (john@jlukenoff.com)
2023-06-27 20:45:15

@here A bunch of us are downstairs in the lobby at 8 California but no one is down here to let us up. Anyone here to help?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-29 03:36:36

Hi guys, I noticed a few of the jobs getting OOMed while running with openlineage. Even increasing the number of executors and doubling the memory does not seem to fix it actually. This is observed especially when using the graphx libs. Is this a known issue? Just curious as to what the cause might be... The same jobs run fine once openlineage is disabled. Are there some rogue threads from the listener or any connections we aren't closing properly?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-29 05:57:59

*Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-29 05:58:28

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-06-29 05:59:55

*Thread Reply:* We do serialize logicalPlan because this is useful in many cases, but sometimes can lead to serializing things that shouldn't be serialized

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-06-29 15:49:35

*Thread Reply:* Ahh I see. Yeah okay let me try that
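For reference, the same flag can also be set where the session is built; this is just the config key from the integration README applied in PySpark:

# Sketch: disable serialization of the logical plan facet (and unknown plan entries)
spark = (SparkSession.builder
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate())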

Michael Robinson (michael.robinson@astronomer.io)
2023-06-30 08:01:34

Hello all, I’m opening a vote to release OpenLineage 0.29.0, including:
• support for Spark 3.4
• support for Flink 1.17.1
• a fix in the Flink integration to enable dataset schema extraction for a KafkaSource when GenericRecord is used
• removal of the unused Golang proxy client (made redundant by the fluentd proxy)
• security vulnerability fixes, doc changes, test improvements, and more.
Three +1s from committers will authorize an immediate release.

➕ Jakub Dardziński, Paweł Leszczyński, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-06-30 08:05:53

*Thread Reply:* Thanks, all. The release is authorized.

Michael Robinson (michael.robinson@astronomer.io)
2023-06-30 13:27:35

@channel We released OpenLineage 0.29.2, including:
Added
• Flink: support Flink version 1.17.1 #1947 @pawel-big-lebowski
• Spark: support Spark version 3.4 #1790 @pawel-big-lebowski
Removed
• Proxy: remove unused Golang client approach #1926 @mobuchowski
• Req: bump minimum supported Python version to 3.8 #1950 @mobuchowski
Note: this removes support for Python 3.7, which is at EOL.
Plus test improvements, docs changes, bug fixes and more. Thanks to all the contributors!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.29.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.28.0...0.29.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👍 Shirley Lu, Maciej Obuchowski, Paweł Leszczyński, Tamara Fingerlin
Michael Robinson (michael.robinson@astronomer.io)
2023-06-30 17:23:04

@channel The latest issue of OpenLineage News is now available, featuring a recap of recent events, releases, and more. To get it directly in your inbox each month, sign up here: https://openlineage.us14.list-manage.com/track/click?u=fe7ef7a8dbb32933f30a10466&id=e598962936&e=ef0563a7f8

👍 Maciej Obuchowski, Paweł Leszczyński, Tristan GUEZENNEC -CROIX-, Tamara Fingerlin, Jeremy W, Anirudh Shrinivason, Julien Le Dem, Sheeri Cabral (Collibra), alexandre bergere
Michael Robinson (michael.robinson@astronomer.io)
2023-07-06 13:36:44

@channel This month’s TSC meeting is next Thursday, 7/13, at a special time: 8 am PT. All are welcome! On the tentative agenda:
• announcements
• updates
• recent releases
• a new DataGalaxy integration
• open discussion

✅ Sheeri Cabral (Collibra), Maciej Obuchowski, alexandre bergere, Paweł Leszczyński, Willy Lulciuc, Anirudh Shrinivason, Shirley Lu
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-07-07 10:35:08

Wow, I just got finished watching @Julien Le Dem and @Willy Lulciuc’s presentation of OpenLineage at databricks and it’s really fantastic! There isn’t a better 30 minutes of content on theory + practice than this, IMO. https://www.databricks.com/dataaisummit/session/cross-platform-data-lineage-openlineage/ (you can watch for free by making an account. I’m not affiliated with databricks…)

databricks.com
❤️ Willy Lulciuc, Harel Shein, Yuanli Wang, Ross Turk, Michael Robinson, Jakub Dardziński, Conor Beverland, Maciej Obuchowski, Jarek Potiuk, Julien Le Dem, Chris Folkes, Anirudh Shrinivason, Shirley Lu
Willy Lulciuc (willy@datakin.com)
2023-07-07 10:37:49

*Thread Reply:* thanks for watching and sharing! the recording is also on youtube 😉 https://www.youtube.com/watch?v=rO3BPqUtWrI

YouTube
} Databricks (https://www.youtube.com/@Databricks)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-07-07 10:38:01

*Thread Reply:* ❤️

Jarek Potiuk (jarek@potiuk.com)
2023-07-08 13:35:10

*Thread Reply:* Very much agree. I’ve even forwarded to a few people here and there, those who I think should learn about it.

❤️ Sheeri Cabral (Collibra)
Julien Le Dem (julien@apache.org)
2023-07-08 13:47:17

*Thread Reply:* You’re both too kind :) Thank you for your support and being part of the community.

❤️ Sheeri Cabral (Collibra), Jarek Potiuk
Michael Robinson (michael.robinson@astronomer.io)
2023-07-07 15:44:33

@channel If you registered for TSC meetings through AddEvent, first of all, thank you! Second of all, I’ve had to create a new event series there to enable the editing of individual events. When you have a moment, would you please register for next week’s meeting? Apologies for the inconvenience.

addevent.com
👍 Kiran Hiremath, Willy Lulciuc, Shirley Lu
Juan Manuel Cappi (juancappi@gmail.com)
2023-07-10 12:29:31

Hi community, we are interested in capturing time-travel usage for Iceberg Spark SQL in column lineage. For instance, INSERT INTO schema.table select * from schema.another_table version as of 'some_version' . Column lineage is currently missing the version, if used, which is actually quite relevant. I've gone through the open issues and didn't see anything similar. Does it look like a valid use case scenario? We started going through the OL, Iceberg and Spark code trying to capture/expose it, but so far we haven't been able to. If anyone can give a hint/idea/pointer, we are willing to give it a try and contribute back the code

👀 Rakesh Jain, Nitin Ramchandani
Julien Le Dem (julien@apache.org)
2023-07-11 05:46:36

*Thread Reply:* I think yes this is a great use case. @Paweł Leszczyński is more familiar with the spark integration code than I. I think in this case, we would add the datasetVersion facet with the underlying Iceberg version: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json We extract this information in a few places: https://github.com/search?q=repo%3AOpenLineage%2FOpenLineage%20%20DatasetVersionDatasetFacet&type=code

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-11 05:57:17

*Thread Reply:* Yes, we do have datasetVersion, which captures the iceberg or delta version for output and input datasets. Input versions are collected on START while output versions are collected on COMPLETE, in case a job reads and writes to the same dataset. So, even though the column-lineage facet is missing the version, it should be available within events related to a particular run.

If it is not, then perhaps the issue here is the lack of support for the as of syntax. As far as I remember, we always get the current version of a dataset, and this may be the missing part here.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-11 05:58:49

*Thread Reply:* link to a method that gets dataset version for iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

Juan Manuel Cappi (juancappi@gmail.com)
2023-07-11 10:57:26

*Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński Based on what I've seen so far, indeed it seems that only the current snapshot is tracked. Looking at IcebergHandler.getDatasetVersion(), initially I was expecting to be able to obtain the snapshotId from the SparkTable which comes within getDatasetVersion(), but now I realize that OL is using an older version of the Iceberg runtime (0.12.1), which does not support time travel (introduced in 0.14.1). The evidence is: • Iceberg documentation for release 0.14.1: https://iceberg.apache.org/docs/0.14.0/spark-queries/#sql • Iceberg release notes https://iceberg.apache.org/releases/#0140-release • Comparing the source code, I see the SparkTable from 0.14.1 onward does have a snapshotId instance variable, while previous versions don't https://github.com/apache/iceberg/blob/0.14.x/spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L82 https://github.com/apache/iceberg/blob/0.12.x/spark3/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L78

I don't see anyone complaining about the old version of the Iceberg runtime being used, and there is no open issue to upgrade, so I'll open the issue. Please let me know if that seems reasonable as the immediate next step to take

Juan Manuel Cappi (juancappi@gmail.com)
2023-07-11 15:48:53

*Thread Reply:* Created issues: #1969 and #1970

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-12 07:15:14

*Thread Reply:* Thanks @Juan Manuel Cappi. The openlineage-spark jar contains modules like spark3 , spark32 , spark33 and spark34, which is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against the latest iceberg version. Once this is done, #1969 can be closed. For #1970, one would need to implement a datasetBuilder within the spark34 module that visits the node within Spark's logical plan responsible for as of and creates the dataset for the OpenLineage event in some way other than getting the latest snapshot version.

Juan Manuel Cappi (juancappi@gmail.com)
2023-07-13 12:51:19

*Thread Reply:* @Paweł Leszczyński I've seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but the other versions (spark33, spark32, etc.) have not been upgraded in that PR. Since the change is small and does not break any tests, I've created PR #1976 to fix #1969. That alone unlocks some time travel lineage (i.e. the dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-14 04:37:55

*Thread Reply:* Hi @Juan Manuel Cappi, You're right. After discussing with you I realized we support some version of iceberg (for Spark 3.3 it's still 0.14.0), but this is not the latest iceberg version matching the Spark version.

There's a tricky part here. Although we want our code to succeed with the latest Spark, we don't want it to fail in a nasty way (class not found exception) when a user is working with an old iceberg version. There are places in our code where we check: are iceberg classes on the classpath? We need to extend this to: are iceberg classes on the classpath, and is the iceberg version above 0.14 or not? For sure this is the case for the merge into commands I am working on at the moment. Let's see if the other integration tests are affected in your PR

Amod Bhalerao (amod.bhalerao@gmail.com)
2023-07-11 08:09:57

Hi Team, I've seen that Kafka lineage is not coming through properly for Spark streaming. Is this being worked on?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-11 08:28:59

*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies events being sent when reading from or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-11 09:28:56

*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more spark plans that are stream related. But if you have an example of streaming that is not working for you, that would be really helpful.

Labels
integration/spark, streaming
Amod Bhalerao (amod.bhalerao@gmail.com)
2023-07-26 08:03:30

*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline

Amod Bhalerao (amod.bhalerao@gmail.com)
2023-07-26 08:06:51

*Thread Reply:* just one task is getting created

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 05:55:19

Hi guys, I notice that with the below spark configs:
```
from pyspark.sql import SparkSession
import os

os.environ["TEST_VAR"] = "1"

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.29.2,io.delta:delta-core_2.12:1.0.1')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.openlineage.transport.type', 'console')
         .config('spark.sql.catalog.spark_catalog', "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("hive.metastore.schema.verification", False)
         .config("spark.sql.warehouse.dir", "/tmp/")
         .config("hive.metastore.warehouse.dir", "/tmp/")
         .config("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=/tmp/metastore_db;create=true")
         .config("spark.openlineage.facets.custom_environment_variables", "[TEST_VAR;]")
         .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
         .config("spark.hadoop.fs.permissions.umask-mode", "000")
         .enableHiveSupport()
         .getOrCreate())
```
The custom environment variables facet is not kicking in. However, when all the delta related spark configs are removed, it is working fine. Is this a known issue? Are there any workarounds for it? Thanks!

👀 Paweł Leszczyński
Juan Manuel Cappi (juancappi@gmail.com)
2023-07-12 06:14:41

*Thread Reply:* Hi @Anirudh Shrinivason, I'm not familiar with Delta, but enabling debugging helped me a lot to understand what's going on when things fail silently. Just add at the end: spark.sparkContext.setLogLevel("DEBUG")

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 06:20:47

*Thread Reply:* Yeah I checked on debug

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 06:20:50

*Thread Reply:* There are no errors

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 06:21:10

*Thread Reply:* Just that there is no environment-properties in the event that is being emitted

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-12 07:31:01

*Thread Reply:* Hi @Anirudh Shrinivason, what spark version is that? I see your delta version is pretty old. Anyway, the observation is weird and I don't know how delta could interfere with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?

Glen M (glen_m@apple.com)
2023-07-12 19:29:06

*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java

Glen M (glen_m@apple.com)
2023-07-12 19:32:44

*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... delta .. which is azure databricks. @Anirudh Shrinivason

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 22:58:13

*Thread Reply:* Hmm I wasn't using databricks

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-12 22:59:12

*Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-13 08:05:49

*Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973

Labels
documentation, integration/spark
Comments
1
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-13 08:06:11

*Thread Reply:* PR description contains info on how come the observed behaviour was possible

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-13 08:07:47

*Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue 🚀 :medal: 👍

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-13 09:52:29

*Thread Reply:* Ohh that is really great! Thanks so much for the help! 🙂

Michael Robinson (michael.robinson@astronomer.io)
2023-07-12 13:50:51

@channel A friendly reminder: this month’s TSC meeting — open to all — is tomorrow at 8 am PT. https://openlineage.slack.com/archives/C01CK9T7HKR/p1688665004736219

} Michael Robinson (https://openlineage.slack.com/team/U02LXF3HUN7)
👍 Dongjin Seo
thebruuu (bruno.c@inwind.it)
2023-07-12 14:54:29

Hi Team, how are you? Is there any chance to use airflow to run queries against an Access file? Sorry to bother with a question that is not directly related to openlineage ... but I am kind of stuck

Harel Shein (harel.shein@gmail.com)
2023-07-12 15:22:52

*Thread Reply:* what do you mean by Access file?

thebruuu (bruno.c@inwind.it)
2023-07-12 16:09:03

*Thread Reply:* ... accdb file, a Microsoft Access file: I am in a reverse engineering project facing spaghetti style development, and would have loved to use airflow and openlineage as a magic wand to help me in this damn work

Harel Shein (harel.shein@gmail.com)
2023-07-12 21:44:21

*Thread Reply:* oof.. I’d look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ but I really have no clue..

thebruuu (bruno.c@inwind.it)
2023-07-13 09:47:02

*Thread Reply:* Thank you Harel I started from that too ... but it became foggy after the initial step

Aaman Lamba (aamanlamba@gmail.com)
2023-07-12 16:30:41

Hi folks, having an issue ingesting the seed metadata when starting the docker container. The output shows "seed-marquez-with-metadata exited with code 0" but no information is visible in Marquez. What can be the issue?

✅ Aaman Lamba
Michael Robinson (michael.robinson@astronomer.io)
2023-07-12 16:55:00

*Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?

Michael Robinson (michael.robinson@astronomer.io)
2023-07-12 16:55:12

*Thread Reply:* (Hi Aaman!)

Aaman Lamba (aamanlamba@gmail.com)
2023-07-12 16:55:45

*Thread Reply:* Hi! Thank you that helped!

Aaman Lamba (aamanlamba@gmail.com)
2023-07-12 16:55:55

*Thread Reply:* I think that should be added to the quickstart guide

🙌 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-07-12 16:56:23

*Thread Reply:* Great idea, thank you

Julien Le Dem (julien@apache.org)
2023-07-13 12:09:29

As discussed in the Monthly meeting, I have opened a PR to propose adding deletion to facets for static lineage metadata: https://github.com/OpenLineage/OpenLineage/pull/1975

Labels
documentation, proposal
Steven (xli@zjuici.com)
2023-07-13 23:21:29

Hi, I'm using the OL python client.
```
client.emit(
    DatasetEvent(
        eventTime=datetime.now().isoformat(),
        producer=producer,
        schemaURL="https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/DatasetEvent",
        dataset=Dataset(namespace=namespace, name=f"input-file"),
    )
)
```
I want to send a dataset event once files have been uploaded. But I received 422 from api/v1/lineage, saying that run and job must not be null. I don't have a job or run yet. How can I solve this?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-14 04:09:15

*Thread Reply:* Hi @Steven, I assume you send your OpenLineage events to Marquez. The 422 http code is a response from the backend, and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.

To sum up: what you do is correct. You are using a feature that is allowed on the client side but still not implemented on the backend.

Labels
docs, api, client/java
Comments
1
✅ Steven
🥳 Steven
Steven (xli@zjuici.com)
2023-07-14 04:10:30

*Thread Reply:* Thanks!!
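Until that Marquez PR lands, a pragmatic stopgap is to tolerate the rejection on the client side; a sketch, reusing the `client` and event construction from the snippet above:
```
# Marquez currently answers DatasetEvent with 422; swallow the error so the
# upload flow is not interrupted, and drop this once the PR is released.
try:
    client.emit(event)
except Exception as exc:
    print(f"Lineage emit skipped, backend rejected the event: {exc}")
```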

Harshit Soni (harshit.soni@angelbroking.com)
2023-07-14 08:36:23

@here Hi Team, I am trying to run a spark application with OpenLineage (Spark: 3.3.3, OpenLineage: 0.29.2). I am getting the below error. Can you please help me figure out what I could be doing wrong?

```
spark = (SparkSession
         .builder
         .config('spark.port.maxRetries', 100)
         .appName(app_name)
         .config("spark.openlineage.url", "http://localhost/api/v1/namespaces/spark_integration/")
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .getOrCreate())
```

```
23/07/14 18:04:01 ERROR Utils: uncaught error in thread spark-listener-group-shared, stopping SparkContext
java.lang.UnsatisfiedLinkError: /private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib: dlopen(/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib, 0x0001): tried: '/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (no such file), '/private/var/folders/z6/pl8p30z11v50zf6pv51p259m0000gp/T/native-lib4983292552717270883/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-18 02:35:18

*Thread Reply:* Hi @Harshit Soni, where are you deploying your spark? Locally or not? Is it on a mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue

Harshit Soni (harshit.soni@angelbroking.com)
2023-07-18 02:38:03

*Thread Reply:* Currently, was testing on local.

Harshit Soni (harshit.soni@angelbroking.com)
2023-07-18 02:39:43

*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for it using OpenLineage.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-18 05:16:55

*Thread Reply:* 👀

Michael Robinson (michael.robinson@astronomer.io)
2023-07-14 13:00:29

@channel If you missed this month’s TSC meeting, the recording is now available on our YouTube channel: https://youtu.be/2vD6-Uwr7ZE. A clip of Alexandre Bergere’s DataGalaxy integration demo is also available: https://youtu.be/l_HbEtpXphY.

YouTube
} OpenLineage Project (https://www.youtube.com/@openlineageproject6897)
YouTube
} OpenLineage Project (https://www.youtube.com/@openlineageproject6897)
👍 Kiran Hiremath, alexandre bergere, Harel Shein, Paweł Leszczyński
Robin Fehr (robin.fehr@acosom.com)
2023-07-16 17:39:26

Hey guys - trying to get a grip on the ecosystem regarding flink lineage 🙂 As far as my research has revealed, the openlineage project is the only one that supports flink lineage with an out-of-the-box library that can be integrated in jobs. At least as far as I've seen, for other toolings such as datahub we'd have to write our own custom hooks that implement their api. As for my question - is my current assumption correct that an integration of, for example, datahub/openmetadata into the openlineage project would also require support from datahub/openmetadata itself so that they can work with the openlineage spec? Or would it somewhat work to write a mapper in between to support their spec? (more of an architectural decision i assume, but I'd be interested in knowing what the openlineage approach is regarding that)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-17 08:13:49

*Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? I think yeah - maybe https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter would work out of the box if I understand correctly?

Harel Shein (harel.shein@gmail.com)
2023-07-17 08:38:59

*Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate

Natalie Zeller (natalie.zeller@naturalint.com)
2023-07-17 08:47:34

*Thread Reply:* Hi, We've implemented a transporter that transmits lineage from OpenLineage to OpenMetadata, you can find the github project here. I've also published a blog post that explains this integration and how to use it. I'll be happy to help if you have any question

🙌 Robin Fehr
Robin Fehr (robin.fehr@acosom.com)
2023-07-17 09:49:30

*Thread Reply:* very cool! thanks a lot for responding so quickly

Michael Robinson (michael.robinson@astronomer.io)
2023-07-17 18:23:53

🚀 We recently hit the 1000-member mark on here! Thank you for joining the movement to establish an open standard for data lineage across the data ecosystem! Tell your friends 🙂! 💯💯💯💯💯💯💯💯💯💯 https://bit.ly/lineageslack

🎉 Juan Manuel Cappi, Harel Shein, Paweł Leszczyński, Maciej Obuchowski, Willy Lulciuc, Viraj Parekh
💯 Harel Shein, Anirudh Shrinivason, Paweł Leszczyński, Maciej Obuchowski, Willy Lulciuc, Robin Fehr, Viraj Parekh, Ernie Ostic
👏 thebruuu
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-18 04:58:14

Btw, just curious what exactly does the runId correspond to in the OL spark integration? Is it possible to obtain the spark application id from the event too?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-18 05:10:31

*Thread Reply:* runId is a UUID assigned per spark action (a compute trigger within a spark job). A single spark script can thus result in multiple runs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-18 05:13:17

*Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String

spark.apache.org
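For anyone who needs to correlate events in the meantime, the application id Paweł refers to is directly available on the SparkContext (the app name below is just a placeholder):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("app_id_demo").getOrCreate()

# One id per Spark application, unlike the per-action OpenLineage runId.
print(spark.sparkContext.applicationId)  # e.g. 'local-1689671234567'
```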
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-18 23:06:01

*Thread Reply:* Got it thanks!

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-18 09:47:47

Hi, I have a use case to integrate queries run in a Jupyter notebook using pandas with OpenLineage, to get the lineage in Marquez. Has anyone implemented this before? Please let me know. Thanks

🤩 thebruuu
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-20 06:48:54

*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, then perhaps OpenLineage (OL) has little to offer. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108

Surely one can use the Python OL client to create events manually and send them to MQZ, which may be less convenient (https://github.com/OpenLineage/OpenLineage/tree/main/client/python)

Anyway, we would like to know what's your use case? This would be super helpful in understanding why an OL & pandas integration may be useful.
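For the manual route Paweł mentions, a minimal sketch with the Python client might look like the following (the namespace, job name, file paths and Marquez URL are all made-up placeholders):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumes a local Marquez

producer = "https://example.com/my-pandas-pipeline"  # hypothetical producer URI
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="pandas_transform")

# Emit START before the pandas work and COMPLETE after; with no pandas
# integration, the inputs and outputs are declared by hand.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
))

# ... pandas code that reads /data/input.csv and writes /data/output.csv ...

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="file", name="/data/input.csv")],
    outputs=[Dataset(namespace="file", name="/data/output.csv")],
))
```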

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-20 06:52:32

*Thread Reply:* Thanks Pawel for responding

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-19 02:57:57

Hi guys, when can we expect the next Openlineage release? Excited for MergeIntoCommand column lineage feature!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-19 03:40:20

*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex, because I want to refactor column level lineage so that it can work with multiple Spark versions and multiple delta jars, as the merge into implementation for delta differs across delta releases. I thought it was ready, but this needs some extra work to be done in the next days. I am excited about that too!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-19 03:54:37

*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all! 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-19 22:06:10

*Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-20 05:28:43

*Thread Reply:* we're pretty close i think with merge into delta which is under review. waiting for it would be nice. anyway, we're 3 weeks after the last release.

Michael Robinson (michael.robinson@astronomer.io)
2023-07-20 06:50:56

*Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching 1958 and then making a request in #general once it’s been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-20 11:01:14

*Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-21 03:12:22

*Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958

Labels
documentation, integration/spark
Comments
2
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-21 04:19:15

*Thread Reply:* Awesome thanks so much! @Paweł Leszczyński

Juan Manuel Cappi (juancappi@gmail.com)
2023-07-19 06:59:31

Hi there, related to my question a few days ago about usage of time travel in iceberg: currently only the alias used (i.e. tag, branch) is captured as part of the dataset identifier for lineage. If the tag is removed, or even worse, if it's removed and re-created with the same name pointing to a different snapshot_id, the lineage will be capturing an inaccurate history. So, ideally, we'd like to capture the actual snapshot_id behind the named reference as part of the lineage. Anyone else thinking this is a reasonable scenario? => more in 🧵

👀 Paweł Leszczyński, Dongjin Seo
Juan Manuel Cappi (juancappi@gmail.com)
2023-07-19 07:14:54

*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so, for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it’s explicit and it doesn’t require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that impacts readability. Not sure though if there are other non-obvious side-effects.

Another alternative would be to add a dedicated property. For instance, the job > latestRun schema, the input/output dataset version objects could look like this: "inputDatasetVersions": [ { "datasetVersionId": { "namespace": "<s3a://warehouse>", "name": "schema.table.tag", "snapshot_id": "7056736771450556218", "version": "1c634e18-e357-347b-b758-4337ac352d6d" }, "facets": {} } ] And column lineage could look like: ```"columnLineage": [ { "name": "somefield", "inputFields": [ { "namespace": "s3a:warehouse", "dataset": "schema.table.tag", "snapshotid": "7056736771450556218", "field": "some_field", ... }, ...

  ],

...```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-19 08:33:43

*Thread Reply:* @Paweł Leszczyński what do you think?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-19 08:38:16

*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version?
2. I don't think it's necessary (or I don't understand why) to add snapshot_id within column lineage. Each entry within inputFields of columnLineage is already available within inputs of the OL event related to this run.
Juan Manuel Cappi (juancappi@gmail.com)
2023-07-19 18:43:31

*Thread Reply:* Yes, I think I follow the idea. The problem with that is the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1, which stays the same for the source dataset, which is the one being used with time travel. Suppose the following sequence: step 1 => table_A.tag_v1 has snapshot id 123-abc run job: table_A.tag_v1 -> job x -> table_B the inputDatasetVersions > datasetVersionId > version for table_B points to an object which represents table_A.tag_v1, with snapshot id 123-abc correctly captured within facets > version > datasetVersion

step 2 => delete tag_v1, insert some data, create tag_v1 again; now table_A.tag_v1 has snapshot id 456-def run job again: table_A.tag_v1 -> job x -> table_B the inputDatasetVersions > datasetVersionId > version for table_B points to the same object which represents table_A.tag_v1, only now the snapshot id has been replaced by 456-def within facets > version > datasetVersion, which means I have no way to know which snapshot id was used in step 1

The “hack” I mentioned above, though, seems to solve the issue, since a new dataset is captured for each combination, so no information is overwritten/lost, i.e., the datasets referenced in inputDatasetVersions are now named: table_A.tag_v1-123-abc table_A.tag_v1-456-def

As a side effect, the column lineage also gets “fixed”: without the “hack”, the lineage for the step 1 and step 2 job runs both referenced table_A.tag_v1 as the source of the input field, though in each run the snapshot id was different. With the hack, one run references table_A.tag_v1-123-abc and the other table_A.tag_v1-456-def

Hope it makes sense. If it helps, I can put together a few json files with the examples I’ve been using to experiment

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-20 06:35:22

*Thread Reply:* So, my understanding of the problem is that the iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.

I would not like to mess with dataset names, because on the backend side (like Marquez), dataset names being the same across different jobs and runs allow creating the lineage graph. If dataset names are different, then there is no way to build a lineage graph across multiple jobs.

Adding snapshot_id to datasetVersion is one option to go. My concern here is that this is so iceberg specific, while we're aiming to have a general solution to dataset versioning.

Some other options are: send a concat of version+snapshotId as the version, or send only snapshot_id as the version. The second ain't that bad, as actually snapshotId is something we're aiming to get as a version, isn't it?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-21 04:21:26

Hi guys, I’d like to open a vote to release the next OpenLineage version! We'd really like to use the fixed CustomEnvironmentFacetBuilder for delta catalogs, and column lineage for Merge Into command in the spark integration! Thanks! 🙂

➕ Jakub Dardziński, Willy Lulciuc, Michael Robinson, Maciej Obuchowski, Anirudh Shrinivason
Michael Robinson (michael.robinson@astronomer.io)
2023-07-21 13:09:39

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.

Michael Robinson (michael.robinson@astronomer.io)
2023-07-25 13:44:47

*Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn’t working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we’re working on a fix and hope to complete the release soon.

Michael Robinson (michael.robinson@astronomer.io)
2023-07-25 15:19:49

*Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience 🙂

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-07-25 23:21:14

*Thread Reply:* Hi @Michael Robinson Thanks a lot!

Michael Robinson (michael.robinson@astronomer.io)
2023-07-26 08:52:24

*Thread Reply:* 👍

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 06:38:16

Hi, I am running a job in Marquez with 180 rows of metadata, but it has been running for more than an hour. Is there a way to check the log on Marquez? Below is a screenshot of the job:

Willy Lulciuc (willy@datakin.com)
2023-07-21 08:10:58

*Thread Reply:* > I am running a job in Marquez with 180 rows of metadata Do you mean that you have +100 rows of metadata in the jobs table for Marquez? Or that the job never finishes?

Willy Lulciuc (willy@datakin.com)
2023-07-21 08:11:47

*Thread Reply:* Also, yes, we have an event viewer that allows you to query the raw OL events

Willy Lulciuc (willy@datakin.com)
2023-07-21 08:12:19

*Thread Reply:* If you post a sample of your events, it’d be helpful to troubleshoot your issue
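For the event viewer route, Marquez also exposes raw events over its API; a quick sketch for pulling the most recent ones (endpoint and response shape per my reading of the Marquez API, with a local instance assumed):
```
import requests

resp = requests.get(
    "http://localhost:5000/api/v1/events/lineage",
    params={"limit": 10, "sortDirection": "desc"},
    timeout=10,
)
resp.raise_for_status()
for event in resp.json()["events"]:
    print(event["eventType"], event["job"]["namespace"], event["job"]["name"])
```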

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 08:53:25

*Thread Reply:*

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 08:53:31

*Thread Reply:* Sure Willy thanks for your response. The job is still running. This is the code I am running from jupyter notebook using Python client:

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 08:54:33

*Thread Reply:* as you can see my input and output datasets are just 1 row

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 08:55:02

*Thread Reply:* included column lineage but job keeps running so I don't know if it is working

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 06:38:49

Please ignore 'UPDATED AT' timestamp

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 07:56:48

@Paweł Leszczyński there is a lot of interest in our organisation in implementing OpenLineage in several projects, and we might take the Spark route, so on that note a small question: does OpenLineage work by extracting data from the Catalyst optimizer's physical/logical plans etc.?

👍 Paweł Leszczyński
❤️ Willy Lulciuc, Paweł Leszczyński, Maciej Obuchowski
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-21 08:20:33

*Thread Reply:* spark integration is based on extracting lineage from optimized plans
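A quick way to see those plans for any query (illustrative only, not part of the integration itself):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("plan_demo").getOrCreate()

# explain(extended=True) prints the parsed, analyzed, optimized and physical
# plans; the optimized logical plan is what the listener walks for lineage.
df = spark.range(10).selectExpr("id * 2 AS doubled")
df.explain(extended=True)
```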

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-21 08:25:35

*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few mins that explain how this is achieved (the link points to 22:06 of the video)

YouTube
} Databricks (https://www.youtube.com/@Databricks)
Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-07-21 08:43:47

*Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.

Jens Pfau (jenspfau@google.com)
2023-07-21 08:22:51

Hello everyone!

👋 Jakub Dardziński, Maciej Obuchowski, Willy Lulciuc, Michael Robinson, Harel Shein, Ross Turk, Robin Fehr, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2023-07-21 09:57:51

*Thread Reply:* Welcome, @Jens Pfau!

😀 Jens Pfau
George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 08:36:38

hello everyone! I am trying to follow your guide https://openlineage.io/docs/integrations/spark/quickstart_local and when I execute spark.createDataFrame([ {'a': 1, 'b': 2}, {'a': 3, 'b': 4} ]).write.mode("overwrite").saveAsTable("temp1")

I'm not getting the expected result

openlineage.io
George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 08:37:55

23/07/23 12:35:20 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan == 'CreateTable `temp1`, Overwrite +- LogicalRDD [a#6L, b#7L], false

== Analyzed Logical Plan ==

CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Optimized Logical Plan == CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Physical Plan == Execute CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- **(1) Scan ExistingRDD[a#6L,b#7L] ] with input dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>] 23/07/23 12:35:20 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan == 'CreateTable temp1, Overwrite +- LogicalRDD [a#6L, b#7L], false

== Analyzed Logical Plan ==

CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Optimized Logical Plan == CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Physical Plan == Execute CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- **(1) Scan ExistingRDD[a#6L,b#7L] ] with output dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>] 23/07/23 12:35:20 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

23/07/23 12:35:20 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

23/07/23 12:35:20 ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105) at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:77) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:99) at java.base/java.util.Optional.ifPresent(Optional.java:183) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:99) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:90) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) Caused by: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187) at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:100) ... 21 more Caused by: io.openlineage.spark.shaded.org.apache.http.ProtocolException: Target host is not specified at io.openlineage.spark.shaded.org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(DefaultRoutePlanner.java:71) at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.determineRoute(InternalHttpClient.java:125) at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ... 
24 more 23/07/23 12:35:20 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter 23/07/23 12:35:20 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 23/07/23 12:35:20 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 23/07/23 12:35:20 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 23/07/23 12:35:20 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 23/07/23 12:35:20 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 23/07/23 12:35:20 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter 23/07/23 12:35:20 INFO CodeGenerator: Code generated in 120.989125 ms 23/07/23 12:35:21 INFO SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:0 23/07/23 12:35:21 INFO DAGScheduler: Got job 0 (saveAsTable at NativeMethodAccessorImpl.java:0) with 1 output partitions 23/07/23 12:35:21 INFO DAGScheduler: Final stage: ResultStage 0 (saveAsTable at NativeMethodAccessorImpl.java:0) 23/07/23 12:35:21 INFO DAGScheduler: Parents of final stage: List() 23/07/23 12:35:21 INFO DAGScheduler: Missing parents: List() 23/07/23 12:35:21 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan == 'CreateTable temp1, Overwrite +- LogicalRDD [a#6L, b#7L], false

== Analyzed Logical Plan ==

CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Optimized Logical Plan == CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Physical Plan == Execute CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- **(1) Scan ExistingRDD[a#6L,b#7L] ] with input dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>] 23/07/23 12:35:21 INFO OpenLineageRunEventBuilder: Visiting query plan Optional[== Parsed Logical Plan == 'CreateTable temp1, Overwrite +- LogicalRDD [a#6L, b#7L], false

== Analyzed Logical Plan ==

CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Optimized Logical Plan == CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

== Physical Plan == Execute CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- **(1) Scan ExistingRDD[a#6L,b#7L] ] with output dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>] 23/07/23 12:35:21 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

23/07/23 12:35:21 INFO CreateDataSourceTableAsSelectCommandVisitor: Matched io.openlineage.spark.agent.lifecycle.plan.CreateDataSourceTableAsSelectCommandVisitor<org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand,io.openlineage.client.OpenLineage$OutputDataset> to logical plan CreateDataSourceTableAsSelectCommand temp1, Overwrite, [a, b] +- LogicalRDD [a#6L, b#7L], false

23/07/23 12:35:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[10] at saveAsTable at NativeMethodAccessorImpl.java:0), which has no missing parents 23/07/23 12:35:21 ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105) at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:174) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$9(OpenLineageSparkListener.java:153) at java.base/java.util.Optional.ifPresent(Optional.java:183) at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:149) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) Caused by: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException at io.openlineage.spark.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:187) at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at io.openlineage.spark.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:100) ... 20 more Caused by: io.openlineage.spark.shaded.org.apache.http.ProtocolException: Target host is not specified at io.openlineage.spark.shaded.org.apache.http.impl.conn.DefaultRoutePlanner.determineRoute(```

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 08:38:46

23/07/23 12:35:20 ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.org.apache.http.client.ClientProtocolException at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:105) at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:34) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:71) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:77) at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:99)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-23 13:31:53

*Thread Reply:* That looks like your URL provided to OpenLineage is missing http:// or https:// in the front

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 14:54:55

*Thread Reply:* sorry, how can I resolve this? Do I need to add this? I just followed the guide step by step. You don't mention anywhere to add anything. You provide smth that

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 14:55:05

*Thread Reply:* really does not work out of the box

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 14:55:13

*Thread Reply:* and this is supposed to be a demo

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-07-23 17:07:49

*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2 seems to fix the issue

not sure why it stopped working for 0.12.0 but we’ll take a look and fix accordingly

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 04:51:34

*Thread Reply:* ...probably by bumping the version on this page 🙂

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 05:00:28

*Thread Reply:* thank you both for coming back to me , I bumped to 0.29 and i think that it now runs.Is this the expected output ? 23/07/24 08:43:55 INFO ConsoleTransport: {"eventTime":"2023_07_24T08:43:55.941Z","producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunEvent>","eventType":"COMPLETE","run":{"runId":"186c06c0_e79c_43cf_8bb7_08e1ab4c86a5","facets":{"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand","num-children":1,"table":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTable","identifier":{"product-class":"org.apache.spark.sql.catalyst.TableIdentifier","table":"temp2","database":"default"},"tableType":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTableType","name":"MANAGED"},"storage":{"product_class":"org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat","compressed":false,"properties":null},"schema":{"type":"struct","fields":[]},"provider":"parquet","partitionColumnNames":[],"owner":"","createTime":1690188235517,"lastAccessTime":-1,"createVersion":"","properties":null,"unsupportedFeatures":[],"tracksPartitionsInCatalog":false,"schemaPreservesCase":true,"ignoredProperties":null},"mode":null,"query":0,"outputColumnNames":"[a, b]"},{"class":"org.apache.spark.sql.execution.LogicalRDD","num_children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":12,"jvmId":"173725f4_02c4_4174_9d18_3a61aa311d62"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":13,"jvmId":"173725f4-02c4-4174-9d18-3a61aa311d62"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product_class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,"session":null}]},"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.1.2","openlineage_spark_version":"0.29.2"}}},"job":{"namespace":"default","name":"sample_spark.execute_create_data_source_table_as_select_command","facets":{}},"inputs":[],"outputs":[{"namespace":"file","name":"/home/jovyan/spark-warehouse/temp2","facets":{"dataSource":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>","name":"file","uri":"file"},"schema":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"symlinks":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1
-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>","identifiers":[{"namespace":"/home/jovyan/spark-warehouse","name":"default.temp2","type":"TABLE"}]},"lifecycleStateChange":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>","lifecycleStateChange":"CREATE"}},"outputFacets":{}}]} ? Also i then proceeded to run docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1 but the page is empty

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 11:11:08

*Thread Reply:* You'd need to set up spark.openlineage.transport.url to send OpenLineage events to Marquez

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:12:28

*Thread Reply:* where and how can I do this?

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:13:04

*Thread Reply:* do i need to edit the conf ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 11:37:09

*Thread Reply:* yes, in the spark conf

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:37:48

*Thread Reply:* what should this url be?

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:37:51

*Thread Reply:* http://localhost:3000/ ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 11:43:30

*Thread Reply:* That depends on how you ran Marquez, but looking at your screenshot the UI is at 3000, so I guess the API would be at 5000

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 11:43:46

*Thread Reply:* as that's default in Marquez docker-compose

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:44:14

*Thread Reply:* i cannot see spark conf

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-24 11:44:23

*Thread Reply:* is it in there or do i need to create it ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 16:42:53

*Thread Reply:* Is something like
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.29.2')
         .config('spark.openlineage.transport.url', 'http://marquez:5000')
         .config('spark.openlineage.transport.type', 'http')
         .getOrCreate())
```
not working?

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-25 05:08:08

*Thread Reply:* OK, when I use the snippet you provided and then execute docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

I can now see this

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-25 05:08:52

*Thread Reply:* but when i click on the job i then get this

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-25 05:09:05

*Thread Reply:* so i cannot see any details of the job

Sarwat Fatima (sarwatfatimam@gmail.com)
2023-09-05 05:54:50

*Thread Reply:* @George Polychronopoulos Hi, I am facing the same issue. After adding the spark conf and using the docker run command, marquez is still showing empty. Do I need to change something in the run command?

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 05:55:15

*Thread Reply:* yes i will tell you

Sarwat Fatima (sarwatfatimam@gmail.com)
2023-09-05 07:36:41

*Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the MARQUEZ_HOST, which I am not sure if I had to or not. The UI is running but not showing anything: docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:36:52

*Thread Reply:* it's because you are running this command, right

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:36:55

*Thread Reply:* yes that's it

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:36:58

*Thread Reply:* you need 0.40

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:03

*Thread Reply:* and there is a lot of stuff

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:07

*Thread Reply:* you need to change

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:10

*Thread Reply:* in the Docker

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:24

*Thread Reply:* so the spark

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:25

*Thread Reply:* version

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:37:27

*Thread Reply:* the python

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:05

*Thread Reply:* version: "3.10" services: notebook: image: jupyter/pyspark-notebook:spark-3.4.1 ports: - "8888:8888" volumes: - ./docker/notebooks:/home/jovyan/notebooks - ./build:/home/jovyan/openlineage links: - "api:marquez" depends_on: - api

Marquez as an OpenLineage Client

api: image: marquezproject/marquez containername: marquez-api ports: - "5000:5000" - "5001:5001" volumes: - ./docker/wait-for-it.sh:/usr/src/app/wait-for-it.sh links: - "db:postgres" dependson: - db entrypoint: [ "./wait-for-it.sh", "db:5432", "--", "./entrypoint.sh" ]

db: image: postgres:12.1 containername: marquez-db ports: - "5432:5432" environment: - POSTGRESUSER=postgres - POSTGRESPASSWORD=password - MARQUEZDB=marquez - MARQUEZUSER=marquez - MARQUEZPASSWORD=marquez volumes: - ./docker/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh # Enables SQL statement logging (see: https://www.postgresql.org/docs/12/runtime-config-logging.html#GUC-LOG-STATEMENT) # command: ["postgres", "-c", "log_statement=all"]

PostgreSQL Documentation
George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:10

*Thread Reply:* this is how mine looks

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:20

*Thread Reply:* it is all tested and the latest version

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:31

*Thread Reply:* postgres does not work beyond 12

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:56

*Thread Reply:* if you run this docker-compose up

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:38:58

*Thread Reply:* the notebooks

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:02

*Thread Reply:* are 10x faster

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:06

*Thread Reply:* and give no errors

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:14

*Thread Reply:* also you need to update other stuff

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:18

*Thread Reply:* such as

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:26

*Thread Reply:* don't run what is in the docs

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:39:34

*Thread Reply:* but run what is in GitHub

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:40:13
George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:40:22

*Thread Reply:* run in your notebooks what is in here

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:40:32

*Thread Reply:* ```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
         .appName('samplespark')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.1.0')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.openlineage.transport.url', 'http://{openlineage.client.host}/api/v1/namespaces/spark_integration/')
         .getOrCreate())
```

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:40:38

*Thread Reply:* they don't update the documentation

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-09-05 07:40:44

*Thread Reply:* it took me 4 weeks to get here

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-07-23 08:39:13

is this a known error ? does anyone know how to debug this ?

Steven (xli@zjuici.com)
2023-07-23 23:57:43

Hi, Using Marquez. I tried to get the dataset version through two apis. First: http://host/api/v1/namespaces/{namespace}/datasets/{dataset} It will include a currentVersion in the response. Then: http://host/api/v1/namespaces/{namespace}/datasets/{dataset}/versions/{currentVersion} But the version used here refers to the "version" column in table dataset_versions. Not the primary key "uuid". Which leads to 404 not found. I checked other apis but seemed that there are no other way to get the version through "currentVersion". Any help?

👀 Maciej Obuchowski, Willy Lulciuc
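For illustration, a minimal sketch of the two calls being described (using the requests library; host, namespace, and dataset names are placeholders):
```
import requests

base = "http://host/api/v1"

# First call: returns dataset metadata, including "currentVersion" (a UUID).
ds = requests.get(f"{base}/namespaces/my_namespace/datasets/my_dataset").json()
current_version = ds["currentVersion"]

# Second call: this endpoint resolves the "version" column of dataset_versions,
# not the "uuid" primary key that currentVersion points at - hence the 404.
resp = requests.get(
    f"{base}/namespaces/my_namespace/datasets/my_dataset/versions/{current_version}"
)
print(resp.status_code)  # 404
```
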
Steven (xli@zjuici.com)
2023-07-24 00:14:43

*Thread Reply:* Like I want to change the facets of a specific dataset.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-24 16:45:18

*Thread Reply:* @Willy Lulciuc do you have any idea? 🙂

Steven (xli@zjuici.com)
2023-07-25 05:02:47

*Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.

Willy Lulciuc (willy@datakin.com)
2023-07-25 06:20:58

*Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we’d like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version column wasn’t meant to be used externally, but internally within Marquez. The issue is “minor” as it’s more of a pointer thing. We’ll be addressing soon. For some background, you can look at:
• https://github.com/MarquezProject/marquez/issues/2071
• https://github.com/MarquezProject/marquez/pull/2153

Steven (xli@zjuici.com)
2023-07-25 05:06:48

Hi, Are there any keys to set in marquez.yaml to skip db initialization and use an existing db? I am deploying Marquez on a k8s cluster, which uses a cloud postgres. Every time I restart the marquez deployment I have to drop all those tables, otherwise it will raise a "table already exists" ERROR

Willy Lulciuc (willy@datakin.com)
2023-07-25 06:43:32

*Thread Reply:* @Steven ahh very good point, it’s technically not “error” in the true sense, but annoying nonetheless. I think you’re referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37

Willy Lulciuc (willy@datakin.com)
2023-07-25 06:45:24

*Thread Reply:* hmm, actually what raises the error you’re referencing? the Maruez http server?

Willy Lulciuc (willy@datakin.com)
2023-07-25 06:49:08

*Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR This shouldn’t be an error. I’m trying to understand the scenario in which this error is thrown (any info is helpful). We use flyway to manage our db schema, but you may have gotten in an odd state somehow

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-07-25 12:52:51

For Databricks notebooks, does the Spark listener work without any notebook changes? (I see that Azure Databricks -> purview needs no changes, but I’m not sure if that applies to anywhere….e.g. if I have an existing databricks notebook, and I add a spark listener, can I get column-level lineage? or do I need to change my notebook to use openlineage libraries, like I do with an arbitrary Python script?)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-07-31 03:35:58

*Thread Reply:* Nope, one should modify the cluster as per doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks> but no changes in notebook are required.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-08-02 10:59:00

*Thread Reply:* Right, great, that’s exactly what I was hoping 😄

Michael Robinson (michael.robinson@astronomer.io)
2023-07-25 15:24:17

@channel We released OpenLineage 0.30.1, including: Added • Flink: support Iceberg sinks #1960 @pawel-big-lebowski • Spark: column-level lineage for merge into on delta tables #1958 @pawel-big-lebowski • Spark: column-level lineage for merge into on Iceberg tables #1971 @pawel-big-lebowski • Spark: add support for Iceberg REST catalog #1963 @juancappi • Airflow: add possibility to force direct-execution based on environment variable #1934 @mobuchowski • SQL: add support for Apple Silicon to openlineage-sql-java #1981 @davidjgoss • Spec: add facet deletion #1975 @julienledem • Client: add a file transport #1891 @alexandre bergere Changed: • Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran • Python: rename config to config_class #1998 @mobuchowski Plus test improvements, docs changes, bug fixes and more. Thanks to all the contributors, including new contributors @davidjgoss, @alexandre bergere and @Juan Manuel Cappi! Release: https://github.com/OpenLineage/OpenLineage/releases/tag/0.30.1 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.29.2...0.30.1 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

👏 Julian Rossi, Bernat Gabor, Anirudh Shrinivason, Maciej Obuchowski, Jens Pfau, Sheeri Cabral (Collibra)
👍 Athitya Kumar, Sheeri Cabral (Collibra)
Codrut Stoicescu (codrut.stoicescu@gmail.com)
2023-07-27 11:53:09

Hello everyone! I’m part of a team trying to integrate OpenLineage and Marquez with multiple tools in our ecosystem. Integration with Spark and Iceberg was fairly easy with the listener you guys developed. We are now trying to integrate with Ray and we are having some trouble there. I was wondering if anybody has tried any work in that direction, so we can chat and exchange ideas. Thank you!

Michael Robinson (michael.robinson@astronomer.io)
2023-07-27 14:47:18

*Thread Reply:* This is the first I’ve heard of someone trying to do this, but others have tried getting lineage from pandas. There isn’t support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.

Codrut Stoicescu (codrut.stoicescu@gmail.com)
2023-07-28 02:10:14

*Thread Reply:* Thank you for your response. We have implemented the “manual way” of emitting events with python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-07-28 13:03:43

*Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor 🙂 I don't know how it works or is deployed, but I would recommend checking if there's a robust way of being notified in the runtime about processing occurring there.

Codrut Stoicescu (codrut.stoicescu@gmail.com)
2023-07-31 12:17:07

*Thread Reply:* Thank you for the tip. That’s the kind of details I’m looking for, but couldn’t find yet

Tereza Trojanová (tereza.trojanova@revolt.bi)
2023-07-28 09:20:34

Hi, does anyone have experience integrating OpenLineage and Marquez with Keboola? I am new to OpenLineage and struggling with the KBC component configuration.

Michael Robinson (michael.robinson@astronomer.io)
2023-07-28 10:53:35

*Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?

Martin Fiser (fisa@keboola.com)
2023-08-21 19:17:17

*Thread Reply:* Hi, apologies - vacation period has hit me. However here are the resources:

API endpoint: https://app.swaggerhub.com/apis-docs/keboola/job-queue-api/1.3.4#/Jobs/getJobOpenApiLineage
Dedicated component to push data into OpenLineage (Marquez instance): https://components.keboola.com/components/keboola.wr-openlineage

🙌 Michael Robinson
Damien Hawes (damien.hawes@booking.com)
2023-07-31 12:32:22

Hi folks. I'm looking to find the complete spec in openapi format. For example, if I want to find the complete spec of 1.0.5 , where would I find that? I've looked here: https://openlineage.io/apidocs/openapi/ however when I download the spec, things are missing, specifically the facets. This makes it difficult to generate clients / backend interfaces from the (limited) openapi spec.

Silvia Pina (silviampina@gmail.com)
2023-08-01 05:14:58

*Thread Reply:* +1, I could also really use this!

Silvia Pina (silviampina@gmail.com)
2023-08-01 05:27:34

*Thread Reply:* Found a way: you download it as json in the above link (“Download OpenAPI specification”), then if you copy paste it to editor.swagger.io it asks if you want to convert to yaml :)

Damien Hawes (damien.hawes@booking.com)
2023-08-01 10:25:49

*Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec beyond 1.0.1 - which means it's hard to know which revisions of the facets belong to which version of the spec.

Silvia Pina (silviampina@gmail.com)
2023-08-01 10:26:54

*Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case

Damien Hawes (damien.hawes@booking.com)
2023-08-01 10:30:57

*Thread Reply:* Granted. If the spec follows backwards compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you can not remove existing fields, you can not modify existing fields, etc.

🙌 Silvia Pina
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-01 12:15:22

*Thread Reply:* We don't have facets with newer version than 1.1.0

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-01 12:15:56

*Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-01 12:18:23

*Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case Is using https://github.com/OpenLineage/OpenLineage/tree/main/spec not enough? We have separate files with facets definition to be able to evolve them separetely from main spec

Damien Hawes (damien.hawes@booking.com)
2023-08-02 04:53:03

*Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to want to evolve the facets independently from the main spec, yet I keep running into a mental wall.

If I say, 'My application is compatible with OpenLineage 1.0.5' - what does that mean exactly? Does it mean that I am at least compatible with the base definition of RunEvent and its nested components, but not facets?

That's what I'm finding difficult to wrap my head around. Right now, I can not define (for my own sake and the sake of my org) what 'OpenLineage 1.0.5' means.

When I read the Marquez source code, I see that they state they implement 1.0.5, but again, it isn't clear what that completely entails.

I hope I am making sense.

👍 Silvia Pina
Damien Hawes (damien.hawes@booking.com)
2023-08-02 04:56:36

*Thread Reply:* If I approach this from a conventional software engineering standpoint, where I provide a library to my consumers. The library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or evolving the API of my objects it means something has changed and the new version is there to indicate that.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-02 04:56:53

*Thread Reply:* Yes - it means you can read and understand base spec. Facets are completely optional - reading them might provide you additional information, but you as a event consumer need to define what you do with them. Basically, the needs can be very different between consumers, spec should not define behavior of a consumer.

🙌 Silvia Pina
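To make that concrete, a rough sketch of a consumer that reads only the base spec and treats facets as optional (the event dict here is made up):
```
# Hypothetical RunEvent as a plain dict; only base-spec fields are assumed present.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2023-08-02T00:00:00Z",
    "run": {"runId": "...", "facets": {}},
    "job": {"namespace": "my_ns", "name": "my_job", "facets": {}},
    "outputs": [{"namespace": "s3://bucket", "name": "my_table", "facets": {}}],
}

# Base spec: always safe to read for a compatible consumer.
print(event["job"]["name"], event["eventType"])

# Facets: read them only if present; unknown facets can simply be ignored.
schema = event["outputs"][0]["facets"].get("schema")
if schema is not None:
    print([field["name"] for field in schema.get("fields", [])])
```
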
Damien Hawes (damien.hawes@booking.com)
2023-08-02 05:01:26

*Thread Reply:* OK. Thanks for the clarification. That clears things up for me.

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-07-31 16:42:48

This month’s issue of OpenLineage News was just sent out. Please subscribe to get it directly in your inbox each month!

👍 Ross Turk, Maciej Obuchowski, Shirley Lu
🎉 Harel Shein
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-01 12:35:22

Hello, I request OpenLineage release, especially for two things: • Snowflake/HTTP/Airflow bugfix: https://github.com/OpenLineage/OpenLineage/pull/2025 • Spec: removing refs from core: https://github.com/OpenLineage/OpenLineage/pull/1997 Three approvals from committers will authorize release. @Michael Robinson

➕ Jakub Dardziński, Harel Shein, Michael Robinson, George Polychronopoulos, Willy Lulciuc, Shirley Lu
Michael Robinson (michael.robinson@astronomer.io)
2023-08-01 13:26:30

*Thread Reply:* Thanks, @Maciej Obuchowski

Michael Robinson (michael.robinson@astronomer.io)
2023-08-01 15:43:00

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

Michael Robinson (michael.robinson@astronomer.io)
2023-08-01 16:42:32

@channel We released OpenLineage 1.0.0, featuring static lineage capability! Added: • Airflow: convert lineage from legacy File definition #2006 @Maciej Obuchowski Removed: • Spec: remove facet ref from core #1997 @JDarDagran Changed: • Airflow: change log level to DEBUG when extractor isn’t found #2012 @kaxil • Airflow: make sure we cannot fail in thread despite direct execution #2010 @Maciej Obuchowski Plus test improvements, docs changes, bug fixes and more. See prior releases for additional changes related to static lineage. Thanks to all the contributors, including new contributors @kaxil and @Mars Lan! Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.0.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/0.30.1...1.0.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

🙌 Julian LaNeve, Bernat Gabor, Maciej Obuchowski, Peter Hicks, Ross Turk, Harel Shein, Willy Lulciuc, Paweł Leszczyński, Peter Hicks
🥳 Julian LaNeve, alexandre bergere, Maciej Obuchowski, Peter Hicks, Juan Manuel Cappi, Ross Turk, Harel Shein, Paweł Leszczyński, Peter Hicks
🚀 alexandre bergere, Peter Hicks, Ross Turk, Harel Shein, Paweł Leszczyński, Peter Hicks
Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-02 08:51:57

hi folks! so happy to see that static lineage is making its way through OL. one question: is the OpenAPI spec up to date? https://openlineage.io/apidocs/openapi/ IIUC, proposal 1837 says that JobEvent and DatasetEvent can be emitted independently from RunEvents now, but it's not clear how this affected the spec.

I see the Python client https://pypi.org/project/openlineage-python/1.0.0/ includes these changes already, so I assume I can go ahead and use it already? (I'm also keeping tabs on https://github.com/MarquezProject/marquez/issues/2544)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-02 10:09:33

*Thread Reply:* I think the apidocs are not up to date 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-02 10:09:43

*Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-02 10:44:23

*Thread Reply:* thanks for the pointer @Maciej Obuchowski

Michael Robinson (michael.robinson@astronomer.io)
2023-08-02 10:49:17

*Thread Reply:* Also working on updating the apidocs

Michael Robinson (michael.robinson@astronomer.io)
2023-08-02 11:21:14

*Thread Reply:* The API docs are now up to date @Juan Luis Cano Rodríguez! Thank you for raising this issue.

🙌:skin_tone_3: Juan Luis Cano Rodríguez
Michael Robinson (michael.robinson@astronomer.io)
2023-08-02 12:58:15

@channel If you can, please join us in San Francisco for a meetup at Astronomer on August 30th at 5:30 PM PT. On the agenda: a presentation by special guest @John Lukenoff plus updates on the Airflow Provider, static lineage, and more. Food will be provided, and all are welcome. Please RSVP at https://www.meetup.com/meetup-group-bnfqymxe/events/295195280/ to let us know you’re coming.

Zahi Fail (zahi.fail@gmail.com)
2023-08-03 03:18:08

Hey, I hope this is the right channel for this kind of question - I’m running tests to integrate Airflow (2.4.3) with Marquez (OpenLineage 0.30.1). Currently, I’m testing the postgres operator, and for some reason queries like “Copy” and “Unload” are being sent as events but don’t appear in the graph. Any idea how to solve it?

You can see attached

  1. The graph of an airflow DAG with all the tasks beside the copy and unload.
  2. The graph with the unload task that isn’t connected to the other flow.
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-03 05:36:04

*Thread Reply:* I think our underlying SQL parser does not handle the Postgres versions of those queries

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-03 05:36:14

*Thread Reply:* Can you post the (anonymized?) queries?

👍 Maciej Obuchowski
Zahi Fail (zahi.fail@gmail.com)
2023-08-03 07:03:09

*Thread Reply:* for example

copy bi.marquez_test_2 from '******' iam_role '**********' delimiter as '^' gzi
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-07 13:35:30

*Thread Reply:* @Zahi Fail iam_role suggests you want redshift version of this supported, not Postgres one right?

Zahi Fail (zahi.fail@gmail.com)
2023-08-08 04:04:35

*Thread Reply:* @Maciej Obuchowski hey, actually I tried both Postgres and Redshift to S3 operators. Both of them sent a new event through OL to Marquez, and still wasn’t part of the entire flow.

Athitya Kumar (athityakumar@gmail.com)
2023-08-04 01:40:15

Hey team! 👋

We were exploring open-lineage and had a couple of questions:

  1. Does open-lineage support presto-sql?
  2. Do we have any docs/benchmarks on query coverage (inner joins, subqueries, etc) & source/sink coverage (spark.read from JDBC, Files etc) for spark-sql?
  3. Can someone point to the code where we currently parse the input/output facets from the spark integration (like sql queries / transformations) and if it's extendable?
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-04 02:17:19

*Thread Reply:* Hey @Athitya Kumar,

  1. For parsing SQL queries, we're using sqlparser-rs (https://github.com/sqlparser-rs/sqlparser-rs), which already has great coverage of SQL syntax and supports different dialects. It's an open source project and we have already contributed to it for the snowflake dialect.
  2. We don't have such a benchmark, but if you like, you could contribute and help us provide one. We do support joins, subqueries, iceberg and delta tables, jdbc for Spark and much more. Everything we do support is covered in our tests.
  3. Not sure if I got it properly. Marquez is our reference backend implementation which parses all the facets and stores them in a relational db in a relational manner (facets, jobs, datasets and runs in separate tables).
Athitya Kumar (athityakumar@gmail.com)
2023-08-04 02:29:53

*Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration; and how customising/improving them would look like

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-04 02:37:20

*Thread Reply:* sqlparser-rs is a rust libary and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets, column lineage information from SQL

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-04 02:40:02

*Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-04 04:08:53

*Thread Reply:* I think 3 question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?

Athitya Kumar (athityakumar@gmail.com)
2023-08-04 04:24:57

*Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-04 06:00:17

*Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan, and process its nodes. From the root node we can take the output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor

You can extend that, for example for additional types of nodes that we don't support by implementing your own QueryPlanVisitor, and then implementing OpenLineageEventHandlerFactory and packaging this into a .jar deployed alongside OpenLineage jar - this would be loaded by us using Java's ServiceLoader .

👍 Kiran Hiremath
Athitya Kumar (athityakumar@gmail.com)
2023-08-08 05:06:07

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs that's used internally by open-lineage: the SQL dialects supported by sqlparser-rs (listed here) don't include spark-sql / presto-sql dialects, which means they'd fall back to the generic dialect:

```
"--ansi" => Box::new(AnsiDialect {}),
"--bigquery" => Box::new(BigQueryDialect {}),
"--postgres" => Box::new(PostgreSqlDialect {}),
"--ms" => Box::new(MsSqlDialect {}),
"--mysql" => Box::new(MySqlDialect {}),
"--snowflake" => Box::new(SnowflakeDialect {}),
"--hive" => Box::new(HiveDialect {}),
"--redshift" => Box::new(RedshiftSqlDialect {}),
"--clickhouse" => Box::new(ClickHouseDialect {}),
"--duckdb" => Box::new(DuckDbDialect {}),
"--generic" | "" => Box::new(GenericDialect {}),
```
Any idea on how much coverage the generic dialect provides for spark-sql / how different they are etc?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:21:32

*Thread Reply:* spark-sql integration is based on spark LogicalPlan's tree. Extracting input/output datasets from tree nodes which is more detailed than sql parsing

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 07:04:52

*Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries

Athitya Kumar (athityakumar@gmail.com)
2023-08-08 11:19:53

*Thread Reply:* @Paweł Leszczyński - Got it, and would you be able to point me to where within the openlineage-spark integration do we:

  1. provide the Spark Logical Plan / query to sqlparser-rs
  2. get the output of sqlparser-rs (parsed query AST) & stitch back the inputs/outputs in the open-lineage events?
Athitya Kumar (athityakumar@gmail.com)
2023-08-08 12:09:06

*Thread Reply:* For example, we'd like to understand which dialect of sqlparser-rs would be used in which scenario by open-lineage, and what the interactions between open-lineage & sqlparser-rs are

Athitya Kumar (athityakumar@gmail.com)
2023-08-09 12:18:47

*Thread Reply:* @Paweł Leszczyński - Incase you missed the above messages ^

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-10 03:31:32

*Thread Reply:* Sqlparser-rs is used within the Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...), instead of SQL parsing, we rely on the logical plan of a job and extract information from it. For jdbc queries, which use sqlparser-rs, the dialect is extracted from the url: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java#L69

👍 Athitya Kumar
nivethika R (nivethikar8@gmail.com)
2023-08-06 07:16:53

Hi.. Is column lineage available for spark version 2.4.0?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-06 17:25:31

*Thread Reply:* No, it's not.

nivethika R (nivethikar8@gmail.com)
2023-08-06 23:53:17

*Thread Reply:* Is it only available for spark version 3+?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-07 04:53:41

*Thread Reply:* Yes

GitHubOpenLineageIssues (githubopenlineageissues@gmail.com)
2023-08-07 11:18:25

Hi, Will really appreciate if I can learn how the community has been able to harness the spark integration. In our testing, where a spark application writes to S3 multiple times (different locations), OL generates the same job name for all writes (namespace_name.execute_insert_into_hadoop_fs_relation_command), rendering the OL graph's final output less helpful. Say, for example, I have a series of transformations/writes 5 times; in the lineage graph we are just seeing the last one. There is an open bug and hopefully it will be resolved soon.

Curious how much is adoption of OL spark integration in presence of that bug, as generating same name for a job makes it less usable for anything other than trivial one output application.

Example from a 2-write application. EXPECTED: first produces the weather dataset and the subsequent produces weather40 (generated/mocked using 2 spark apps) (1st image). ACTUAL OL: weather40; see only the last one (2nd image).

Will really appreciate community guidance as in how successful they have been in utilizing spark integration (vanilla not Databricks) . Thank you

Expected. vs Actual.

GitHubOpenLineageIssues (githubopenlineageissues@gmail.com)
2023-08-07 11:21:04
Michael Robinson (michael.robinson@astronomer.io)
2023-08-07 11:30:00

@channel This month’s TSC meeting is this Thursday, August 10th at 10:00 a.m. PT. On the tentative agenda: • announcements • recent releases • Airflow provider progress update • OpenLineage 1.0 overview • open discussion • more (TBA) More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.

👍 Maciej Obuchowski, Athitya Kumar, Anirudh Shrinivason, Paweł Leszczyński
추호관 (hogan.chu@toss.im)
2023-08-08 04:39:45

I can’t see output when running saveAsTable with 100+ columns in Spark. Any help or ideas for this issue? Really thanks.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 04:59:23

*Thread Reply:* Does this work with similar jobs, but with small amount of columns?

추호관 (hogan.chu@toss.im)
2023-08-08 05:12:52

*Thread Reply:* thanks for the reply @Maciej Obuchowski yes it works for a small number of columns but does not work with a big number of columns

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 05:14:04

*Thread Reply:* one more question: how much data the jobs approximately process and how long does the execution take?

추호관 (hogan.chu@toss.im)
2023-08-08 05:14:54

*Thread Reply:* ah… it’s like 20 min ~ 30 min. Data size varies; it's like 20,000,000 rows with 100 ~ 1000 columns

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:15:17

*Thread Reply:* that's interesting. we could prepare integration test for that. 100 cols shouldn't make a difference

추호관 (hogan.chu@toss.im)
2023-08-08 05:15:37

*Thread Reply:* honestly sorry for typo its 1000 columns

추호관 (hogan.chu@toss.im)
2023-08-08 05:15:44

*Thread Reply:* pivoting features

추호관 (hogan.chu@toss.im)
2023-08-08 05:16:09

*Thread Reply:* i check it works good for small numbers of columns

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 05:16:39

*Thread Reply:* if it's 1000, then maybe we're over event size - event is too large and backend can't accept that

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 05:17:06

*Thread Reply:* maybe debug logs could tell us something

추호관 (hogan.chu@toss.im)
2023-08-08 05:19:27

*Thread Reply:* i’ll do spark.sparkContext.setLogLevel("DEBUG") ing

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:19:30

*Thread Reply:* are there any errors in the logs? perhaps pivoting contains nodes in SparkPlan that we don't support yet

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:19:52

*Thread Reply:* did you check pivoting that results in less columns?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 05:20:33

*Thread Reply:* @추호관 would also be good to disable logicalPlan facet: spark.openlineage.facets.disabled: [spark_unknown;spark.logicalPlan] in spark conf

추호관 (hogan.chu@toss.im)
2023-08-08 05:23:40

*Thread Reply:* got it can’t we do in python config .config("spark.dynamicAllocation.enabled", "true") \ .config("spark.dynamicAllocation.initialExecutors", "5") \ .config("spark.openlineage.facets.disabled", [spark_unknown;spark.logicalPlan]

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:24:31

*Thread Reply:* ```
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.initialExecutors", "5") \
.config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
```

추호관 (hogan.chu@toss.im)
2023-08-08 05:24:42

*Thread Reply:* ah.. string got it

추호관 (hogan.chu@toss.im)
2023-08-08 05:36:03

*Thread Reply:* ah… there are no errors nor debug-level issues; it successfully registered listener io.openlineage.spark.agent.OpenLineageSparkListener

추호관 (hogan.chu@toss.im)
2023-08-08 05:39:40

*Thread Reply:* maybe df.groupBy(some column).pivot(some_column).agg(**agg_cols) is not supported

추호관 (hogan.chu@toss.im)
2023-08-08 05:43:44

*Thread Reply:* oh.. interesting, with spark.openlineage.facets.disabled set, this option gives me output when eventType is START: “eventType”: “START”, “outputs”: [ … columns … ]

추호관 (hogan.chu@toss.im)
2023-08-08 05:54:13

*Thread Reply:* Yes, "spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]" <- this option gives output when eventType is START, but gives no output with bunches of columns when that config is not set

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:55:18

*Thread Reply:* this option prevents the logicalPlan from being serialized and sent as a part of the OpenLineage event, where it is included in one of the facets

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:56:12

*Thread Reply:* possibly, serializing logicalPlans, in case of pivots, leads to size of the events that are not acceptable

추호관 (hogan.chu@toss.im)
2023-08-08 05:57:56

*Thread Reply:* Ah… so you mean the pivot makes the serialized logical plan too large to generate the event, and disabling logicalPlan serialization makes it possible to generate the event because the logical plan made by pivot is not serialized

Can we overcome this

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:58:48

*Thread Reply:* we've seen such issues for some plans some time ago

🙌 추호관
추호관 (hogan.chu@toss.im)
2023-08-08 05:59:29

*Thread Reply:* oh…. how did you solve it?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 05:59:51

*Thread Reply:* by excluding some properties from plan to be serialized

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 06:01:14
추호관 (hogan.chu@toss.im)
2023-08-08 06:02:00

*Thread Reply:* AH…. excluded properties cause the pivoting parts of the logical plan to be ignored

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 06:08:25

*Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java

then you can try to debug logical plan trying to find out what should be excluded from it when it's being serialized. Even, if you find this difficult, a failing integration test is super helpful to let others help you in that.

추호관 (hogan.chu@toss.im)
2023-08-08 06:24:54

*Thread Reply:* okay i would look into and maybe pr thanks

추호관 (hogan.chu@toss.im)
2023-08-08 06:38:45

*Thread Reply:* Can I ask if there are any suspicious properties?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-08 06:39:25

*Thread Reply:* sure

👍 추호관
🙂 추호관
추호관 (hogan.chu@toss.im)
2023-08-08 07:10:40

*Thread Reply:* Thanks I would also try to find the property too

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-08 05:34:46

Hi guys, I've a generic sql-parsing doubt... what would be the recommended way (if any) to check for sql similarity? I understand that most sql parsers parse the query into an AST, but are there any well known ways to measure semantic similarities between 2 or more ASTs? Just curious lol... Any ideas appreciated! Thanks!

Guy Biecher (guy.biecher21@gmail.com)
2023-08-08 07:49:55

*Thread Reply:* Hi @Anirudh Shrinivason, I think I would take a look on this https://sqlglot.com/sqlglot/diff.html
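
A small sketch of that approach with sqlglot's tree diff (pip install sqlglot; the queries here are made up):
```
import sqlglot
from sqlglot.diff import diff

q1 = sqlglot.parse_one("SELECT a, b FROM t WHERE a > 1")
q2 = sqlglot.parse_one("SELECT a, b, c FROM t WHERE a > 1")

# diff() computes an edit script between the two ASTs (Keep/Insert/Remove/... nodes).
# The fewer Insert/Remove entries relative to Keep, the more similar the queries.
for edit in diff(q1, q2):
    print(edit)
```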

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-09 23:12:37

*Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one... eg: SELECT * FROM TABLE_1 and SELECT col1,col2,col3 FROM TABLE_1 could be the same semantic query, but sqlglot would give diffs in the ast because it's textual...

Guy Biecher (guy.biecher21@gmail.com)
2023-08-10 02:26:51

*Thread Reply:* I totally get you. In such cases, without the metadata of TABLE_1, it's impossible. What I would do: I would replace all * before you use the diff function.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-10 07:04:37

*Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... But yeah that's probably the approach I'd be taking haha... Happy to discuss and learn if there are better ways to doing this

Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 08:36:46

dear all, I have some novice questions. I put them in separate messages for clarity. 1st Question: I understand from the examples in the documentation that the main lineage events are RunEvent's, which can contain link to Run ID, Job ID, Dataset ID (I see they are RunEvent because they have EventType, correct?). However, the main openlineage json object contains also JobEvent and DatasetEvent. When are JobEvent and DatasetEvent supposed to be used in the workflow? Do you have relevant examples? thanks!

Harel Shein (harel.shein@gmail.com)
2023-08-08 09:53:05

*Thread Reply:* Hey @Luigi Scorzato! You can read about these 2 event types in this blog post: https://openlineage.io/blog/static-lineage

👍 Luigi Scorzato
Harel Shein (harel.shein@gmail.com)
2023-08-08 09:53:38

*Thread Reply:* we’ll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.

👍 Luigi Scorzato
Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 10:08:28

*Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?

Harel Shein (harel.shein@gmail.com)
2023-08-08 11:16:39

*Thread Reply:* you’re not wrong, these 2 events were not designed for runtime lineage, but rather “static” lineage that gets emitted after the fact

Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 08:46:39

2nd Question. I see that the input dataset appears in the RunEvent with EventType=START, the output dataset appears in the RunEvent with EventType=COMPLETE only, the RunEvent with EventType=RUNNING has no dataset attached. This makes sense for ETL jobs, but for streaming (e.g. Flink), the run could run very long and never terminate with a COMPLETE. On the other hand, emitting all the info about the output dataset in every RUNNING event would be far too verbose. What is the recommended set up in this case? TLDR: what is the recommended configuration of the frequency and data model of the lineage events for streaming systems like Flink?

Harel Shein (harel.shein@gmail.com)
2023-08-08 09:54:40

*Thread Reply:* great question! did you get a chance to look at the current Flink integration?

Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 10:07:06

*Thread Reply:* to be honest, I only quickly went through this and I did not identify what I needed. Can you please point me to the relevant section?

Harel Shein (harel.shein@gmail.com)
2023-08-08 11:13:17

*Thread Reply:* here’s an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json

Harel Shein (harel.shein@gmail.com)
2023-08-08 11:13:26

*Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json

Harel Shein (harel.shein@gmail.com)
2023-08-08 11:15:55

*Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[…]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java

👍 Luigi Scorzato
Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 17:46:17

*Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START admits output datasets. Correct? What determines how often the eventType=RUNNING events are emitted?

👍 Harel Shein
Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-09 03:25:16

*Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint

Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-08 08:59:40

3rd Question: I am looking for information about the time when the next run should start, in case of scheduled jobs. I see that the Run Facet has a Nominal Time Facet, but -- if I understand correctly -- it refers to the current run, so it is always emitted after the fact. Is the Nominal Start Time of the next run available somewhere? If not, where do you recommend to add it as a custom field? In principle, it belongs to the Job object, but would that maybe cause an undesirable fast change in the Job object?

Harel Shein (harel.shein@gmail.com)
2023-08-08 11:10:47

*Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100

For other orchestrators / schedulers, that would depend..

👍 Luigi Scorzato
Kiran Hiremath (kiran_hiremath@intuit.com)
2023-08-08 10:30:56

Hi Team, Question regarding the Databricks OpenLineage init script: is the path /mnt/driver-daemon/jars common to all the clusters, or is it unique to each cluster? https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380d[…]da49d0/integration/spark/databricks/open-lineage-init-script.sh

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-08 12:15:40

*Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-09 02:38:54

*Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.

Worth mentioning: init-scripts located in dbfs are becoming deprecated next month and we plan moving them into workspaces.

👍 Kiran Hiremath
Kiran Hiremath (kiran_hiremath@intuit.com)
2023-08-11 01:33:24

*Thread Reply:* yes, the init scripts are moved at workspace level.

GitHubOpenLineageIssues (githubopenlineageissues@gmail.com)
2023-08-08 14:19:40

Hi @Paweł Leszczyński Will really appreciate if you please let me know once this PR is good to go. Would love to test it in our environment: https://github.com/OpenLineage/OpenLineage/pull/2036. Thank you for all your help.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-09 02:35:28

*Thread Reply:* great to hear. I still need some time as there are few corner cases. For example: what should be the behaviour when alter table rename is called 😉 But sure, you can test it if you like. ci is failing on integration tests but ./gradlew clean build with unit tests are fine.

:gratitude_thank_you: GitHubOpenLineageIssues
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-10 03:33:50

*Thread Reply:* @GitHubOpenLineageIssues Feel invited to join today's community meeting and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising the backlog the right way.

Gaurav Singh (gaurav.singh@razorpay.com)
2023-08-09 07:54:33

Hi Team, I'm doing a POC with OpenLineage to extract column lineage from Spark. I'm using it in a Databricks notebook. I'm facing an issue when trying to get the column lineage in a join involving external tables on S3. The lineage being extracted returns the base path of the table, i.e. the S3 file path, and not the corresponding tables. Is there a way to extract/map columns of the output to the columns of the base tables instead of the storage location?

Gaurav Singh (gaurav.singh@razorpay.com)
2023-08-09 07:55:28

*Thread Reply:* Query:
```
INSERT INTO test.merchant_md (
  SELECT m.`id`, m.name, m.activated, m.parent_id, md.contact_name, md.contact_email
  FROM test.merchants_0 m
  LEFT JOIN merchant_details md ON m.id = md.merchant_id
  WHERE m.created_date > '2023-08-01'
)
```

Gaurav Singh (gaurav.singh@razorpay.com)
2023-08-09 08:01:56

*Thread Reply:* "columnLineage":{ "_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>", "_schemaURL":"<https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet>", "fields":{ "merchant_id":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchants", "field":"id" } ] }, "merchant_name":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchants", "field":"name" } ] }, "activated":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchants", "field":"activated" } ] }, "parent_id":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchants", "field":"parent_id" } ] }, "contact_name":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchant_details", "field":"contact_name" } ] }, "contact_email":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchant_details", "field":"contact_email" } ] } } }, "symlinks":{ "_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>", "_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>", "identifiers":[ { "namespace":"/warehouse/test.db", "name":"test.merchant_md", "type":"TABLE" }

Gaurav Singh (gaurav.singh@razorpay.com)
2023-08-09 08:23:57

*Thread Reply:* "contact_name":{ "inputFields":[ { "namespace":"<s3a://datalake>", "name":"/test/merchant_details", "field":"contact_name" } ] } This is returning mapping from the s3 location on which the table is created.

Zahi Fail (zahi.fail@gmail.com)
2023-08-09 10:56:27

Hey, I’m running Spark application (spark version 3.4) with OL integration. I changed spark to use “debug” level, and I see the OL events with the below message: “Emitting lineage completed successfully:”

With all the above, I can’t see the event in Marquez.

Attaching the OL configurations. When changing the OL-spark version to 0.6.+, I do see event created in Marquez with only “Start” status (attached below).

Does the OL-spark version need to match the Spark version? Are there any known issues with the Spark / OL versions?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-09 11:23:42

*Thread Reply:* > OL-spark version to 0.6.+ This OL version is ancient. You can try with 1.0.0

I think you're hitting this issue which duplicates jobs: https://github.com/OpenLineage/OpenLineage/issues/1943

Zahi Fail (zahi.fail@gmail.com)
2023-08-10 01:46:08

*Thread Reply:* I haven’t mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ … None of them worked for me. @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 05:25:49

*Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 05:26:11

*Thread Reply:* If you can, it might be better to create issue at github and communicate there.

Zahi Fail (zahi.fail@gmail.com)
2023-08-10 08:34:01

*Thread Reply:* Before creating an issue on GitHub, I wanted to check if my issue is only related to version compatibility…

This is the sample of my test:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.0.0') \
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') \
    .config('spark.openlineage.host', 'http://localhost:9000') \
    .config('spark.openlineage.namespace', 'default') \
    .getOrCreate()

spark.sparkContext.setLogLevel("DEBUG")

csv_file = "location.csv"

df = spark.read.format("csv").option("header", "true").option("sep", "^").load(csv_file)

df = df.select("campaignid", "revenue").groupby("campaignid").sum("revenue").show()
```
Part of the logs with the OL configurations and the processed event

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 08:40:13

*Thread Reply:* try spark.openlineage.transport.url instead of spark.openlineage.host

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 08:40:27

*Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host 🙂

Zahi Fail (zahi.fail@gmail.com)
2023-08-10 08:59:27

*Thread Reply:* https://openlineage.io/blog/openlineage-spark/

👍 Maciej Obuchowski
Zahi Fail (zahi.fail@gmail.com)
2023-08-10 09:04:56

*Thread Reply:* changing to “spark.openlineage.transport.url” didn’t make any change

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 09:09:42

*Thread Reply:* do you see the ConsoleTransport log? it suggests Spark integration did not register that you want to send events to Marquez

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 09:10:09

*Thread Reply:* let's try adding spark.openlineage.transport.type to http

Zahi Fail (zahi.fail@gmail.com)
2023-08-10 09:14:50

*Thread Reply:* Now it works !

Zahi Fail (zahi.fail@gmail.com)
2023-08-10 09:14:58

*Thread Reply:* thanks @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 09:23:04

*Thread Reply:* Cool 🙂 however it should not require it if you provide spark.openlineage.transport.url - I'll create issue for debugging that.
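
For completeness, a sketch of the combination that ended up working in this thread (the snippet above with the two renamed transport keys):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.0.0') \
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') \
    .config('spark.openlineage.transport.type', 'http') \
    .config('spark.openlineage.transport.url', 'http://localhost:9000') \
    .config('spark.openlineage.namespace', 'default') \
    .getOrCreate()
```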

Michael Robinson (michael.robinson@astronomer.io)
2023-08-09 14:37:24

@channel This month’s TSC meeting is tomorrow! All are welcome. https://openlineage.slack.com/archives/C01CK9T7HKR/p1691422200847979

Athitya Kumar (athityakumar@gmail.com)
2023-08-10 02:11:07

While using the spark integration, we're unable to see the query in the job facet for any spark-submit - is this a known issue/limitation, and can someone point to the code where this is currently extracted / can be enhanced?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-10 02:55:46

*Thread Reply:* Let me first rephrase my understanding of the question: assume a user runs spark.sql('INSERT INTO ...'). Are we able to include the SQL query INSERT INTO ... within the SQL facet?

We once had a look at it and found it difficult. Given an SQL, spark immediately translates it to a logical plan (which our integration is based on) and we didn't find any place where we could inject our code and get access to sql being run.

Athitya Kumar (athityakumar@gmail.com)
2023-08-10 04:27:51

*Thread Reply:* Got it. So for spark.sql() - there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?

```
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("user", user)
  .option("password", password)
  .option("fetchsize", fetchsize)
  .option("driver", driver)
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 05:15:17

*Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining results of two queries - possible to separate systems - and then one SqlJobFacet would be misleading. This needs more thorough spec discussion

Luigi Scorzato (luigi.scorzato@gmail.com)
2023-08-10 05:33:47

Hi Team, has anyone experience with integrating OpenLineage with the SAP ecosystem? And with Salesforce/MuleSoft?

Steven (xli@zjuici.com)
2023-08-10 05:40:47

Hi, Are there any ways to save a list of strings directly in the dataset facets? Such as the myfacets field in this dict:
```
"facets": {
  "metadata_facet": {
    "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.29.2/client/python",
    "_schemaURL": "https://sth/schemas/facets.json#/definitions/SomeFacet",
    "myfacets": ["a", "b", "c"]
  }
}
```

Steven (xli@zjuici.com)
2023-08-10 05:42:20

*Thread Reply:* I'm using python OpenLineage package and extend the BaseFacet class

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 05:53:57

*Thread Reply:* for custom facets, as long as it's valid json - go for it

Steven (xli@zjuici.com)
2023-08-10 05:55:03

*Thread Reply:* However, I tried to insert a list of strings, and when I tried to get the dataset, the returned value of that list field is empty.

Steven (xli@zjuici.com)
2023-08-10 05:55:57

*Thread Reply:* ```
@attr.s
class MyFacet(BaseFacet):
    columns: list[str] = attr.ib()
```
Here's my python code.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 05:59:02

*Thread Reply:* How did you emit, serialized the event, and where did you look when you said you tried to get the dataset?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 06:00:27

*Thread Reply:* I assume the problem is somewhere there, not on the level of facet definition, since SchemaDatasetFacet looks pretty much the same and it works

Steven (xli@zjuici.com)
2023-08-10 06:00:54

*Thread Reply:* I use the python openlineage client to emit the RunEvent.
```
openlineage_client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now().isoformat(),
        run=run,
        job=job,
        producer=PRODUCER,
        outputs=outputs,
    )
)
```
And use Marquez to visualize the returned data

Steven (xli@zjuici.com)
2023-08-10 06:02:12

*Thread Reply:* Yah, list of objects is working, but list of string is not.😩

Steven (xli@zjuici.com)
2023-08-10 06:03:23

*Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()

Steven (xli@zjuici.com)
2023-08-10 06:05:56

*Thread Reply:*

Steven (xli@zjuici.com)
2023-08-10 06:19:34

*Thread Reply:* I think the code here filters out those string values in the list

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 06:21:39

*Thread Reply:* 👀

Steven (xli@zjuici.com)
2023-08-10 06:24:48

*Thread Reply:* Yah, the values in the list will end up False in this isinstance(x, dict) check and be filtered out

😳
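
To make the suspected bug concrete, here's a hedged sketch of the filtering pattern under discussion and one possible fix - the shape of the code, not the literal contents of serde.py (remove_nulls is a hypothetical helper name):

```python
# suspected shape of the bug: list items that aren't dicts get dropped
cleaned = [remove_nulls(x) for x in value if isinstance(x, dict)]

# possible fix: recurse into dicts, but keep scalars such as strings as-is
cleaned = [remove_nulls(x) if isinstance(x, dict) else x for x in value]
```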

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 06:26:33

*Thread Reply:* wow, that's right 😬

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-10 06:26:47

*Thread Reply:* want to create PR fixing that?

Steven (xli@zjuici.com)
2023-08-10 06:27:20

*Thread Reply:* Sure! I may do this later today or tomorrow.

👍 Maciej Obuchowski, Paweł Leszczyński
Steven (xli@zjuici.com)
2023-08-10 23:59:28

*Thread Reply:* I created the PR at https://github.com/OpenLineage/OpenLineage/pull/2044 but the CI job integration-test-integration-spark failed

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 04:17:01

*Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on forked versions of the CI. It will work once I push it to origin. Anyway, the Spark tests failing aren't a blocker for this Python PR

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 04:17:45

*Thread Reply:* I'd only ask you to add some tests for that case, with facets containing a list of strings

Steven (xli@zjuici.com)
2023-08-11 04:18:21

*Thread Reply:* Yeah sure, I will add them now
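
A regression test for this could look something like the following minimal sketch, assuming Serde.to_json() is the serialization entry point (the facet and test names here are made up, not the ones that landed in the PR):

```python
import json

import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.serde import Serde


@attr.s
class ListFacet(BaseFacet):
    columns: list = attr.ib()


def test_serde_keeps_list_of_strings():
    facet = ListFacet(columns=["a", "b", "c"])
    serialized = json.loads(Serde.to_json(facet))
    # before the fix, the list came back empty
    assert serialized["columns"] == ["a", "b", "c"]
```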

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 04:25:19

*Thread Reply:* ah, we had another CI problem, the Go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway 🙂

Steven (xli@zjuici.com)
2023-08-11 04:36:57

*Thread Reply:* LOL🤣 I've added some tests and made a force push

savan (SavanSharan_Navalgi@intuit.com)
2023-10-20 08:31:45

*Thread Reply:* @GitHubOpenLineageIssues I am trying to contribute to the integration tests listed here as a good first issue. The CONTRIBUTING.md mentions that I can trigger CI for integration tests from a forked branch using this tool, but I am unable to do so. Is there a way to trigger CI from a forked branch, or do I have to get permission from someone to run the CI?

I am getting this error when I run this command: sudo git-push-fork-to-upstream-branch upstream savannavalgi:hacktober
> Username for 'https://github.com': savannavalgi
> Password for 'https://savannavalgi@github.com':
> remote: Permission to OpenLineage/OpenLineage.git denied to savannavalgi.
> fatal: unable to access 'https://github.com/OpenLineage/OpenLineage.git/': The requested URL returned error: 403
I have tried to configure an SSH key, also tried to trigger CI from another branch, and tried all of this after fetching the latest upstream.

cc: @Athitya Kumar @Maciej Obuchowski @Steven

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-23 04:57:44

*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch for you

savan (SavanSharan_Navalgi@intuit.com)
2023-10-25 01:08:41

*Thread Reply:* @Paweł Leszczyński thanks for approving my PR - ( link )

I will make the changes needed for the new integration test case for drop table (good first issue) in another PR. I would need your help to run the integration tests again, thank you

savan (SavanSharan_Navalgi@intuit.com)
2023-10-26 07:48:52

*Thread Reply:* @Paweł Leszczyński opened a PR ( link ) for the integration test for drop table. Can you please help run the integration test?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-26 07:50:29

*Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so they will not work automatically from the fork and require action on our side. Working on that

savan (SavanSharan_Navalgi@intuit.com)
2023-10-29 09:31:22

*Thread Reply:* thanks @Paweł Leszczyński let me know if i can help in any way

savan (SavanSharan_Navalgi@intuit.com)
2023-11-15 02:31:50

*Thread Reply:* @Paweł Leszczyński any action item on my side?

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 02:57:20

*Thread Reply:* @Paweł Leszczyński can you please take a look at this ? 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:05:05

*Thread Reply:* Hi @savan, were you able to run the integration tests locally on your side? It seems the generated OL event is missing the schema facet:
"outputs": [{
  "namespace": "file",
  "name": "/tmp/drop_test/drop_table_test",
  "facets": {
    "dataSource": {
      "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet",
      "name": "file",
      "uri": "file"
    },
    "symlinks": {
      "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
      "identifiers": [{
        "namespace": "/tmp/drop_test",
        "name": "default.drop_table_test",
        "type": "TABLE"
      }]
    },
    "lifecycleStateChange": {
      "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet",
      "lifecycleStateChange": "DROP"
    }
  },
  "outputFacets": {}
}]
This shouldn't be such a big problem, I believe. The event intends to notify that the table was dropped, which is still OK without a schema.

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:06:40

*Thread Reply:* @Paweł Leszczyński I am unable to run the integration tests locally; as you mentioned, they require S3/BigQuery secret keys and won't work from a forked branch

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:07:06

*Thread Reply:* you can run just the particular test you modified; you don't need to run all of them

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:07:55

*Thread Reply:* can you please share any doc which will help me do that? I did go through the README doc; I was stuck at
> you don't have permission to perform this action

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:08:18

*Thread Reply:* ./gradlew :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:08:33

*Thread Reply:* let me try thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:08:35

*Thread Reply:* this should run the thing you modify

savan (SavanSharan_Navalgi@intuit.com)
2024-02-13 05:27:36

*Thread Reply:* I am getting this error while building the project. I've tried a lot of things; any pointers or leads will be helpful. I am using an Apple M1 Max chip computer. Thanks.
> ------ Running smoke test ------
> Exception in thread "main" java.lang.UnsatisfiedLinkError: /private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib: dlopen(/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib, 0x0001): tried: '/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (no such file), '/private/var/folders/zz/zyxvpxvq6csfxvnn0000000000000/T/native-lib4719585175993676348/libopenlineage_sql_java.dylib' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))
> at java.base/java.lang.ClassLoader$NativeLibrary.load0(Native Method)

savan (SavanSharan_Navalgi@intuit.com)
2024-02-13 05:51:45

*Thread Reply:* the build passes without the smoke tests, but the command you gave throws the below error

(base) snavalgi@macos-PD7LVVY6MQ spark % ./gradlew -q :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable

FAILURE: Build failed with an exception.

* Where: Build file '/Users/snavalgi/Documents/GitHub/OpenLineage/integration/spark/app/build.gradle' line: 256

* What went wrong: A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark2'.
   > Could not resolve io.openlineage:openlineage-java:1.9.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/1.9.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
   > Could not resolve io.openlineage:openlineage-sql-java:1.9.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-sql-java/1.9.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

* Try:

Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights. Get more help at https://help.gradle.org.

BUILD FAILED in 10s

savan (SavanSharan_Navalgi@intuit.com)
2024-02-14 00:07:56

*Thread Reply:* updated with the correct error message

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-14 05:30:22

*Thread Reply:* @savan you need to build openlineage-java and openlineage-sql-java libraries as described here: https://github.com/OpenLineage/OpenLineage/blob/73b4a3bcd84239e7baedd22b5294624623d6f3ad/integration/spark/README.md#preparation

savan (SavanSharan_Navalgi@intuit.com)
2024-02-14 06:52:31

*Thread Reply:* @Maciej Obuchowski thanks for the response. The issue was with the Java 8 architecture I had installed.

I am able to compile, build, and run the integration test now, with Java 11 (of the appropriate arch)

savan (SavanSharan_Navalgi@intuit.com)
2024-02-14 08:12:02

*Thread Reply:* I was able to run some (create table) integration tests successfully, but now the marquez-api container is repeatedly crashing. Any pointers?

marquez-api  | [Too many errors, abort]
marquez-api  | qemu: uncaught target signal 6 (Aborted) - core dumped
marquez-api  | /usr/src/app/entrypoint.sh: line 19:    44 Aborted    java ${JAVA_OPTS} -jar marquez-*.jar server ${MARQUEZ_CONFIG}
marquez-api exited with code 134

savan (SavanSharan_Navalgi@intuit.com)
2024-02-14 08:15:37

*Thread Reply:* the marquez-api docker image has this warning

AMD64, image may have poor performance or fail, if run via emulation

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-14 08:32:54

*Thread Reply:* @Willy Lulciuc I think publishing arm64 image of Marquez would be a good idea

Willy Lulciuc (willy@datakin.com)
2024-02-14 13:36:46

*Thread Reply:* Yeah, supporting multi-architecture docker builds makes sense. Here's an article outlining an approach: https://www.padok.fr/en/blog/multi-architectures-docker-iot#architectures. @Maciej Obuchowski is that what you're suggesting here?

savan (SavanSharan_Navalgi@intuit.com)
2024-02-25 02:53:54

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński I have verified the integration test for dropTestTable on my local machine and it is working fine. ✅✅✅✅✅ Can you please trigger the CI for this PR and expedite the review and merge process? https://github.com/OpenLineage/OpenLineage/pull/2214

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 03:33:48

*Thread Reply:* the test is still failing in CI -> https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9488/workflows/f669d751-aa18-4735-a51f-7d647415fee8/jobs/181187

io.openlineage.spark.agent.SparkContainerIntegrationTest testDropTable() FAILED (31.2s)

savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 04:00:06

*Thread Reply:* On my local machine I see that the test is passing. Let me update the branch and test again.

savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 05:18:06

*Thread Reply:* I have made a minor change. Can you please trigger the CI again, @Paweł Leszczyński?

👀 Paweł Leszczyński
savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 05:27:14

*Thread Reply:* The test is again passing on my local machine with the latest code, but I notice the below error in the previous CI failure.

The previous CI build was failing because the actual START event for drop table in the CI had empty input and output:
"eventType": "START",
"inputs": [],
"outputs": []
but on my local machine, the START event for drop table has the output populated as below:
{
  "eventType": "START",
  "job": {
    "namespace": "testDropTable"
  },
  "inputs": [],
  "outputs": [
    {
      "namespace": "file",
      "name": "/tmp/drop_test/drop_table_test",
      "facets": {
        "dataSource": {
          "name": "file",
          "uri": "file"
        },
        "symlinks": {
          "identifiers": [
            {
              "namespace": "file:/tmp/drop_test",
              "name": "default.drop_table_test",
              "type": "TABLE"
            }
          ]
        },
        "lifecycleStateChange": {
          "lifecycleStateChange": "DROP"
        }
      }
    }
  ]
}

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 05:29:47

*Thread Reply:* Please note that CI runs tests against several Spark versions. This can be configured with -Pspark.version=3.4.2. It's possible that your test is passing for some versions while still failing for others.

savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 05:33:05

*Thread Reply:* if CI is verifying against many Spark versions, does that mean some Spark versions have an empty output: [] and some have a populated output: [] for the same START event of a drop table?

if so, then how do we specify different START events for those versions of Spark, respectively? is that possible?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 05:42:24

*Thread Reply:* For the COMPLETE event, the assertion with empty inputs and outputs verifies only that a COMPLETE event was emitted. It would make sense for START to verify that it contains information about the deleted dataset. If that is missing for a single Spark version, we should first try to understand why this is happening and whether there is any workaround for it.

savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 05:43:43

*Thread Reply:* yes, makes sense. Can you please approve the CI run for the integration tests again?

I really want to check if this build passes.

savan (SavanSharan_Navalgi@intuit.com)
2024-02-26 05:57:51

*Thread Reply:* and for the Spark versions for which we are getting an empty output: [] in the START event for drop table, should I open a new ticket on OpenLineage and report the issue?

savan (SavanSharan_Navalgi@intuit.com)
2024-02-28 06:02:00

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski can you please approve this CI to run integration tests? https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9497/workflows/4a20dc95-d5d1-4ad7-967c-edb6e2538820

👍 Paweł Leszczyński
savan (SavanSharan_Navalgi@intuit.com)
2024-02-29 01:13:11

*Thread Reply:* @Paweł Leszczyński only 2 Spark versions are sending empty input and output for both the START and COMPLETE events:

• 3.4.2
• 3.5.0
I can look into the above if you guide me a bit on how to. Should I open a new ticket for it? Please suggest how to proceed.

savan (SavanSharan_Navalgi@intuit.com)
2024-03-01 04:01:45

*Thread Reply:* this integration test case led to the finding of the above bug for Spark 3.4.2 and 3.5.0. Will that be a blocker to merging this test case? @Paweł Leszczyński @Maciej Obuchowski

savan (SavanSharan_Navalgi@intuit.com)
2024-03-06 09:01:44

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski any direction on the above blocker will be helpful.

Athitya Kumar (athityakumar@gmail.com)
2023-08-11 07:36:57

Hey folks! 👋

Had a query/observation regarding the columnLineage inferred in the Spark integration - opened this issue for the same. Basically, when we do something like this in our spark-sql:
SELECT t1.c1, t1.c2, t1.c3, t2.c4
FROM t1
LEFT JOIN t2
ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
The expected column lineage for output table t3 is:
t3.c1 -> Comes from both t1.c1 & t2.c1 (SELECT + JOIN clause)
t3.c2 -> Comes from both t1.c2 & t2.c2 (SELECT + JOIN clause)
t3.c3 -> Comes from t1.c3
t3.c4 -> Comes from t2.c4
However, the actual column lineage for output table t3 is:
t3.c1 -> Comes from t1.c1 (Only based on SELECT clause)
t3.c2 -> Comes from t1.c2 (Only based on SELECT clause)
t3.c3 -> Comes from t1.c3
t3.c4 -> Comes from t2.c4
Is this a known issue/behaviour?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 09:18:44

*Thread Reply:* Hmm... this is kind of a "logical" difference - is column level lineage taken from the actual "physical" operations - like in this case, we always take from t1 - or from the "logical" view, where t2 is used only for a predicate, yet we still want to indicate it as a source?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 09:18:58

*Thread Reply:* I think your interpretation is more useful

🙏 Athitya Kumar
Athitya Kumar (athityakumar@gmail.com)
2023-08-11 09:25:03

*Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo

But is there any technical limitation if we wanna implement this / make an OSS contribution for this (like logical predicate columns not being part of the spark logical plan object that we get in the PlanVisitor or something like that)?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 11:14:58

*Thread Reply:* It's probably a bit of work, but I don't think it's impossible on the parser side - @Paweł Leszczyński will know better about the Spark collection

Ernie Ostic (ernie.ostic@getmanta.com)
2023-08-11 12:45:34

*Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-11 12:49:10

*Thread Reply:* Something like additional flag in inputFields, right?

👍 Athitya Kumar, Ernie Ostic, Paweł Leszczyński
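
To make the idea concrete, a hypothetical sketch of what an extended inputFields entry in the columnLineage facet could look like - the transformationType field is invented here for illustration and is not part of the spec:

```python
# hypothetical shape - "transformationType" does not exist in the spec (yet)
column_lineage_fields = {
    "c1": {
        "inputFields": [
            {"namespace": "ns", "name": "t1", "field": "c1"},
            # lineage coming only from the JOIN predicate, flagged as indirect
            {"namespace": "ns", "name": "t2", "field": "c1",
             "transformationType": "INDIRECT"},
        ]
    }
}
```
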
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-14 02:36:34

*Thread Reply:* Yes, this would require some extension to the spec. What do you mean by spark-sql: spark.sql() with some Spark query, or SQL in Spark JDBC?

Athitya Kumar (athityakumar@gmail.com)
2023-08-15 15:16:49

*Thread Reply:* Sorry, missed your question @Paweł Leszczyński. By spark-sql, I'm referring to the former: spark.sql() with some spark query

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 03:10:57

*Thread Reply:* cc @Jens Pfau - you may also be interested in extending the column level lineage facet.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-22 02:23:08

*Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!

Athitya Kumar (athityakumar@gmail.com)
2023-08-22 08:03:49

*Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?

:gratitude_thank_you: Anirudh Shrinivason
✅ Anirudh Shrinivason
Athitya Kumar (athityakumar@gmail.com)
2023-08-14 05:13:48

Hey team 👋

Is there a way we can feed the logical plan directly to check the open-lineage events being built, without actually running a spark-job with open-lineage configs? Basically interested to see if we can mock a dry-run of a spark job w/ open-lineage by mimicking the logical plan 😄

cc @Shubh

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-14 06:00:21

*Thread Reply:* Not really I think - the integration does not rely purely on the logical plan

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-14 06:00:44

*Thread Reply:* At least, not in all cases. For some maybe

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-14 07:34:39

*Thread Reply:* We're using a pretty similar approach in our column level lineage tests: we run some Spark commands and register a custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Then we run our tests on the captured logical plan.

The difference, compared to what you're asking about, is that we still have access to the same Spark session.

In many cases, our integration uses the active Spark session to fetch some dataset details. This happens pretty often (e.g. fetching the dataset location) and cannot be taken just from a logical plan.

Athitya Kumar (athityakumar@gmail.com)
2023-08-14 11:03:28

*Thread Reply:* @Paweł Leszczyński - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?

For example, I know that we can statically create logical plans

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 03:05:44

*Thread Reply:* The more we talk the more I am wondering what is the purpose of doing so? Do you want to test openlineage coverage or is there any production scenario where you would like to apply this?

Athitya Kumar (athityakumar@gmail.com)
2023-08-16 04:01:39

*Thread Reply:* @Paweł Leszczyński - This is for testing OpenLineage coverage, so that we can be more confident about which scenarios are happy paths and in which scenarios it may not work or only work partially, etc.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 04:22:01

*Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when the OpenLineage integration tries to access them. If you want to reuse LogicalPlans from your prod environment, you will encounter logical plan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be achieved more easily the way the integration tests are run, with MockServer.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-14 09:45:31

Hi Team,

Spark & Databricks related question: starting 1st September, Databricks is going to block running init scripts located in DBFS, and this is the way our integration works (https://www.databricks.com/blog/securing-databricks-cluster-init-scripts).

We have two ways of mitigating this in our docs and quickstart: (1) move init scripts to the workspace, (2) move init scripts to S3.

Neither of them is perfect. (1) requires creating the init script file manually through the Databricks UI and copy/pasting its content; I couldn't find a way to load it programmatically. (2) requires the quickstart user to have S3 bucket access.

Would love to hear your opinion on this. Perhaps there's some better way to do that. Thanks.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-15 01:13:49

*Thread Reply:* We're uploading the init scripts to S3 via Terraform. But yeah, I guess there are some access permissions that the user needs to have

:gratitude_thank_you: Paweł Leszczyński
Abdallah (abdallah@terrab.me)
2023-08-16 07:32:00

*Thread Reply:* Hello, I am new here and I am asking why you need an init script. If it's a Spark integration, can't we just specify --packages io.openlineage...?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 07:41:25

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having the openlineage jar installed immediately on the classpath, because it's required when OpenLineageSparkListener is instantiated. It didn't work without it.

Abdallah (abdallah@terrab.me)
2023-08-16 07:43:55

*Thread Reply:* Yes, it happens if you use the --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue with Databricks support.) But if you use --packages io.openlineage... (the package will be downloaded from Maven), it works fine.

👀 Paweł Leszczyński
Abdallah (abdallah@terrab.me)
2023-08-16 07:47:50

*Thread Reply:* I think they don't use the right class loader.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 08:36:14

*Thread Reply:* To make sure: are you able to run OpenLineage & Spark on the Databricks Runtime without init scripts?

I was doing this a second ago and it ended up with:
Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1609ed55

Alexandre Campelo (aleqi200@gmail.com)
2023-08-14 19:49:00

Hello, I just downloaded Marquez and I'm trying to send a sample request but I'm getting a 403 (forbidden). Any idea how to find the authentication details?

Alexandre Campelo (aleqi200@gmail.com)
2023-08-15 12:19:34

*Thread Reply:* Ok, never mind, I figured it out. Port 5000 is reserved on macOS, so I had to start on port 9000 instead.

👍 Maciej Obuchowski
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-15 01:25:48

Hi, I noticed that while capturing lineage for MERGE INTO commands, some of the tables/columns are unaccounted for in the lineage. Example:
```python
f_dummy_funnel_stg = spark.sql(f"""WITH dummy_funnel AS (
        SELECT *
        FROM f_dummy_funnel_one
        WHERE date_id BETWEEN {start_date_id} AND {end_date_id}

        UNION ALL

        SELECT *
        FROM f_dummy_funnel_two
        WHERE date_id BETWEEN {start_date_id} AND {end_date_id}

        UNION ALL

        SELECT *
        FROM f_dummy_funnel_three
        WHERE date_id BETWEEN {start_date_id} AND {end_date_id}

        UNION ALL

        SELECT *
        FROM f_dummy_funnel_four
        WHERE date_id BETWEEN {start_date_id} AND {end_date_id}

        UNION ALL

        SELECT *
        FROM f_dummy_funnel_five
        WHERE date_id BETWEEN {start_date_id} AND {end_date_id}
    )
    SELECT DISTINCT
        dummy_funnel.customer_id,
        dummy_funnel.product,
        dummy_funnel.date_id,
        dummy_funnel.country_id,
        dummy_funnel.city_id,
        dummy_funnel.dummy_type_id,
        dummy_funnel.num_attempts,
        dummy_funnel.num_transactions,
        dummy_funnel.gross_merchandise_value,
        dummy_funnel.sub_category_id,
        dummy_funnel.is_dummy_flag
    FROM dummy_funnel
    INNER JOIN d_dummy_identity as dummy_identity
        ON dummy_identity.id = dummy_funnel.customer_id
    WHERE
        date_id BETWEEN {start_date_id} AND {end_date_id}""")

spark.sql(f"""
    MERGE INTO {table_name}
    USING f_dummy_funnel_stg
    ON f_dummy_funnel_stg.customer_id = {table_name}.customer_id
        AND f_dummy_funnel_stg.product = {table_name}.product
        AND f_dummy_funnel_stg.date_id = {table_name}.date_id
        AND f_dummy_funnel_stg.country_id = {table_name}.country_id
        AND f_dummy_funnel_stg.city_id = {table_name}.city_id
        AND f_dummy_funnel_stg.dummy_type_id = {table_name}.dummy_type_id
        AND f_dummy_funnel_stg.sub_category_id = {table_name}.sub_category_id
        AND f_dummy_funnel_stg.is_dummy_flag = {table_name}.is_dummy_flag
    WHEN MATCHED THEN UPDATE SET
        {table_name}.num_attempts = f_dummy_funnel_stg.num_attempts,
        {table_name}.num_transactions = f_dummy_funnel_stg.num_transactions,
        {table_name}.gross_merchandise_value = f_dummy_funnel_stg.gross_merchandise_value
    WHEN NOT MATCHED THEN INSERT (
        customer_id, product, date_id, country_id, city_id, dummy_type_id,
        num_attempts, num_transactions, gross_merchandise_value,
        sub_category_id, is_dummy_flag
    ) VALUES (
        f_dummy_funnel_stg.customer_id,
        f_dummy_funnel_stg.product,
        f_dummy_funnel_stg.date_id,
        f_dummy_funnel_stg.country_id,
        f_dummy_funnel_stg.city_id,
        f_dummy_funnel_stg.dummy_type_id,
        f_dummy_funnel_stg.num_attempts,
        f_dummy_funnel_stg.num_transactions,
        f_dummy_funnel_stg.gross_merchandise_value,
        f_dummy_funnel_stg.sub_category_id,
        f_dummy_funnel_stg.is_dummy_flag
    )
""")
```
In cases like this, I notice that the full lineage is not actually captured... I'd expect to see this having 5 upstreams: f_dummy_funnel_one, f_dummy_funnel_two, f_dummy_funnel_three, f_dummy_funnel_four, f_dummy_funnel_five, but I notice only 1-2 upstreams for this case... Would like to learn more about why this might happen, and whether this is expected behaviour or not. Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-15 06:48:43

*Thread Reply:* It would be useful to see the generated event or any logs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-16 03:09:05

*Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?

The smaller the SQL needed to reproduce the problem, the easier it is to find the root cause. Most issues are reproducible with just a few lines of code.
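
In that spirit, a reduction could start from a minimal sketch like this (table names are placeholders and a Delta-enabled Spark session is assumed) and grow back toward the original query until an upstream disappears:

```python
spark.sql("CREATE TABLE IF NOT EXISTS tgt (id INT, v INT) USING delta")
spark.sql("CREATE TABLE IF NOT EXISTS src_one (id INT, v INT) USING delta")
spark.sql("CREATE TABLE IF NOT EXISTS src_two (id INT, v INT) USING delta")

# two sources instead of five; check whether both still show up as upstreams
spark.sql("""
    SELECT * FROM src_one
    UNION ALL
    SELECT * FROM src_two
""").createOrReplaceTempView("stg")

spark.sql("""
    MERGE INTO tgt
    USING stg ON tgt.id = stg.id
    WHEN MATCHED THEN UPDATE SET tgt.v = stg.v
    WHEN NOT MATCHED THEN INSERT (id, v) VALUES (stg.id, stg.v)
""")
```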

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-08-16 03:34:30

*Thread Reply:* Yup, let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence

Abdallah (abdallah@terrab.me)
2023-08-16 07:33:21

Hello,

The OpenLineage Databricks integration is not working properly on our side, due to the filtering of adaptive_spark_plan events.

Please find the issue link.

https://github.com/OpenLineage/OpenLineage/issues/2058

⬆️ Mouad MOUSSABBIH, Abdallah
Harel Shein (harel.shein@gmail.com)
2023-08-16 09:24:09

*Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted! I was wondering if you'd consider opening up a PR? Would love to help you as a contributor, if that's something you are interested in.

Abdallah (abdallah@terrab.me)
2023-08-17 11:59:51

*Thread Reply:* Hello

Abdallah (abdallah@terrab.me)
2023-08-17 11:59:58

*Thread Reply:* Yes I am working on it

Abdallah (abdallah@terrab.me)
2023-08-17 12:00:14

*Thread Reply:* I deleted the line that has that filter.

Abdallah (abdallah@terrab.me)
2023-08-17 12:00:24

*Thread Reply:* I am adding some tests now

Abdallah (abdallah@terrab.me)
2023-08-17 12:00:45

*Thread Reply:* But running ./gradlew --no-daemon databricksIntegrationTest -x test -Pspark.version=3.4.0 -PdatabricksHost=$DATABRICKS_HOST -PdatabricksToken=$DATABRICKS_TOKEN

Abdallah (abdallah@terrab.me)
2023-08-17 12:01:11

*Thread Reply:* gives me
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/1.1.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.
   > Could not resolve io.openlineage:openlineage-sql-java:1.1.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-sql-java/1.1.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.

Abdallah (abdallah@terrab.me)
2023-08-17 12:01:25

*Thread Reply:* And I am trying to understand what should I do.

Abdallah (abdallah@terrab.me)
2023-08-17 12:13:37

*Thread Reply:* I am compiling sql integration

Abdallah (abdallah@terrab.me)
2023-08-17 13:04:15

*Thread Reply:* I built the java client

Abdallah (abdallah@terrab.me)
2023-08-17 13:04:29

*Thread Reply:* but having
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-java/1.1.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.
   > Could not resolve io.openlineage:openlineage-sql-java:1.1.0-SNAPSHOT.
     Required by: project :app > project :shared
      > Unable to load Maven meta-data from https://astronomer.jfrog.io/artifactory/maven-public-libs-snapshot/io/openlineage/openlineage-sql-java/1.1.0-SNAPSHOT/maven-metadata.xml.
        > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 326; The reference to entity "display" must end with the ';' delimiter.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-17 14:47:41

*Thread Reply:* Please do ./gradlew publishToMavenLocal in client/java directory

Abdallah (abdallah@terrab.me)
2023-08-17 14:47:59

*Thread Reply:* Okay thanks

Abdallah (abdallah@terrab.me)
2023-08-17 14:48:01

*Thread Reply:* will do

Abdallah (abdallah@terrab.me)
2023-08-22 10:33:02

*Thread Reply:* Hello back

Abdallah (abdallah@terrab.me)
2023-08-22 10:33:12

*Thread Reply:* I created a databricks cluster.

Abdallah (abdallah@terrab.me)
2023-08-22 10:35:00

*Thread Reply:* And I had some issues where -PdatabricksHost doesn't work with System.getProperty("databricksHost"), so I changed to -DdatabricksHost with System.getenv("databricksHost")

Abdallah (abdallah@terrab.me)
2023-08-22 10:36:19

*Thread Reply:* Then I had an issue where the path dbfs:/databricks/openlineage/ didn't exist, so I then created the folder /dbfs/databricks/openlineage/

Abdallah (abdallah@terrab.me)
2023-08-22 10:38:03

*Thread Reply:* And now I am investigating this issue:
java.lang.NullPointerException
  at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)
  at io.openlineage.spark.agent.DatabricksUtils.init(DatabricksUtils.java:66)
  at io.openlineage.spark.agent.DatabricksIntegrationTest.setup(DatabricksIntegrationTest.java:54)
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at ...
  worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
  Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id
    at app//com.databricks.sdk.core.error.ApiErrors.readErrorFromResponse(ApiErrors.java:48)
    at app//com.databricks.sdk.core.error.ApiErrors.checkForRetry(ApiErrors.java:22)
    at app//com.databricks.sdk.core.ApiClient.executeInner(ApiClient.java:236)
    at app//com.databricks.sdk.core.ApiClient.getResponse(ApiClient.java:197)
    at app//com.databricks.sdk.core.ApiClient.execute(ApiClient.java:187)
    at app//com.databricks.sdk.core.ApiClient.POST(ApiClient.java:149)
    at app//com.databricks.sdk.service.compute.ClustersImpl.delete(ClustersImpl.java:31)
    at app//com.databricks.sdk.service.compute.ClustersAPI.delete(ClustersAPI.java:191)
    at app//com.databricks.sdk.service.compute.ClustersAPI.delete(ClustersAPI.java:180)
    at app//io.openlineage.spark.agent.DatabricksUtils.shutdown(DatabricksUtils.java:96)
    at app//io.openlineage.spark.agent.DatabricksIntegrationTest.shutdown(DatabricksIntegrationTest.java:65)
    at ...

Abdallah (abdallah@terrab.me)
2023-08-22 10:39:22

*Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id

Abdallah (abdallah@terrab.me)
2023-08-22 10:40:18

*Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)

Abdallah (abdallah@terrab.me)
2023-08-22 10:54:51

*Thread Reply:* I did this: !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar

Abdallah (abdallah@terrab.me)
2023-08-22 10:55:29

*Thread Reply:* To create a fake file that can be deleted in the uploadOpenlineageJar function.

Abdallah (abdallah@terrab.me)
2023-08-22 10:56:09

*Thread Reply:* Because if there is no file, this part fails:
StreamSupport.stream(
        workspace.dbfs().list("dbfs:/databricks/openlineage/").spliterator(), false)
    .filter(f -> f.getPath().contains("openlineage-spark"))
    .filter(f -> f.getPath().endsWith(".jar"))
    .forEach(f -> workspace.dbfs().delete(f.getPath()));

😬 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-22 11:47:17

*Thread Reply:* does this work after !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar ?

Abdallah (abdallah@terrab.me)
2023-08-22 11:47:36

*Thread Reply:* Yes

Abdallah (abdallah@terrab.me)
2023-08-22 19:02:05

*Thread Reply:* I am now having another error in the driver

23/08/22 22:56:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener
  at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:3121)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:835)
  at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$initializeSharedDriverContext$1(DatabricksILoop.scala:362)
  ...
  at com.databricks.DatabricksMain.main(DatabricksMain.scala:146)
  at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala)
Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@298cfe89
  at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:263)

Abdallah (abdallah@terrab.me)
2023-08-22 19:19:29

*Thread Reply:* Can you please share with me your json conf for the cluster ?

Abdallah (abdallah@terrab.me)
2023-08-22 19:55:57

*Thread Reply:* It's because in my build file I have

Abdallah (abdallah@terrab.me)
2023-08-22 19:56:27

*Thread Reply:* and the one that was copied is

Abdallah (abdallah@terrab.me)
2023-08-22 20:01:12

*Thread Reply:* due to the findAny 😕
private static void uploadOpenlineageJar(WorkspaceClient workspace) {
    Path jarFile =
        Files.list(Paths.get("../build/libs/"))
            .filter(p -> p.getFileName().toString().startsWith("openlineage-spark-"))
            .filter(p -> p.getFileName().toString().endsWith("jar"))
            .findAny()
            .orElseThrow(() -> new RuntimeException("openlineage-spark jar not found"));

Abdallah (abdallah@terrab.me)
2023-08-22 20:35:10

*Thread Reply:* It works finally 😄

Abdallah (abdallah@terrab.me)
2023-08-23 05:16:19

*Thread Reply:* The PR 😄 https://github.com/OpenLineage/OpenLineage/pull/2061

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 08:23:49

*Thread Reply:* thanks for the pr 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 08:24:02

*Thread Reply:* code formatting checks complain now

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 08:25:09

*Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 09:06:26

*Thread Reply:* @Abdallah you're using newer version of Java than 8, right?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 09:07:07

*Thread Reply:* AFAIK googleJavaFormat behaves differently between Java versions

Abdallah (abdallah@terrab.me)
2023-08-23 09:15:41

*Thread Reply:* Okay I will switch back to another java version

Abdallah (abdallah@terrab.me)
2023-08-23 09:25:06

*Thread Reply:* terra@MacBook-Pro-M3 spark % java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)

Abdallah (abdallah@terrab.me)
2023-08-23 09:28:28

*Thread Reply:* Can you tell me which java version should I use ?

Abdallah (abdallah@terrab.me)
2023-08-23 09:49:42

*Thread Reply:* Hello @mobuchowski, I have
ERROR: Missing environment variable {i}
Can you please check where it comes from?

Abdallah (abdallah@terrab.me)
2023-08-23 09:50:24

*Thread Reply:* Can you help please ?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 10:08:43

*Thread Reply:* Java 8

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 10:10:14

*Thread Reply:* ```Hello, I have

@mobuchowski ERROR: Missing environment variable {i} Can you please check what does it come from ? (edited) ``` Yup, for now I have to manually make our CI account pick your changes up if you make PR from fork. Just did that

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 10:53:34

*Thread Reply:* @Abdallah merged 🙂

Abdallah (abdallah@terrab.me)
2023-08-23 10:59:22

*Thread Reply:* Thank you !

Michael Robinson (michael.robinson@astronomer.io)
2023-08-16 14:21:26

@channel Meetup notice: on Monday, 9/18, at 5:00 pm ET OpenLineage will be gathering in Toronto at Airflow Summit. Coming to the summit? Based in or near Toronto? Please join us to discuss topics such as:
• recent developments in the project including the addition of static lineage support and the OpenLineage Airflow Provider,
• the project's history and architecture,
• opportunities to contribute,
• resources for getting started,
• + more.
Please visit the meetup page for the specific location (which is not the conference hotel) and to sign up. Hope to see some of you there! (Please note that the start time is 5:00 pm ET.)

❤️ Julien Le Dem, Maciej Obuchowski, Harel Shein, Paweł Leszczyński, Athitya Kumar, tati
ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-20 17:45:41

I saw OpenLineage was built into Airflow recently as a provider, but the documentation seems really light (https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html). Is the documentation from OpenLineage the correct way I should proceed?

https://openlineage.io/docs/integrations/airflow/usage

👍 Sheeri Cabral (Collibra)
Julien Le Dem (julien@apache.org)
2023-08-21 20:26:56

*Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you’d be staying on the “old” integration. apache-airflow-providers-openlineage is the new package, maintained in the Airflow project that can be used starting Airflow 2.7 and is the recommended package moving forward. It is compatible with the configuration of the old package described in that usage page. CC: @Maciej Obuchowski @Jakub Dardziński It looks like this page needs improvement.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-22 05:03:28

*Thread Reply:* Yeah, I'll fix that

:gratitude_thank_you: Julien Le Dem
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-22 17:55:08

*Thread Reply:* https://github.com/apache/airflow/pull/33610

fyi

🙌 ldacey, Julien Le Dem
ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-22 17:54:20

Do I label certain raw data sources as a dataset, for example SFTP/FTP sites, O365 emails, etc.? I extract that data into a bucket for the client in a "folder" called "raw", which I know will be an OL Dataset. Would this GCS folder (after extracting the data with Airflow) be the first Dataset OL is aware of?

gcs://client-bucket/source-system-lob/raw

I then process that data into partitioned parquet datasets which would also be OL Datasets:
gcs://client-bucket/source-system-lob/staging
gcs://client-bucket/source-system-lob/analytics

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-22 18:02:46

*Thread Reply:* that really depends on the use case IMHO. If you consider a whole directory/folder as a dataset (meaning that each file inside folds into a larger whole), you should label the directory as the dataset

you might as well have a directory with each file being something different - in this case it would be best to set each file separately as a dataset

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-22 18:04:32

*Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936
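
For what it's worth, naming the directory as the dataset while recording a per-file name as a symlink could look roughly like this sketch with the Python client (bucket, paths, and file name are placeholders):

```python
from openlineage.client.run import Dataset
from openlineage.client.facet import (
    SymlinksDatasetFacet,
    SymlinksDatasetFacetIdentifiers,
)

# the directory is the dataset; an individual file is kept as an alternative name
raw = Dataset(
    namespace="gcs://client-bucket",
    name="source-system-lob/raw",
    facets={
        "symlinks": SymlinksDatasetFacet(
            identifiers=[
                SymlinksDatasetFacetIdentifiers(
                    namespace="gcs://client-bucket",
                    name="source-system-lob/raw/2023-08-22.csv",
                    type="FILE",
                )
            ]
        )
    },
)
```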

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-22 18:07:26

*Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files

👍 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 08:22:09

*Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside

👍 Jakub Dardziński
ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-25 10:26:52

*Thread Reply:* I tested a dataset for each raw file versus the folder, and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)

Since 2022, this particular source has had 6 raw schema changes (client controlled, no warning). What should I do to make that as obvious as possible if I track the dataset at a folder level?

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-25 10:32:19

*Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-25 10:32:57

*Thread Reply:* not sure what the best practice would be in this scenario though

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-22 17:55:38

• I also saw the docs reference URI = gs://{bucket name}{path} and wondered if the path would include the filename, or if it was just the base path like I showed above

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-22 18:35:45

Has anyone managed to get the OL Airflow integration to work on AWS MWAA? We've tried pretty much every trick but still ended up with the following error: Broken plugin: [openlineage.airflow.plugin] No module named 'openlineage.airflow'; 'openlineage' is not a package

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 05:22:18

*Thread Reply:* Which version are you trying to use?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 05:22:45

*Thread Reply:* Both OL and MWAA/Airflow 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 05:23:52

*Thread Reply:* 'openlineage' is not a package suggests that something went wrong with import process, for example cycle in import path

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-23 16:50:34

*Thread Reply:* MWAA: 2.6.3
OL: 1.0.0

I can see from the log that OL has been successfully installed to the webserver:
Successfully installed openlineage-airflow-1.0.0 openlineage-integration-common-1.0.0 openlineage-python-1.0.0 openlineage-sql-1.0.0
This is the full stacktrace:
```
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/plugins_manager.py", line 229, in load_entrypoint_plugins
    plugin_class = entry_point.load()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/importlib_metadata/__init__.py", line 209, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1001, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'openlineage.airflow'; 'openlineage' is not a package
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-24 08:18:36

*Thread Reply:* It's taking long to update the MWAA environment, but I tested version 2.6.3 with the following requirements.txt entries: openlineage-airflow and openlineage-airflow==1.0.0. Is there any step that might lead to some unexpected results?

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 08:29:30

*Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-24 08:33:53

*Thread Reply:* The thing is that I don't see any error messages. I wrote a simple DAG to test too:
```python
from __future__ import annotations

from datetime import datetime

from airflow.models import DAG

try:
    from airflow.operators.empty import EmptyOperator
except ModuleNotFoundError:
    from airflow.operators.dummy import DummyOperator as EmptyOperator  # type: ignore

from openlineage.airflow.adapter import OpenLineageAdapter
from openlineage.client.client import OpenLineageClient

from airflow.operators.python import PythonOperator

DAG_ID = "example_ol"


def callable():
    client = OpenLineageClient()
    adapter = OpenLineageAdapter()
    print(client, adapter)


with DAG(
    dag_id=DAG_ID,
    start_date=datetime(2021, 1, 1),
    schedule="@once",
    catchup=False,
) as dag:
    begin = EmptyOperator(task_id="begin")

    test = PythonOperator(task_id="print_client", python_callable=callable)
```
and it gives the expected results as well

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 08:48:11

*Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 😉

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-24 08:50:05

*Thread Reply:* ohh, I see you probably followed this guide: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/?

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 09:04:27

*Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-24 09:04:54

*Thread Reply:* tbh I don’t know

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 09:04:55

*Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-24 09:28:00

*Thread Reply:* I think it's still a plugin that sets env vars

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 09:32:18

*Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.
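
For reference, the "plugin that sets env vars" approach boils down to something like this sketch. OPENLINEAGE_URL, OPENLINEAGE_API_KEY, and OPENLINEAGE_NAMESPACE are the client's documented settings; the values are placeholders you'd pull from Secrets Manager in practice:

```python
# plugins/env_var_plugin.py - minimal sketch, values are placeholders
import os

from airflow.plugins_manager import AirflowPlugin

os.environ["OPENLINEAGE_URL"] = "https://your-lineage-endpoint"  # placeholder
os.environ["OPENLINEAGE_API_KEY"] = "your-api-key"  # placeholder
os.environ["OPENLINEAGE_NAMESPACE"] = "mwaa"  # optional


class EnvVarPlugin(AirflowPlugin):
    name = "env_var_plugin"
```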

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 10:31:50

*Thread Reply:* Alas, after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind sharing a screenshot of your cluster's settings so I can compare?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-24 11:57:04

*Thread Reply:* Are you maybe importing some top-level OpenLineage code anywhere? This error is most likely a circular import

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 12:01:12

*Thread Reply:* Let me try removing all the dags to see if it helps.

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 18:42:49

*Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-24 18:44:33

*Thread Reply:* Could this be the issue?
from airflow.lineage.entities import File, Table
How could I declare lineage manually if I can't import these classes?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-25 06:52:47

*Thread Reply:* @Mars Lan (Metaphor) I'll look in more details next week, as I'm in transit now

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-25 06:53:18

*Thread Reply:* but if you could narrow down the problem to a single DAG that I or @Jakub Dardziński could reproduce, ideally locally, it would help a lot

Mars Lan (Metaphor) (mars@metaphor.io)
2023-08-25 07:07:11

*Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.

🔥 Jakub Dardziński, Maciej Obuchowski
Nitin (nitinkhannain@yahoo.com)
2023-08-23 07:14:44

Hi All, I want to capture source and target table details as lineage information with OpenLineage for Amazon Redshift. Please let me know if anyone has done it

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-23 07:32:19

*Thread Reply:* are you using Airflow to connect to Redshift?

Nitin (nitinkhannain@yahoo.com)
2023-08-24 06:50:05

*Thread Reply:* Hi @Jakub Dardziński, thank you for your reply. No, we are not using Airflow. We are using load/unload commands with PySpark, and also Pandas with a JDBC connection

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-25 13:28:37

*Thread Reply:* @Paweł Leszczyński might know the answer as to whether the Spark<->OL integration works with Redshift. JDBC, at least, is supported via sqlparser

for Pandas I think there wasn't much work done

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 02:18:49

*Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.

Nitin (nitinkhannain@yahoo.com)
2023-08-28 04:53:03

*Thread Reply:* Hi @Jakub Dardziński / @Paweł Leszczyński, thank you for taking the time to reply to my query. We need to capture only LOAD and UNLOAD query lineage, which we are running using Spark.

If you have any sample implementation for reference, it would indeed be helpful

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 06:12:46

*Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8

Nitin (nitinkhannain@yahoo.com)
2023-08-28 08:18:14

*Thread Reply:* Yeah! Any way you can think of, we can accommodate it, especially the LOAD and UNLOAD statements. Also, we would like to capture lineage information where our endpoints are SageMaker and Redis

Nitin (nitinkhannain@yahoo.com)
2023-08-28 13:20:37

*Thread Reply:* @Paweł Leszczyński can we use this code base, integration/common/openlineage/common/provider/redshift_data.py, for Redshift lineage capture?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-28 14:26:40

*Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-28 14:31:00

*Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py

Abdallah (abdallah@terrab.me)
2023-08-23 10:55:10

Approve a new release please 🙂 • Fix spark integration filtering Databricks events.

➕ Abdallah, Tristan GUEZENNEC -CROIX-, Mouad MOUSSABBIH, Ayoub Oudmane, Asmae Tounsi, Jakub Dardziński, Michael Robinson, Harel Shein, Willy Lulciuc, Maciej Obuchowski, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2023-08-23 12:27:15

*Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.

🙌 Abdallah
Michael Robinson (michael.robinson@astronomer.io)
2023-08-23 13:13:18

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

Athitya Kumar (athityakumar@gmail.com)
2023-08-23 13:08:48

Hey folks! Do we have clear step-by-step documentation on how we can leverage the ServiceLoader based approach for injecting specific OpenLineage customisations for tweaking the transport type with defaults / tweaking column level lineage etc?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 13:29:05

*Thread Reply:* For custom transport, you have to provide implementation of interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in META_INF file

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 13:29:52

*Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo

Athitya Kumar (athityakumar@gmail.com)
2023-08-23 15:14:43

*Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in META_INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-23 15:23:13

*Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with fully qualified class names of your custom TransportBuilders there - like openlineage-spark has:
io.openlineage.client.transports.HttpTransportBuilder
io.openlineage.client.transports.KafkaTransportBuilder
io.openlineage.client.transports.ConsoleTransportBuilder
io.openlineage.client.transports.FileTransportBuilder
io.openlineage.client.transports.KinesisTransportBuilder

Athitya Kumar (athityakumar@gmail.com)
2023-08-25 01:49:29

*Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? https://github.com/OpenLineage/OpenLineage/issues/2007#issuecomment-1690350630

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-25 06:52:30

*Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit

👍 Athitya Kumar
Michael Robinson (michael.robinson@astronomer.io)
2023-08-23 15:04:10

@channel We released OpenLineage 1.1.0, including:
Additions:
• Flink: create Openlineage configuration based on Flink configuration #2033 @pawel-big-lebowski
• Java: add Javadocs to the Java client #2004 @julienledem
• Spark: append output dataset name to a job name #2036 @pawel-big-lebowski
• Spark: support Spark 3.4.1 #2057 @pawel-big-lebowski
Fixes:
• Flink: fix a bug when getting schema for KafkaSink #2042 @pentium3
• Spark: fix ignored event adaptive_spark_plan in Databricks #2061 @algorithmy1
Plus additional bug fixes, doc changes and more.
Thanks to all the contributors, especially new contributors @pentium3 and @Abdallah!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.1.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.0.0...1.1.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👏 Ayoub Oudmane, Abdallah, Yuanli Wang, Athitya Kumar, Mars Lan (Metaphor), Maciej Obuchowski, Harel Shein, Kiran Hiremath, Thomas Abraham
:gratitude_thank_you: GitHubOpenLineageIssues
Michael Robinson (michael.robinson@astronomer.io)
2023-08-25 10:29:23

@channel Friendly reminder: our next in-person meetup is next Wednesday, August 30th in San Francisco at Astronomer’s offices in the Financial District. You can sign up and find the details on the Meetup event page.

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-08-25 10:57:30

hi OpenLineage team, we would like to join one of your meetups (me, @Madhav Kakumani and @Phil Rolph) and we're wondering if you are hosting any meetups after the 18/9? We are trying to join this one but air tickets are quite expensive

Harel Shein (harel.shein@gmail.com)
2023-08-25 11:32:12

*Thread Reply:* there will certainly be more meetups, don’t worry about that!

Harel Shein (harel.shein@gmail.com)
2023-08-25 11:32:30

*Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.

George Polychronopoulos (george.polychronopoulos@6point6.co.uk)
2023-08-25 11:49:37

*Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!

Michael Robinson (michael.robinson@astronomer.io)
2023-08-25 11:51:39

*Thread Reply:* This is awesome, thanks @George Polychronopoulos. I’ll start a channel and invite you

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-28 04:47:53

hi folks, I'm looking into exporting static metadata, and found that DatasetEvent requires an eventTime, which in my mind doesn't make sense for static events. I'm setting it to None and the Python client seems to work, but wanted to ask if I'm missing something.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 05:59:10

*Thread Reply:* Although you emit DatasetEvent, you still emit an event and eventTime is a valid marker.

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-28 06:01:40

*Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 06:01:53

*Thread Reply:* yes, that should be it

:gratitude_thank_you: Juan Luis Cano Rodríguez
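
A minimal sketch of emitting such a DatasetEvent with the Python client, with the current time as eventTime as suggested above (import paths and field names should be checked against your openlineage-python version; the producer URI is a placeholder):

```python
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, DatasetEvent  # module path may vary by client version

client = OpenLineageClient(url="http://localhost:5000")  # or configure via transport/env

event = DatasetEvent(
    eventTime=datetime.now(timezone.utc).isoformat(),  # current time at emission
    producer="https://example.com/my-static-metadata-exporter",  # placeholder producer URI
    schemaURL="https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/DatasetEvent",
    dataset=Dataset(namespace="my_namespace", name="my_table"),
)
client.emit(event)
```
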
Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-28 04:49:21

and something else: I understand that Marquez does not yet support the 2.0 spec, hence it's incompatible with static metadata, right? I tried to emit a list of DatasetEvents and got HTTPError: 422 Client Error: Unprocessable Entity for url: <http://localhost:3000/api/v1/lineage> (I'm using a FileTransport for now)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 06:02:49

*Thread Reply:* marquez is not capable of reflecting DatasetEvents in DB but it should respond with Unsupported event type

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-28 06:03:15

*Thread Reply:* and return 200 instead of 201 created

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-28 06:05:41

*Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @Paweł Leszczyński

Joshua Dotson (josdotso@cisco.com)
2023-08-28 13:25:58

Hi folks. I have some pure golang jobs from which I need to emit OL events to Marquez. Is the right way to go about this to generate a Golang client from the Marquez OpenAPI spec and use that client from my go jobs?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-28 14:23:24

*Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)

Joshua Dotson (josdotso@cisco.com)
2023-08-28 15:12:21

*Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-28 15:30:05

*Thread Reply:* Correct! You’re also very welcome to contribute Golang client (currently we have Python & Java clients) if you manage to send events using golang 🙂

👏 Joshua Dotson
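
For reference, the overall shape (sketched here in Python for brevity; the same structure applies to Go): build an object matching the RunEvent JSON Schema and POST it to Marquez's /api/v1/lineage endpoint. The runId, namespace, job name and producer URI below are placeholders.

```python
import json
import urllib.request
from datetime import datetime, timezone

# a minimal RunEvent body; see the OpenLineage JSON Schema for required fields
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/golang-jobs",  # placeholder
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    "run": {"runId": "3f5e83fa-3480-44bd-80e5-cc141423ffbb"},  # any UUID
    "job": {"namespace": "my-namespace", "name": "my-go-job"},
    "inputs": [],
    "outputs": [],
}

req = urllib.request.Request(
    "http://localhost:5000/api/v1/lineage",
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```
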
Michael Robinson (michael.robinson@astronomer.io)
2023-08-28 17:28:31

@channel The agenda for the Toronto Meetup at Airflow Summit on 9/18 has been updated. This promises to be an exciting, richly productive discussion. Don’t miss it if you’ll be in the area!

  1. Intros
  2. Evolution of spec presentation/discussion (project background/history)
  3. State of the community
  4. Spark/Column lineage update
  5. Airflow Provider update
  6. Roadmap Discussion
  7. Action items review/next steps
❤️ Jarek Potiuk, Paweł Leszczyński, tati
Michael Robinson (michael.robinson@astronomer.io)
2023-08-28 20:05:37

New on the OpenLineage blog: a close look at the new OpenLineage Airflow Provider, including:
• the critical improvements it brings to the integration
• the high-level design
• implementation details
• an example operator
• planned enhancements
• a list of supported operators
• more.
The post, by @Maciej Obuchowski, @Julien Le Dem and myself, is live now on the OpenLineage blog.

🎉 Drew Meyers, Harel Shein, Maciej Obuchowski, Julian LaNeve, Mars Lan (Metaphor)
Sarwat Fatima (sarwatfatimam@gmail.com)
2023-08-29 03:18:04

Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screen shots.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-08-29 03:20:18

*Thread Reply:* is 5000 port taken by any other application? or ./docker/up.sh has some errors in logs?

Sarwat Fatima (sarwatfatimam@gmail.com)
2023-08-29 05:23:01

*Thread Reply:* @Jakub Dardziński 5000 port is not taken by any other application. The logs show some errors but I am not sure what is the issue here.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-29 10:02:38

*Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-29 05:20:39

hi folks, for now I'm producing .jsonl (or .ndjson ) files with one event per line, do you know if there's any way to validate those? would standard JSON Schema tools work?

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-08-29 10:58:29

*Thread Reply:* reply by @Julian LaNeve: yes 🙂💯

👍 Maciej Obuchowski
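
A minimal sketch of such validation with the jsonschema package (the pinned spec URL should match the spec version you emit, and depending on that version you may need extra $ref resolution):

```python
import json
import urllib.request

from jsonschema import validate  # pip install jsonschema

SPEC_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json"  # pin to the version you emit
with urllib.request.urlopen(SPEC_URL) as resp:
    schema = json.load(resp)

with open("events.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        event = json.loads(line)
        validate(instance=event, schema=schema)  # raises ValidationError on a bad event
        print(f"line {lineno}: OK")
```
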
ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-29 13:12:32

for namespaces, if my data is moving between sources (SFTP -> GCS -> Azure Blob (synapse connects to parquet datasets) then should my namespace be based on the client I am working with? my current namespace has been to refer to the bucket, but that falls apart when considering the data sources and some destinations. perhaps I should just add a field for client-name instead to have a consolidated view?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-30 10:53:08

*Thread Reply:* > then should my namespace be based on the client I am working with?
I think each of those sources should be a different namespace?

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-30 12:59:53

*Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization

ldacey (lance.dacey2@sutherlandglobal.com)
2023-08-30 13:01:18

*Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-08-30 13:06:38

*Thread Reply:* Yes, unless you name it the same as one of the Airflow facets 🙂
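
For illustration, a custom run facet along these lines keeps a distinct, prefixed name so it cannot collide with the Airflow-provided facets (the class name, fields, and facet key are hypothetical; the BaseFacet import path may vary by client version):

```python
import attr

from openlineage.client.facet import BaseFacet  # location may vary by client version


@attr.s
class FileStatsRunFacet(BaseFacet):
    # hypothetical per-schedule file details, as described above
    filename: str = attr.ib()
    rowCount: int = attr.ib()
    byteSize: int = attr.ib()


# attached under a prefixed key so it won't shadow any Airflow facet, e.g.:
# run facets = {"myTeam_fileStats": FileStatsRunFacet("a.parquet", 1000, 123456)}
```
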

GitHubOpenLineageIssues (githubopenlineageissues@gmail.com)
2023-08-30 08:15:29

Hi, Will really appreciate if someone can guide me or provide me any pointer - if they have been able to implement authentication/authorization for access to Marquez. Have not seen much info around it. Any pointers greatly appreciated. Thanks in advance.

Julien Le Dem (julien@apache.org)
2023-08-30 12:23:18

*Thread Reply:* I’ve seen people do this through the ingress controller in Kubernetes. Unfortunately I don’t have documentation besides k8s specific ones you would find for the ingress controller you’re using. You’d redirect any unauthenticated request to your identity provider

:gratitude_thank_you: GitHubOpenLineageIssues
Michael Robinson (michael.robinson@astronomer.io)
2023-08-30 11:50:05

@channel Friendly reminder: there’s a meetup tonight at Astronomer’s offices in SF!

✅ Sheeri Cabral (Collibra)
Julien Le Dem (julien@apache.org)
2023-08-30 12:15:31

*Thread Reply:* I’ll be there and looking forward to see @John Lukenoff ‘s presentation

Michael Barrientos (mbarrien@gmail.com)
2023-08-30 21:38:31

Can anyone let 3 people stuck downstairs into the 7th floor?

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-08-30 23:25:21

*Thread Reply:* Sorry about that!

Yunhe (yunhe52203334@outlook.com)
2023-08-31 02:31:48

hello everyone, I can run OpenLineage Spark code in my notebook with Python, but when I use my IDEA to execute Scala code like this:

```
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}
import sun.java2d.marlin.MarlinUtils.logInfo

object Test {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local")
      .appName("test")
      .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.12.0")
      .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
      .config("spark.openlineage.transport.type", "console")
      .getOrCreate()

    spark.sparkContext.setLogLevel("INFO")

    //spark.sparkContext.addSparkListener(new MySparkAppListener)
    import spark.implicits._
    val input = Seq((1, "zs", 2020), (2, "ls", 2023)).toDF("id", "name", "year")

    input.select("id", "name").orderBy("id").show()
  }
}
```

there is something wrong:

```
Exception in thread "spark-listener-group-shared" java.lang.NoSuchMethodError: io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml(Ljava/io/InputStream;)Lio/openlineage/client/OpenLineageYaml;
	at io.openlineage.spark.agent.ArgumentParser.extractOpenlineageConfFromSparkConf(ArgumentParser.java:114)
	at io.openlineage.spark.agent.ArgumentParser.parse(ArgumentParser.java:78)
	at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:277)
	at io.openlineage.spark.agent.OpenLineageSparkListener.onApplicationStart(OpenLineageSparkListener.java:267)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:55)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1446)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```

I want to know how I can set up the IDEA Scala environment correctly.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-08-31 02:58:41

*Thread Reply:* io.openlineage:openlineage-spark:0.12.0 -> could you repeat the steps with a newer version?

Yunhe (yunhe52203334@outlook.com)
2023-08-31 03:51:52

ok, it's my first time using this lineage tool. First, I added dependencies in my pom.xml like this:

```
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-java</artifactId>
    <version>0.12.0</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.7</version>
</dependency>
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark</artifactId>
    <version>0.30.1</version>
</dependency>
```

My Spark version is 3.3.1, and that version cannot change.

Second, in the OpenLineage/integration/spark directory I ran docker-compose up and followed the steps in this doc: https://openlineage.io/docs/integrations/spark/quickstart_local. There is no error when I use the notebook to execute PySpark for OpenLineage, and I can get the JSON messages. But after docker-compose up, when I use my IDEA tool to execute the Scala code above, the error above happens. It seems that I have not configured the environment correctly, so how can I fix the problem?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-01 05:15:28

*Thread Reply:* please use the latest io.openlineage:openlineage-spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-08-31 15:33:19

Will the August meeting be put up at https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting soon? (usually it’s up in a few days 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-01 06:00:53

*Thread Reply:* @Michael Robinson

Michael Robinson (michael.robinson@astronomer.io)
2023-09-01 17:13:32

*Thread Reply:* The recording is on the youtube channel here. I’ll update the wiki ASAP

✅ Sheeri Cabral (Collibra)
Julien Le Dem (julien@apache.org)
2023-08-31 18:10:20

It sounds like there have been a few announcements at Google Next: https://cloud.google.com/data-catalog/docs/how-to/open-lineage https://cloud.google.com/dataproc/docs/guides/lineage

🎉 Harel Shein, Willy Lulciuc, Kevin Languasco, Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson, Jakub Dardziński, Kiran Hiremath, Laurent Paris, Anastasia Khomyakova
🙌 Harel Shein, Willy Lulciuc, Mars Lan (Metaphor), Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Eric Veleker, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson
❤️ Willy Lulciuc, Maciej Obuchowski, ldacey, Ross Turk, Michael Robinson
Julien Le Dem (julien@apache.org)
2023-09-01 23:09:55

*Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s

Michael Robinson (michael.robinson@astronomer.io)
2023-09-01 17:16:21

@channel The latest issue of OpenLineage News is out now! Please subscribe to get it directly in your inbox each month.

🙌 Jakub Dardziński, Maciej Obuchowski
🙌:skin_tone_3: Juan Luis Cano Rodríguez
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-09-04 03:38:28

Hi guys, I'd like to capture the spark.databricks.clusterUsageTags.clusterAllTags property from databricks. However, the value of this is a list of keys, and therefore cannot be supported by custom environment facet builder. I was thinking that capturing this property might be useful for most databricks workloads, and whether it might make sense to auto-capture it along with other databricks variables, similar to how we capture mount points for the databricks jobs. Does this sound okay? If so, then I can help to contribute this functionality

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-04 06:43:47

*Thread Reply:* Sounds good to me

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-09-11 05:15:03

*Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-09-04 06:39:05

Also, another small clarification is that when using MergeIntoCommand, I'm receiving the lineage events on the backend, but I cannot seem to find any logging of the payload when I enable debug mode in openlineage. I remember there was a similar issue reported by another user in the past. May I check if it might be possible to help with this? It's making debugging quite hard for these cases. Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-04 06:54:12

*Thread Reply:* I think it only depends on log4j configuration

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-04 06:57:15

*Thread Reply:* ```
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# set the log level for the openlineage spark library
log4j.logger.io.openlineage.spark=DEBUG
```
this is what we have in `log4j.properties` in test environment and it works

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-09-04 11:28:11

*Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-05 03:33:03

*Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.

The possible problem could be filtering delta events (which we do bcz of delta being noisy)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-05 03:33:36

*Thread Reply:* Recently, we've closed that https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for `createOrReplaceTempView`

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-05 03:35:12

*Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-09-05 05:19:22

*Thread Reply:* Hmm I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc. because its noisy right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that its not being logged anymore... I'm using 0.23.0 openlineage version.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-05 05:47:51

*Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-04 21:35:05

it looks like my dynamic task mapping in Airflow has the same run ID in marquez, so even if I am processing 100 files, there is only one version of the data. is there a way to have a separate version of each dynamic task so I can track the filename etc?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-05 08:54:57

*Thread Reply:* map_index should indeed be included when calculating run ID (it’s deterministic in the Airflow integration). what version of Airflow are you using btw?

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-05 09:04:14

*Thread Reply:* 2.7.0

I do see this error log in all of my dynamic tasks which might explain it:

[2023-09-05, 00:31:57 UTC] {manager.py:200} ERROR - Extractor returns non-valid metadata: None
[2023-09-05, 00:31:57 UTC] {utils.py:401} ERROR - cannot import name 'get_operator_class' from 'airflow.providers.openlineage.utils' (/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/__init__.py)
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/utils.py", line 399, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/plugins/listener.py", line 93, in on_running
    **get_custom_facets(task_instance),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/utils.py", line 148, in get_custom_facets
    custom_facets["airflow_mappedTask"] = AirflowMappedTaskRunFacet.from_task_instance(task_instance)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/plugins/facets.py", line 36, in from_task_instance
    from airflow.providers.openlineage.utils import get_operator_class
ImportError: cannot import name 'get_operator_class' from 'airflow.providers.openlineage.utils' (/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/utils/__init__.py)

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-05 09:05:34

*Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-05 09:06:05

*Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-05 09:16:32

*Thread Reply:* seems like some circular import

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-05 09:19:47

*Thread Reply:* I just tested it manually, it’s a bug in OL provider. let me fix that

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-05 10:53:28

*Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-07 11:46:20

*Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-07 11:50:07

*Thread Reply:* generally speaking anything related to OL-Airflow should be placed to Airflow repo, important changes/bug fixes would be implemented in OL repo as well

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-07 15:40:31

*Thread Reply:* got it, thanks

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-07 19:43:46

*Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?

i was going to try to install from Airflow main branch but didnt want to mess anything up

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-07 19:44:39

*Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-08 05:45:48

*Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages building the provider package on your own could be best idea probably? that depends on how you manage your Airflow instance

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-08 12:01:53

*Thread Reply:* there's 1.1.0rc1 btw

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-08 13:44:44

*Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha

👍 Jakub Dardziński
ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-10 20:29:00

*Thread Reply:* The dynamic task mapping error is gone, I did run into this:

File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/extractors/base.py", line 70, in disabledoperators operator.strip() for operator in conf.get("openlineage", "disabledfor_operators").split(";") ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/airflow/.local/lib/python3.11/site-packages/airflow/configuration.py", line 1065, in get raise AirflowConfigException(f"section/key [{section}/{key}] not found in config")

I am redeploying now with that option added to my config. I guess it did not use the default which should be ""

ldacey (lance.dacey2@sutherlandglobal.com)
2023-09-10 20:49:17

*Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)

openlineage:
  disabled_for_operators: ""
...

this is so much better and makes a lot more sense. most of my tasks are dynamic so I was missing a lot of metadata before the fix, thanks!

Abdallah (abdallah@terrab.me)
2023-09-06 16:43:07

Hello Everyone,

I've been diving into the Marquez codebase and found a performance bottleneck in JobDao.java for the query related to namespaceName=MyNameSpace with limit=10, and 12s with limit=25. I managed to optimize it using CTEs, and the execution times dropped dramatically to 300ms (for limit=100) and under 100ms (for limit=25) on the same cluster. Issue link: https://github.com/MarquezProject/marquez/issues/2608

I believe there's even more room for optimization, especially if we adjust the job_facets_view to include the namespace_name column.

Would the team be open to a PR where I share the optimized query and discuss potential further refinements? I believe these changes could significantly enhance the Marquez web UI experience.

PR link : https://github.com/MarquezProject/marquez/pull/2609

Looking forward to your feedback.

🔥 Jakub Dardziński, Harel Shein, Paweł Leszczyński, Maciej Obuchowski
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-06 18:03:01

*Thread Reply:* @Willy Lulciuc wdyt?

Bernat Gabor (gaborjbernat@gmail.com)
2023-09-06 17:44:12

Has there been any conversation on the extensibility of facets/concepts? E.g.:
• how does one extend the list of run states https://openlineage.io/docs/spec/run-cycle to add a paused/resumed state?
• how does one extend https://openlineage.io/docs/spec/facets/run-facets/nominal_time to add a created-at field?

Julien Le Dem (julien@apache.org)
2023-09-06 18:28:17

*Thread Reply:* Hello Bernat,

The primary mechanism to extend the model is through facets. You can either:
• create new standard facets in the spec: https://github.com/OpenLineage/OpenLineage/tree/main/spec/facets
• create custom facets defined somewhere else with a prefix in their name: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
• update existing facets with a backward compatible change (example: adding an optional field).
The core spec can also be modified; there is an example of adding a state. That being said, I think more granular states like pause/resume are probably better suited in a run facet. There was an issue opened for that particular one a while ago: https://github.com/OpenLineage/OpenLineage/issues/9, maybe that particular discussion can continue there.

For the nominal time facet, You could open an issue describing the use case and on community agreement follow up with a PR on the facet itself: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/NominalTimeRunFacet.json (adding an optional field is backwards compatible)

👀 Juan Luis Cano Rodríguez
Bernat Gabor (gaborjbernat@gmail.com)
2023-09-06 18:31:12

*Thread Reply:* I see, so in general one is best off copying a standard facet and maintaining it under a different name. That way it can be made mandatory 🙂 and one does not need to be blocked for a long time until there's a community agreement 🤔

Julien Le Dem (julien@apache.org)
2023-09-06 18:35:43

*Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. If the custom facet is very specific to a third party project/product then it makes sense for it to stay a custom facet. If it is more generic then it makes sense to add it to the core facets as part of the spec. Hopefully community agreement can be achieved relatively quickly. Unless someone is strongly against something, it can be added without too much red tape. Typically with support in at least one of the integrations to validate the model.

Michael Robinson (michael.robinson@astronomer.io)
2023-09-07 15:12:20

@channel This month’s TSC meeting is next Thursday the 14th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• demo: Spark integration tests in Databricks runtime
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Also, feel free to reply or DM me with discussion topics, agenda items, etc.

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-09-11 10:07:41

@channel The first Toronto OpenLineage Meetup, featuring a presentation by recent adopter Metaphor, is just one week away. On the agenda:

  1. Evolution of spec presentation/discussion (project background/history)
  2. State of the community
  3. Integrating OpenLineage with Metaphor (by special guests Ye & Ivan)
  4. Spark/Column lineage update
  5. Airflow Provider update
  6. Roadmap Discussion
Find more details and RSVP here: https://www.meetup.com/openlineage/events/295488014/
🙌 Mars Lan (Metaphor), Jarek Potiuk, Harel Shein, Maciej Obuchowski, Peter Hicks, Paweł Leszczyński, Dongjin Seo
John Lukenoff (john@jlukenoff.com)
2023-09-11 17:07:26

I’m seeing some odd behavior with my http transport when upgrading airflow/openlineage-airflow from 2.3.2 -> 2.6.3 and 0.24.0 -> 0.28.0. Previously I had a config like this that let me provide my own auth tokens. However, after upgrading I’m getting a 401 from the endpoint and further debugging seems to reveal that we’re not using the token provided in my TokenProvider. Does anyone know if something changed between these versions that could be causing this? (more details in 🧵 )
transport:
  type: http
  url: <https://my.fake-marquez-endpoint.com>
  auth:
    type: some.fully.qualified.classpath
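
(For context, a custom provider along these lines is what the auth.type classpath above points at; a minimal sketch, assuming the TokenProvider base class from openlineage.client.transport.http and a hypothetical token config key:)

```python
from openlineage.client.transport.http import TokenProvider  # module path per your client version


class MyTokenProvider(TokenProvider):
    def __init__(self, config: dict) -> None:
        super().__init__(config)
        self.token = config.get("token", "")  # hypothetical config key

    def get_bearer(self) -> str:
        # the value returned here ends up in the Authorization header
        return f"Bearer {self.token}"
```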

John Lukenoff (john@jlukenoff.com)
2023-09-11 17:09:40

*Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55

John Lukenoff (john@jlukenoff.com)
2023-09-11 17:11:14

*Thread Reply:* However, if I log the token_provider here I get the origin TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154

John Lukenoff (john@jlukenoff.com)
2023-09-11 17:18:56

*Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57

John Lukenoff (john@jlukenoff.com)
2023-09-11 17:37:42

*Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100

❤️ Julien Le Dem
Sarwat Fatima (sarwatfatimam@gmail.com)
2023-09-12 08:14:06

This particular code in docker-compose exits with code 1 because it is unable to find the wait-for-it.sh file in the container. I have checked the mounting path from the local machine, it is correct, and the path on the container for Marquez is also correct, i.e. /usr/src/app, but it is unable to mount wait-for-it.sh. Does anyone know why this is? This code exists in the OpenLineage repository as well: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/docker-compose.yml
  # Marquez as an OpenLineage Client
  api:
    image: marquezproject/marquez
    container_name: marquez-api
    ports:
      - "5000:5000"
      - "5001:5001"
    volumes:
      - ./docker/wait-for-it.sh:/usr/src/app/wait-for-it.sh
    links:
      - "db:postgres"
    depends_on:
      - db
    entrypoint: [ "./wait-for-it.sh", "db:5432", "--", "./entrypoint.sh" ]

Sarwat Fatima (sarwatfatimam@gmail.com)
2023-09-12 08:15:19

*Thread Reply:* This is the error message:

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-12 10:38:41

*Thread Reply:* no permissions?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-12 15:11:45

I am trying to run Google Cloud Composer where I have added the openlineage-airflow PyPI package as a dependency and have added the env OPENLINEAGE_EXTRACTORS to point to my custom extractor. I have added a folder by the name dependencies and inside that I have placed my extractor file, and the path given to OPENLINEAGE_EXTRACTORS is dependencies.<filename>.<extractor_class_name>… still it fails with the exception saying No module named ‘dependencies’. Can anyone kindly help me out on correcting my mistake

Harel Shein (harel.shein@gmail.com)
2023-09-12 17:15:36

*Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you’re using?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-12 17:16:26

*Thread Reply:* airflow ---> 2.5.3, openlinegae-airflow ---> 1.1.0

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-12 17:45:08

*Thread Reply:* ```
import traceback
import uuid
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.airflow.utils import get_job_name


class BigQueryInsertJobExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['BigQueryInsertJobOperator']

    def extract(self) -> Optional[TaskMetadata]:
        return None

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        self.log.debug(f"JEEVAN ---> extract_on_complete({task_instance})")
        random_uuid = str(uuid.uuid4())
        self.log.debug(f"JEEVAN ---> Randomly Generated UUID --> {random_uuid}")

        self.operator.job_id = random_uuid

        return TaskMetadata(
            name=get_job_name(task=self.operator)
        )
```
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-12 17:45:24

*Thread Reply:* this is the custom extractor code that I’m trying with

Harel Shein (harel.shein@gmail.com)
2023-09-12 21:10:02

*Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 07:54:26

*Thread Reply:* No module named 'dependencies'. This sounds like a general Python problem

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 07:55:12

*Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 07:56:28

*Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too
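
A quick sanity check is to put the same import at the top of a DAG file; if the DAG still parses, the OPENLINEAGE_EXTRACTORS path should resolve too (module and class names here are the ones from this thread):

```python
# inside any DAG file, purely as a parse-time check
from dependencies.big_query_insert_job_extractor import BigQueryInsertJobExtractor
```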

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:01:12

*Thread Reply:* The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:01:32
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:01:47

*Thread Reply:* > The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod OL integration is not running on triggerer, only on worker and scheduler pods

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:01:53
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:03:26

*Thread Reply:* As you can see in this screenshot i am seeing the logs of the triggerer and it says clearly unable to import plugin openlineage

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:10:32

*Thread Reply:* I see. There are few possible things to do here - composer could mount the user files, Airflow could not start plugins on triggerer, or we could detect we're on triggerer and not import anything there. However, does it impact OL or Airflow operation in other way than this log?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:12:06

*Thread Reply:* Probably we'd have to do something if that really bothers you as there won't be further changes to Airflow 2.5

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:18:14

*Thread Reply:* The Problem is it is actually not registering this custom extractor written by me, henceforth i am just receiving the DefaultExtractor things and my piece of extractor code is not even getting triggered

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:22:49

*Thread Reply:* any suggestions to try @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:27:48

*Thread Reply:* Could you share worker logs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:27:56

*Thread Reply:* and check if module is importable from your dag code?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:31:25

*Thread Reply:* these are the worker pod logs… where there is no log of the openlineage plugin

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:31:52

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:38:32

*Thread Reply:* { "textPayload": "Traceback (most recent call last): File \"/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py\", line 427, in import_from_string module = importlib.import_module(module_path) File \"/opt/python3.8/lib/python3.8/importlib/__init__.py\", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 961, in _find_and_load_unlocked File \"&lt;frozen importlib._bootstrap&gt;\", line 219, in _call_with_frames_removed File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 961, in _find_and_load_unlocked File \"&lt;frozen importlib._bootstrap&gt;\", line 219, in _call_with_frames_removed File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 973, in _find_and_load_unlockedModuleNotFoundError: No module named 'airflow.gcs'", "insertId": "pt2eu6fl9z5vw", "resource": { "type": "cloud_composer_environment", "labels": { "environment_name": "openlineage", "location": "us-west1", "project_id": "acceldata-acm" } }, "timestamp": "2023-09-13T06:20:44.131577764Z", "severity": "ERROR", "labels": { "worker_id": "airflow-worker-xttt8" }, "logName": "projects/acceldata-acm/logs/airflow-worker", "receiveTimestamp": "2023-09-13T06:20:48.847319607Z" }, it doesn't see No module named 'airflow.gcs' that is part of your extractor path airflow.gcs.dags.big_query_insert_job_extractor.BigQueryInsertJobExtractor however, is it necessary? I generally see people using imports directly from dags folder

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:44:11

*Thread Reply:* this is one of the experiments that I did, but then I reverted it back to keeping it to dependencies.big_query_insert_job_extractor.BigQueryInsertJobExtractor… where dependencies is a module I have created inside my dags folder

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:45:46

*Thread Reply:* these are the logs of the triggerer pod specifically

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:46:31

*Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:47:09

*Thread Reply:* maybe __init__.py is missing for top-level dag path?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:49:01

*Thread Reply:* these are the logs of the worker pod at startup, where it does not complain of the plugin like in triggerer, but when tasks are run on this worker…somehow it is not picking up the extractor for the operator that i have written it for

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 08:49:54

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean to make the dags folder a module as well, by adding the __init__.py?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:55:24

*Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem
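
For example, a layout along these lines (file names taken from this thread; the env var then references the module without any package prefix):

```
dags/
├── my_dag.py
└── big_query_insert_job_extractor.py

OPENLINEAGE_EXTRACTORS=big_query_insert_job_extractor.BigQueryInsertJobExtractor
```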

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 08:55:48

*Thread Reply:* and would be nice if you could set AIRFLOW__LOGGING__LOGGING_LEVEL="DEBUG"

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:14:58

*Thread Reply:* ```
Starting the process, got command: triggerer
Initializing airflow.cfg.
airflow.cfg initialization is done.
[2023-09-13T13:11:46.620+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 8)
[2023-09-13T13:11:46.622+0000] {settings.py:372} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=570, pid=8
[2023-09-13T13:11:46.742+0000] {cli_action_loggers.py:39} DEBUG - Adding <function default_action_log at 0x7ff39ca1d3a0> to pre execution callback
[2023-09-13T13:11:47.638+0000] {cli_action_loggers.py:65} DEBUG - Calling callbacks: [<function default_action_log at 0x7ff39ca1d3a0>]
[Airflow ASCII banner]
[2023-09-13T13:11:50.527+0000] {plugins_manager.py:300} DEBUG - Loading plugins
[2023-09-13T13:11:50.580+0000] {plugins_manager.py:244} DEBUG - Loading plugins from directory: /home/airflow/gcs/plugins
[2023-09-13T13:11:50.581+0000] {plugins_manager.py:224} DEBUG - Loading plugins from entrypoints
[2023-09-13T13:11:50.587+0000] {plugins_manager.py:227} DEBUG - Importing entry_point plugin OpenLineagePlugin
[2023-09-13T13:11:50.740+0000] {utils.py:430} WARNING - No module named 'boto3'
[2023-09-13T13:11:50.743+0000] {utils.py:430} WARNING - No module named 'botocore'
[2023-09-13T13:11:50.833+0000] {utils.py:430} WARNING - No module named 'airflow.providers.sftp'
[2023-09-13T13:11:51.144+0000] {utils.py:430} WARNING - No module named 'big_query_insert_job_extractor'
[2023-09-13T13:11:51.145+0000] {plugins_manager.py:237} ERROR - Failed to import plugin OpenLineagePlugin
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py", line 427, in import_from_string
    module = importlib.import_module(module_path)
  File "/opt/python3.8/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'big_query_insert_job_extractor'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/plugins_manager.py", line 229, in load_entrypoint_plugins
    plugin_class = entry_point.load()
  File "/opt/python3.8/lib/python3.8/site-packages/setuptools/_vendor/importlib_metadata/__init__.py", line 194, in load
    module = import_module(match.group('module'))
  File "/opt/python3.8/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/plugin.py", line 32, in <module>
    from openlineage.airflow import listener
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/listener.py", line 75, in <module>
    extractor_manager = ExtractorManager()
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/extractors/manager.py", line 16, in __init__
    self.task_to_extractor = Extractors()
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/extractors/extractors.py", line 122, in __init__
    extractor = import_from_string(extractor.strip())
  File "/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py", line 431, in import_from_string
    raise ImportError(f"Failed to import {path}") from e
ImportError: Failed to import big_query_insert_job_extractor.BigQueryInsertJobExtractor
[2023-09-13T13:11:51.235+0000] {plugins_manager.py:227} DEBUG - Importing entry_point plugin composer_menu_plugin
[2023-09-13T13:11:51.719+0000] {plugins_manager.py:316} DEBUG - Loading 1 plugin(s) took 1.14 seconds
[2023-09-13T13:11:51.733+0000] {triggerer_job.py:101} INFO - Starting the triggerer
[2023-09-13T13:11:51.734+0000] {selector_events.py:59} DEBUG - Using selector: EpollSelector
[2023-09-13T13:11:56.118+0000] {base_job.py:240} DEBUG - [heartbeat]
[... repeated {base_job.py:240} DEBUG - [heartbeat] lines omitted ...]
[2023-09-13T13:14:44.247+0000] {base_job.py:240} DEBUG - [heartbeat]
```

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:15:10

*Thread Reply:* still the same error in the triggerer pod

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:16:23

*Thread Reply:* have changed the dags folder where I have added the __init__.py file as you suggested, and then have updated OPENLINEAGE_EXTRACTORS to big_query_insert_job_extractor.BigQueryInsertJobExtractor… still the same thing

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 09:36:27

*Thread Reply:* > still the same error in the triggerer pod it won't change, we're not trying to fix the triggerer import but worker, and should look only at worker pod at this point

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:43:34

*Thread Reply:* ```
extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'>

Using extractor BigQueryInsertJobExtractor task_type=BigQueryInsertJobOperator airflow_dag_id=data_analytics_dag task_id=join_bq_datasets.bq_join_holidays_weather_data_2021 airflow_run_id=manual_2023-09-13T13:24:08.946947+00:00

fatal: not a git repository (or any parent up to mount point /home/airflow)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /home/airflow)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
```

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:44:44

*Thread Reply:* able to see these logs in the worker pod… so what you said is right, it is able to get the extractor, but I get the above error immediately, where it says not a git repository

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 09:45:24

*Thread Reply:* seems like we are almost there… am I missing something obvious?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 10:06:35

*Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hm, this could be the actual bug?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:06:51

*Thread Reply:* that’s a routine log in composer

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:12:16

*Thread Reply:* extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'> - that’s actually the class from your custom module, right?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:14:03

*Thread Reply:* I’ve done an experiment, this is how gcs looks

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:14:09

*Thread Reply:* and env vars

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:14:19

*Thread Reply:* I have this extractor detected as expected

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:15:06

*Thread Reply:* seen as <class 'dependencies.bq.BigQueryInsertJobExtractor'>

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:16:02

*Thread Reply:* no __init__.py in base dags folder
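For reference, a minimal sketch of what such a custom extractor module could look like (assuming the openlineage-airflow BaseExtractor API; the dependencies/bq.py module path mirrors the experiment above, and the extract body is a stub, not the definitive implementation):

```python
# dependencies/bq.py - illustrative sketch; only the module/class names
# come from this thread, the extract body is a stub
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class BigQueryInsertJobExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # operator class names this extractor should handle
        return ["BigQueryInsertJobOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task's operator; build metadata from it
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")
```

It would then be registered via the env var mentioned above, e.g. OPENLINEAGE_EXTRACTORS=dependencies.bq.BigQueryInsertJobExtractor.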

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:17:02

*Thread Reply:* I also checked that the triggerer pod indeed has no gcsfuse set up, tbh no idea why, maybe some kind of optimization. The only effect is that when loading plugins in the triggerer it throws some errors in logs; we don’t do anything there at the moment

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 10:19:26

*Thread Reply:* okk…got it @Jakub Dardziński…so the __init__.py at the top level of the dags folder is also not required, got it. Just one more doubt: there is a requirement where i want to change the operator’s property in the extractor, inside the extract function. will that be taken into account, and the operator’s execute be called with the property that i have populated in my extractor?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 10:21:28

*Thread Reply:* for example i want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator i want to intercept that and add this job_id property to the operator…will that work?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:24:46

*Thread Reply:* I’m not sure if using OL for such a thing is the best choice. Wouldn’t it be better to subclass the operator?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:25:37

*Thread Reply:* but the answer is: it depends on the airflow version, in 2.3+ I’m pretty sure the changed property stays in the execute method

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-13 10:27:49

*Thread Reply:* yeah ideally that is how we should have done this, but the problem is our client has around 1000+ DAGs in different google cloud projects, which are owned by multiple teams…so they are not willing to change anything in their DAGs. Thankfully they are using airflow 2.4.3

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 10:31:15

*Thread Reply:* task_policy might be a better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html

➕ Jakub Dardziński
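A minimal sketch of that approach, assuming Airflow's cluster policy hook in airflow_local_settings.py (the injected value is purely illustrative, not part of the integration):

```python
# airflow_local_settings.py - sketch of a cluster policy that intercepts
# every BigQueryInsertJobOperator and injects a property; the job_id
# value here is illustrative
from airflow.models.baseoperator import BaseOperator


def task_policy(task: BaseOperator) -> None:
    # called for every task as its DAG is parsed, so no DAG code changes are needed
    if task.task_type == "BigQueryInsertJobOperator":
        # add only what you need; avoid clobbering user-defined settings
        if getattr(task, "job_id", None) is None:
            task.job_id = "my-custom-job-id"  # illustrative value
```

Since the policy runs at parse time in every scheduler/worker process, it applies across all DAGs without touching their code.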
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-13 10:35:30

*Thread Reply:* btw I double-checked - the execute method runs in a different process, so this would not change the task’s attribute there

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-16 03:32:49

*Thread Reply:* @Jakub Dardziński any idea how can we achieve this one. ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-12 17:26:01

@here has anyone succeeded in getting a custom extractor to work in GCP Cloud Composer or AWS MWAA, seems like there is no way

Mars Lan (Metaphor) (mars@metaphor.io)
2023-09-12 17:34:29

*Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 01:44:27

I am exploring the Spark - OpenLineage integration (using the latest PySpark and OL versions). I tested a simple pipeline which:
• Reads JSON data into a PySpark DataFrame
• Applies data transformations
• Writes the transformed data to a MySQL database
Observed that we receive 4 events (2 START and 2 COMPLETE) for the same job name. The events are almost identical with a small diff in the facets. All the events share the same runId, and we don't get any parentRunId. Team, can you please confirm if this behaviour is expected? It seems to be different from the Airflow integration, where we relate jobs to parent jobs.

Damien Hawes (damien.hawes@booking.com)
2023-09-13 02:54:37

*Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:

spark.openlineage.parentJobName
spark.openlineage.parentRunId

You can find the list of parameters here:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md
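For illustration, a sketch of passing those two parameters from PySpark (not a definitive setup: the package version, URL, and IDs are placeholders, and the transport settings follow the README's http transport):

```python
# sketch: wiring a parent job/run into the Spark integration so its
# events nest under an upstream job; all values below are placeholders
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol_parent_example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.1.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    # identify the parent run that spawned this Spark job
    .config("spark.openlineage.parentJobName", "my_dag.my_task")
    .config("spark.openlineage.parentRunId", "3bb703d1-09c1-4a42-8da5-35a0b3216072")
    .getOrCreate()
)
```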

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 02:55:51

*Thread Reply:* Thanks, will check this out

Damien Hawes (damien.hawes@booking.com)
2023-09-13 02:57:43

*Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-13 04:33:03

*Thread Reply:* Can you share the job and events? Also @Paweł Leszczyński

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 06:03:49

*Thread Reply:* Sure, sharing Job and events.

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 06:06:21

*Thread Reply:*

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-13 06:39:02

*Thread Reply:* Hi @Suraj Gupta,

Thanks for providing such a detailed description of the problem.

It is not expected behaviour, it's an issue. The events correspond to the same logical plan, which for some reason leads to sending two OL events. Is it reproducible, aka does it occur each time? If yes, please feel free to raise an issue for that.

We have added several tests in recent months to verify the number of OL events being generated, but we haven't tested it that way with JDBC. BTW, will the same happen if you write your data df_transformed to a file (like a parquet file)?

:gratitude_thank_you: Suraj Gupta
Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 07:28:03

*Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to file and get back.

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 07:33:35

*Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-13 07:33:54

*Thread Reply:* even if you write onto a file?

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 07:37:21

*Thread Reply:* Yes, even when I write to a parquet file.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-13 07:49:28

*Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-13 07:56:11
Suraj Gupta (suraj.gupta@atlan.com)
2023-09-25 16:32:09

*Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-26 03:32:03

*Thread Reply:* @Suraj Gupta I put a comment within your issue. it's a bug we need to solve, but I cannot give any estimates today.

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-26 04:33:03

*Thread Reply:* Thanks for the update @Paweł Leszczyński, also please look into this comment. It might be related and I'm not sure if it's expected behaviour.

Michael Robinson (michael.robinson@astronomer.io)
2023-09-13 14:20:32

@channel This month’s TSC meeting, open to all, is tomorrow: https://openlineage.slack.com/archives/C01CK9T7HKR/p1694113940400549

✅ Sheeri Cabral (Collibra)
Damien Hawes (damien.hawes@booking.com)
2023-09-14 06:20:15

Context:

We use Spark with YARN, running on Hadoop 2.x (I can't remember the exact minor version) with Hive support.

Problem:

I've noticed that CreateDataSourceAsSelectCommand objects are always transformed to an OutputDataset with a namespace value set to file - which is curious, because the inputs always have a (correct) namespace of hdfs://<name-node> - is this a known issue? A flaw with Apache Spark? A bug in the resolution logic?

For reference:

```java
public class CreateDataSourceTableCommandVisitor
    extends QueryPlanVisitor<CreateDataSourceTableCommand, OpenLineage.OutputDataset> {

  public CreateDataSourceTableCommandVisitor(OpenLineageContext context) {
    super(context);
  }

  @Override
  public List<OpenLineage.OutputDataset> apply(LogicalPlan x) {
    CreateDataSourceTableCommand command = (CreateDataSourceTableCommand) x;
    CatalogTable catalogTable = command.table();

    return Collections.singletonList(
        outputDataset()
            .getDataset(
                PathUtils.fromCatalogTable(catalogTable),
                catalogTable.schema(),
                OpenLineage.LifecycleStateChangeDatasetFacet.LifecycleStateChange.CREATE));
  }
}
```

Running this:

```
cat events.log | jq '{eventTime: .eventTime, eventType: .eventType, runId: .run.runId, jobNamespace: .job.namespace, jobName: .job.name, outputs: .outputs[] | {namespace: .namespace, name: .name}, inputs: .inputs[] | {namespace: .namespace, name: .name}}'
```

This is an output:

```json
{
  "eventTime": "2023-09-13T16:01:27.059Z",
  "eventType": "START",
  "runId": "bbbb5763-3615-46c0-95ca-1fc398c91d5d",
  "jobNamespace": "spark.cluster-1",
  "jobName": "ol_hadoop_test.execute_create_data_source_table_as_select_command.dhawes_db_ol_test_hadoop_tgt",
  "outputs": {
    "namespace": "file",
    "name": "/user/hive/warehouse/dhawes.db/ol_test_hadoop_tgt"
  },
  "inputs": {
    "namespace": "hdfs://nn1",
    "name": "/user/hive/warehouse/dhawes.db/ol_test_hadoop_src"
  }
}
```

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-14 07:32:25

*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently

Damien Hawes (damien.hawes@booking.com)
2023-09-14 09:16:58

*Thread Reply:* Hmm, I'll have to do explain plan to see what exactly it is.

However my sample job uses spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src")

which itself is created using

spark.sql("SELECT 1 AS id").write.format("orc").mode("overwrite").saveAsTable("dhawes.ol_test_hadoop_src")

Damien Hawes (damien.hawes@booking.com)
2023-09-14 09:23:59

*Thread Reply:* ```>>> spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src").explain(True)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dhawes`.`ol_test_hadoop_src`

== Analyzed Logical Plan ==
id: int
Project [id#3]
+- SubqueryAlias dhawes.ol_test_hadoop_src
   +- Relation[id#3] orc

== Optimized Logical Plan ==
Relation[id#3] orc

== Physical Plan ==
*(1) FileScan orc dhawes.ol_test_hadoop_src[id#3] Batched: true, Format: ORC, Location: InMemoryFileIndex[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>```

tati (tatiana.alchueyr@astronomer.io)
2023-09-14 10:03:41

Hey everyone, any chance we could have an openlineage-integration-common 1.1.1 release with the following changes?
• https://github.com/OpenLineage/OpenLineage/pull/2106
• https://github.com/OpenLineage/OpenLineage/pull/2108

➕ Michael Robinson, Harel Shein, Maciej Obuchowski, Jakub Dardziński, Paweł Leszczyński, Julien Le Dem
tati (tatiana.alchueyr@astronomer.io)
2023-09-14 10:05:19

*Thread Reply:* Especially the first PR, which is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533

Michael Robinson (michael.robinson@astronomer.io)
2023-09-14 10:05:24

*Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize

:gratitude_thank_you: tati
Michael Robinson (michael.robinson@astronomer.io)
2023-09-14 11:59:55

*Thread Reply:* The release is authorized and will be initiated within two business days.

🎉 tati
tati (tatiana.alchueyr@astronomer.io)
2023-09-15 04:40:12

*Thread Reply:* Thanks a lot, @Michael Robinson!

Julien Le Dem (julien@apache.org)
2023-09-14 20:23:01

Per discussion in the OpenLineage sync today here is a very early strawman proposal for an OpenLineage registry that producers and consumers could be registered in. Feedback or alternate proposals welcome https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit Once this is sufficiently fleshed out, I’ll create an actual proposal on github

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2023-10-03 20:33:35

*Thread Reply:* I have cleaned up the registry proposal. https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit
In particular:
• I clarified that option 2 is preferred at this point.
• I moved discussion notes to the bottom. They will go away at some point.
• Once it is stable, I’ll create a proposal with the preferred option.
• We need a good proposal for the core facets prefix. My suggestion is to move core facets to core in the registry. The drawback is the prefix would be inconsistent.

Julien Le Dem (julien@apache.org)
2023-10-05 17:34:12

*Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I’ll turn it into a md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu https://github.com/OpenLineage/OpenLineage/issues/2161

Michael Robinson (michael.robinson@astronomer.io)
2023-09-15 12:03:27

@channel Friendly reminder: the next OpenLineage meetup, our first in Toronto, is happening this coming Monday at 5 PM ET https://openlineage.slack.com/archives/C01CK9T7HKR/p1694441261486759

👍 Maciej Obuchowski
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-16 03:30:27

@here we have a dataproc operator getting called from a dag which submits a spark job. we wanted to maintain the continuity of the parent job in the spark job, and according to the documentation we can achieve that by using a macro called lineage_run_id that requires task and task_instance as parameters. The problem we are facing is that our clients have 1000's of dags, so asking them to change this everywhere it is used is not feasible. so we thought of using the task_policy feature in airflow…but the problem is that task_policy gives you access only to the task/operator, and we don’t have access to the task instance that is required as a parameter to the lineage_run_id function. Can anyone kindly help us on how we should go about this one?
```python
t1 = DataProcPySparkOperator(
    task_id=job_name,
    # required pyspark configuration,
    job_name=job_name,
    dataproc_pyspark_properties={
        'spark.driver.extraJavaOptions':
            f"-javaagent:{jar}={os.environ.get('OPENLINEAGE_URL')}/api/v1/namespaces/{os.getenv('OPENLINEAGE_NAMESPACE', 'default')}/jobs/{job_name}/runs/{{{{macros.OpenLineagePlugin.lineage_run_id(task, task_instance)}}}}?api_key={os.environ.get('OPENLINEAGE_API_KEY')}"
    },
    dag=dag)
```

➕ Abdallah
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-16 04:22:47

*Thread Reply:* you don't need the actual task instance to do that. you only need to set the additional argument as a jinja template, same as above

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-16 04:25:28

*Thread Reply:* task_instance in this case is just part of the string, which is evaluated when the jinja render happens

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-16 04:27:10

*Thread Reply:* ohh…then we could use the same example as above inside the task_policy to intercept the Operator and add the openlineage-specific additional properties?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-16 04:30:59

*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones
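Putting the two together, a sketch of how such a policy could look (an illustration under stated assumptions, not the definitive implementation: OPENLINEAGE_JAR is a hypothetical env var for the agent jar path, and the jinja string is only rendered at runtime because the property is templated, as in the snippet above):

```python
# airflow_local_settings.py - sketch: inject the jinja-templated
# lineage_run_id macro into DataProcPySparkOperator properties at parse
# time; env var handling mirrors the DAG snippet above
import os


def task_policy(task) -> None:
    if task.task_type != "DataProcPySparkOperator":
        return
    props = dict(getattr(task, "dataproc_pyspark_properties", None) or {})
    if "spark.driver.extraJavaOptions" in props:
        return  # don't override user-defined settings
    props["spark.driver.extraJavaOptions"] = (
        f"-javaagent:{os.environ['OPENLINEAGE_JAR']}="  # hypothetical env var
        f"{os.environ['OPENLINEAGE_URL']}/api/v1/namespaces/"
        f"{os.getenv('OPENLINEAGE_NAMESPACE', 'default')}/jobs/{task.task_id}/runs/"
        "{{ macros.OpenLineagePlugin.lineage_run_id(task, task_instance) }}"
        f"?api_key={os.environ['OPENLINEAGE_API_KEY']}"
    )
    task.dataproc_pyspark_properties = props
```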

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-16 04:32:02

*Thread Reply:* yeah sure…thank you so much @Jakub Dardziński, will try this out and keep you posted

👍 Jakub Dardziński
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-16 05:00:24

*Thread Reply:* We want to automate setting those options at some point inside the operator itself

➕ Guntaka Jeevan Paul
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-16 19:40:27

@here is there a way by which we could add custom headers to the openlineage client in airflow? i see that provision is there for the spark integration via properties like spark.openlineage.transport.headers.xyz --> abcdef

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-19 16:40:55

*Thread Reply:* there’s no out-of-the-box possibility to do that yet, you’re very welcome to create an issue in GitHub and maybe contribute as well! 🙂

Mars Lan (Metaphor) (mars@metaphor.io)
2023-09-17 09:07:41

It doesn't seem like there's a way to override the OL endpoint from the default (/api/v1/lineage) in Airflow? I tried setting the OPENLINEAGE_ENDPOINT environment variable to no avail. Based on this statement, it seems that only OPENLINEAGE_URL was used to construct HttpConfig?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-18 16:25:11

*Thread Reply:* That’s correct. For now there’s no way to configure the endpoint via env var. You can do that by using a config file
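For reference, a sketch of such a config file (an openlineage.yml, assuming the python client's http transport options; the values are placeholders):

```yaml
# openlineage.yml - placeholder values; `endpoint` overrides the
# default /api/v1/lineage path
transport:
  type: http
  url: http://my-backend:8080
  endpoint: api/custom/lineage
```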

Mars Lan (Metaphor) (mars@metaphor.io)
2023-09-18 16:30:39

*Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-18 16:52:48

*Thread Reply:* historical I guess? go for the PR, of course 🚀

Mars Lan (Metaphor) (mars@metaphor.io)
2023-10-03 08:52:16

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151

Terese Larsson (terese@jclab.se)
2023-09-18 08:22:34

Hi! I'm in need of help with wrapping my head around OpenLineage. My team has the goal of collecting metadata from the Airflow operators GreatExpectationsOperator, PythonOperator, MsSqlOperator and BashOperator (for dbt). Where can I see the source code for what is collected for each operator, and is there support for these in the new provider apache-airflow-providers-openlineage? I am super confused and feel lost in the docs. 🤯 We are using MSSQL/ODBC to connect to our db, and this data does not seem to appear as datasets in Marquez, do I need to configure this? If so, HOW and WHERE? 🥲

Happy for any help, big or small! 🙏

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-18 16:26:07

*Thread Reply:* there’s no actual single source of what integrations are currently implemented in openlineage Airflow provider. That’s something we should work on so it’s more visible

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-18 16:26:46

*Thread Reply:* answering this quickly - GE & MS SQL are not implemented yet in the provider

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-18 16:26:58

*Thread Reply:* but I also invite you to contribute if you’re interested! 🙂

sarathch (sarathch@hpe.com)
2023-09-19 02:47:47

Hi, I need help extracting OpenLineage for PostgresOperator in JSON format. Any suggestions or comments would be greatly appreciated

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-19 16:40:06

*Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

❤️ sarathch
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-19 16:40:54

*Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage

sarathch (sarathch@hpe.com)
2023-09-20 06:26:56

*Thread Reply:* Maciej, Thanks for sharing the link https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html this should address the issue

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-09-20 09:36:54

congrats folks 🥳 https://lfaidata.foundation/blog/2023/09/20/lf-ai-data-foundation-announces-graduation-of-openlineage-project

🎉 Jakub Dardziński, Mars Lan (Metaphor), Ross Turk, Guntaka Jeevan Paul, Peter Hicks, Maciej Obuchowski, Athitya Kumar, John Lukenoff, Harel Shein, Francis McGregor-Macdonald, Laurent Paris
👍 Athitya Kumar
❤️ Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-09-20 17:08:58

@channel We released OpenLineage 1.2.2!
Added
• Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener #2089 @d-m-h
• Spark: capture and emit spark.databricks.clusterUsageTags.clusterAllTags variable from databricks environment #2099 @Anirudh181001
Fixed
• Common: support parsing dbt_project.yml without target-path #2106 @tatiana
• Proxy: fix Proxy chart #2091 @harels
• Python: fix serde filtering #2044 @xli-1026
• Python: use non-deprecated apiKey if loading it from env variables #2029 @mobuchowski
• Spark: improve RDDs on S3 integration #2039 @pawel-big-lebowski
• Flink: prevent sending running events after job completes #2075 @pawel-big-lebowski
• Spark & Flink: unify dataset naming from URI objects #2083 @pawel-big-lebowski
• Spark: Databricks improvements #2076 @pawel-big-lebowski
Removed
• SQL: remove sqlparser dependency from iface-java and iface-py #2090 @JDarDagran
Thanks to all the contributors, including new contributors @tati, @xli-1026, and @d-m-h!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.2.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.1.0...1.2.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🔥 Maciej Obuchowski, Harel Shein, Anirudh Shrinivason
👍 Guntaka Jeevan Paul, John Rosenbaum, Sangeeta Mishra
Yevhenii Soboliev (esoboliev@griddynamics.com)
2023-09-22 21:05:20

*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you’ve done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn’t find any solutions or official support at either OpenLineage or DataHub, but I still want to continue using OpenLineage

Michael Robinson (michael.robinson@astronomer.io)
2023-09-22 21:30:22

*Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-09-23 08:11:17

*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.

👍 Yevhenii Soboliev
🎯 Yevhenii Soboliev
Suraj Gupta (suraj.gupta@atlan.com)
2023-09-21 02:11:10

Hi, we're using Custom Operators in airflow(2.5) and are planning to expose lineage via default extractors: https://openlineage.io/docs/integrations/airflow/default-extractors/ Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible? Since OpenLineage has now moved inside airflow and I think there is no concept of extractors in the latest version.

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-21 02:15:00

*Thread Reply:* Also, do we have any docs on how OL works with the latest airflow version? Few questions:
• How is it replacing the concept of custom extractors and Manually Annotated Lineage in the latest version?
• Do we have any examples of setting up the integration to emit input/output datasets for non-supported Operators like PythonOperator?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-27 10:04:09

*Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible?
It will be compatible; “default extractors” is generally the same concept as we’re using in the 2.7 integration. One thing that might be good to update is import paths, from openlineage.airflow to airflow.providers.openlineage, but it should work both ways

> • Do we have any code samples/docs of setting up the integration to emit input/output datasets for non supported Operators like PythonOperator?
Our experience with that is currently lacking - this means it works like in bare airflow: if you annotate your PythonOperator tasks with old Airflow lineage, like in this doc (see the sketch after this message).

We want to make this experience better - by doing a few things:
• instrumenting hooks, then collecting lineage from them
• integration with AIP-48 datasets
• allowing to emit lineage collected inside an Airflow task by other means, by providing a core Airflow API for that
All those things require changing core Airflow in a couple of ways:
• tracking which hooks were used during PythonOperator execution
• just being able to emit datasets (airflow inlets/outlets) from inside of a task - they are now a static thing, so if you try that it does not work
• providing a better API for emitting that lineage, preferably based on OpenLineage itself rather than us having to convert it later.
As this requires core Airflow changes, it won’t be live until Airflow 2.8 at the earliest.

thanks to @Maciej Obuchowski for this response
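For reference, a minimal sketch of the annotation approach mentioned above, assuming Airflow's built-in lineage entities (table/file names are illustrative):

```python
# sketch: static inlets/outlets on a taskflow task, per the Airflow
# lineage doc referenced above; names and paths are illustrative
from airflow.decorators import task
from airflow.lineage.entities import File, Table


@task(
    inlets=[File(url="s3://bucket/raw/events.json")],
    outlets=[Table(database="warehouse", cluster="mssql", name="dbo.events")],
)
def transform():
    ...  # task body; the lineage above is static, declared at parse time
```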

Jason Yip (jasonyip@gmail.com)
2023-09-21 18:36:17

I am using this accelerator that leverages OpenLineage on Databricks to publish lineage info to Purview, but it's using a rather old version of OpenLineage, aka 0.18. Has anybody tried it on a newer version of OpenLineage? I am facing some issues where the inputs and outputs for the same object have different json https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/

✅ Harel Shein
Jason Yip (jasonyip@gmail.com)
2023-09-21 21:51:41

I installed 1.2.2 on Databricks, followed the below init script: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh

my cluster config looks like this:

```
spark.openlineage.version v1
spark.openlineage.namespace adb-5445974573286168.8#default
spark.openlineage.endpoint v1/lineage
spark.openlineage.url.param.code 8kZl0bo2TJfnbpFxBv-R2v7xBDj-PgWMol3yUm5iP1vaAzFu9kIZGg==
spark.openlineage.url https://f77b-50-35-69-138.ngrok-free.app
```

But it is not calling the API, it works fine with 0.18 version

✅ Harel Shein
Jason Yip (jasonyip@gmail.com)
2023-09-21 23:16:10

I am attaching the log4j, there is no openlineagecontext

✅ Harel Shein
Jason Yip (jasonyip@gmail.com)
2023-09-21 23:47:22

*Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929

Harel Shein (harel.shein@gmail.com)
2023-09-25 08:59:10

*Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 05:11:50

@here I'm presently addressing a particular scenario that pertains to Openlineage authentication, specifically involving the use of an access key and secret.

I've implemented a custom token provider called AccessKeySecretKeyTokenProvider, which extends the TokenProvider class. This token provider communicates with another service, obtaining a token and an expiration time based on the provided access key, secret, and client ID.

My goal is to retain this token in a cache prior to its expiration, thereby eliminating the need for network calls to the third-party service. Is this possible without relying on an external caching system?

Harel Shein (harel.shein@gmail.com)
2023-09-25 08:56:53

*Thread Reply:* Hey @Sangeeta Mishra, I’m not sure that I fully understand your question here. What do you mean by OpenLineage authentication? What are you using to generate OL events? What’s your OL receiving backend?

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 09:04:33

*Thread Reply:* Hey @Harel Shein, I wanted to clarify the previous message. I apologize for any confusion. When I mentioned "OpenLineage authentication," I was actually referring to the authentication process for the OpenLineage backend, specifically using HTTP transport. This involves using my custom token provider, which utilizes access keys and secrets for authentication. The OL backend is http based backend . I hope this clears things up!

Harel Shein (harel.shein@gmail.com)
2023-09-25 09:05:12

*Thread Reply:* Are you using Marquez?

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 09:05:55

*Thread Reply:* We are trying to leverage our own backend here.

Harel Shein (harel.shein@gmail.com)
2023-09-25 09:07:03

*Thread Reply:* I see.. I’m not sure the OpenLineage community could help here. Which webserver framework are you using?

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 09:08:56

*Thread Reply:* KTOR framework

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 09:15:33

*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited time of expiry. Hence, we wanted to cache this information inside the token provider.

Harel Shein (harel.shein@gmail.com)
2023-09-25 09:26:57

*Thread Reply:* I see, I would ask this question here https://ktor.io/support/

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-25 10:12:52

*Thread Reply:* Thank you

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-26 04:13:20

*Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?

Sangeeta Mishra (sangeeta@acceldata.io)
2023-09-26 04:19:53

*Thread Reply:* @Paweł Leszczyński I am using python client
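For the python client, a minimal sketch of caching the token inside the custom provider (assuming the client's TokenProvider interface; _fetch_token stands in for the hypothetical call to the third-party auth service):

```python
# sketch: cache a short-lived token inside a custom TokenProvider for
# the python client's http transport; _fetch_token is hypothetical
import time
from typing import Optional, Tuple

from openlineage.client.transport.http import TokenProvider


class AccessKeySecretKeyTokenProvider(TokenProvider):
    def __init__(self, config: dict):
        super().__init__(config)
        self._token: Optional[str] = None
        self._expires_at: float = 0.0

    def get_bearer(self) -> Optional[str]:
        # refresh only when the cached token is about to expire
        if self._token is None or time.time() >= self._expires_at - 60:
            self._token, self._expires_at = self._fetch_token()
        return f"Bearer {self._token}"

    def _fetch_token(self) -> Tuple[str, float]:
        # hypothetical: exchange access key/secret/client id for a
        # (token, expiry_timestamp) pair from the auth service
        raise NotImplementedError
```

Since the provider instance lives as long as the transport, the token survives between emits without any external cache.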

Suraj Gupta (suraj.gupta@atlan.com)
2023-09-25 13:36:25

I'm using the Spark OpenLineage integration. In the outputStatistics output dataset facet we receive rowCount and size. The job performs a SQL insert into a MySQL table, and I'm receiving the size as 0.
```json
{
  "outputStatistics": {
    "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.1.0/integration/spark",
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet",
    "rowCount": 1,
    "size": 0
  }
}
```
I'm not sure what the size means here. Does this mean the number of bytes inserted/updated? Also, do we have any documentation for Spark-specific job and run facets?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:56:00

*Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-26 00:51:30

@here In the Airflow integration we send a lineage event for DAG start and complete, but that is not the case with the spark integration…we don’t receive any event for the application start and complete in spark…is this expected behaviour or am i missing something?

➕ Suraj Gupta
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:47:39

*Thread Reply:* For spark we do send start and complete for each spark action being run (a single operation that causes spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a spark job or a spark script.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:49:35

*Thread Reply:* I think we need to look deeper into that, as there is a recurring need to capture such information

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:49:57

*Thread Reply:* and the spark listener has methods like onApplicationStart and onApplicationEnd

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-27 09:50:13

*Thread Reply:* We are using the SparkListener, which has a function called onApplicationStart that gets called whenever a spark application starts, so i was thinking why can't we send one at start and similarly at end as well

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:50:33

*Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-27 09:51:11

*Thread Reply:* yeah exactly. the way that it works with airflow integration

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:51:26

*Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-27 09:52:08

*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings and raise that issue and convince the community of its importance

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-09-27 09:53:32

*Thread Reply:* yeah sure would love to do that…how can i join them, will that be posted here in this slack channel?

Michael Robinson (michael.robinson@astronomer.io)
2023-09-27 09:54:08

*Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community

🙌 Paweł Leszczyński
:gratitude_thank_you: Guntaka Jeevan Paul
Michael Robinson (michael.robinson@astronomer.io)
2023-09-27 11:19:16

Meetup recap: Toronto Meetup @ Airflow Summit, September 18, 2023
It was great to see so many members of our community at this event! I counted 32 total attendees, with all but a handful being first-timers.
Topics included:
• Presentation on the history, architecture and roadmap of the project by @Julien Le Dem and @Harel Shein
• Discussion of OpenLineage support in Marquez by @Willy Lulciuc
• Presentation by Ye Liu and Ivan Perepelitca from Metaphor, the social platform for data, about their integration
• Presentation by @Paweł Leszczyński about the Spark integration
• Presentation by @Maciej Obuchowski about the Apache Airflow Provider
Thanks to all the presenters and attendees, with a shout out to @Harel Shein for the help with organizing and day-of logistics, @Jakub Dardziński for the help with set up/clean up, and @Sheeri Cabral (Collibra) for the crucial assist with the signup sheet.
This was our first meetup in Toronto, and we learned some valuable lessons about planning events in new cities: the first and foremost being to ask for a pic of the building! 🙂 But it seemed like folks were undeterred, and the space itself lived up to expectations. For a recording and clips from the meetup, head over to our YouTube channel.
Upcoming events:
• October 5th in San Francisco: Marquez Meetup @ Astronomer (sign up here: https://www.meetup.com/meetup-group-bnfqymxe/events/295444209/)
• November: Warsaw meetup (details, date TBA)
• January: London meetup (details, date TBA)
Are you interested in hosting or co-hosting an OpenLineage or Marquez meetup? DM me!

🙌 Mars Lan (Metaphor), Harel Shein, Paweł Leszczyński
❤️ Jakub Dardziński, Harel Shein, Rodrigo Maia, Paweł Leszczyński, Julien Le Dem, Willy Lulciuc
🚀 Jakub Dardziński, Kevin Languasco
😅 Harel Shein
✅ Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2023-09-27 11:55:47
Damien Hawes (damien.hawes@booking.com)
2023-09-27 12:23:05

Hi folks, am I correct in my observations that the Spark integration does not generate inputs and outputs for Kafka-to-Kafka pipelines?

EDIT: Removed the crazy wall of text. Relevant GitHub issue is here.

👀 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-28 02:42:18

*Thread Reply:* responded within the issue

Erik Alfthan (slack@alfthan.eu)
2023-09-28 02:40:40

Hello community First time poster - bear with me :)

I am looking to make a minor PR on the airflow integration (fixing github #2130), and the code change is easy enough, but I fail to install the python environment. I have tried the simple ones
OpenLineage/integration/airflow > pip install -e .
or
OpenLineage/integration/airflow > pip install -r dev-requirements.txt
but they both fail on
ERROR: No matching distribution found for openlineage-sql==1.3.0

(which I think is an unreleased version in the git project)

How would I go about to install the requirements?

//Erik

PS. Sorry for posting this in general if there is a specific integration or contribution channel - I didnt find a better channel

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-28 03:04:48

*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not an airflow integration expert, but it looks to me like you're missing the openlineage-sql library, which is a rust library used to extract lineage from sql queries. This is how we do that in circle ci: https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8080/workflows/aba53369-836c-48f5-a2dd-51bc0740a31c/jobs/140113

and subproject page with build instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql

and subproject page with build instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql

Erik Alfthan (slack@alfthan.eu)
2023-09-28 03:07:23

*Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?

I was hoping for something more automagical, but that should work

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-09-28 03:08:06

*Thread Reply:* I think so. @Jakub Dardziński am I right?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 03:18:27

*Thread Reply:* https://openlineage.io/docs/development/developing/python/setup there’s a guide how to setup the dev environment

> Typically, you first need to build openlineage-sql locally (see README). After each release you have to repeat this step in order to bump local version of the package. This might be somewhat exposed more in GitHub repository README as well

Erik Alfthan (slack@alfthan.eu)
2023-09-28 03:27:20

*Thread Reply:* It didn't find the wheel in the cache, but if I used the line in the sql/README.md
pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
it is installed and thus skipped/passed when pip later checks if it needs to be installed.

Now I have a second issue because it is expecting me to have mysqlclient-2.2.0, which seems to need a binary:
Command 'pkg-config --exists mysqlclient' returned non-zero exit status 127
and
Command 'pkg-config --exists mariadb' returned non-zero exit status 127
I am on Ubuntu 22.04 in WSL2. Should I go to apt and grab me a mysql client?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 03:31:52

*Thread Reply:* > It didn't find the wheel in the cache, but if I used the line in the sql/README.md
> pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
> It is installed and thus skipped/passed when pip later checks if it needs to be installed.
That’s actually expected. You should build the new wheel locally and then install it.

> Now I have a second issue because it is expecting me to have mysqlclient-2.2.0 which seems to need a binary
> Command 'pkg-config --exists mysqlclient' returned non-zero exit status 127
> and
> Command 'pkg-config --exists mariadb' returned non-zero exit status 127
> I am on Ubuntu 22.04 in WSL2. Should I go to apt and grab me a mysql client?
We’ve left some system-specific configuration, e.g. mysqlclient, to users, as it’s a bit aside from OpenLineage and more of a general development task.

probably sudo apt-get install python3-dev default-libmysqlclient-dev build-essential should work

Erik Alfthan (slack@alfthan.eu)
2023-09-28 03:32:04

*Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 03:35:46

*Thread Reply:* You could do that as well but if you want to test your changes vs many Airflow versions that wouldn’t be possible I think (run them with tox btw)

Erik Alfthan (slack@alfthan.eu)
2023-09-28 04:54:39

*Thread Reply:* This is starting to feel like a rabbit hole 😞

When I run tox, I get a lot of build errors
• client needs to be built
• sql needs to be built to a different target than its readme says
• a lot of builds fail on cython_sources

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 05:19:34

*Thread Reply:* would you like to share some exact log lines? I’ve never seen such errors, they probably are system specific

Erik Alfthan (slack@alfthan.eu)
2023-09-28 06:45:48

*Thread Reply:* ```Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
    /tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in `setup.cfg`
    !!

    ****************************************************************************************************************************************************************
    The license_file parameter is deprecated, use license_files instead.

    By 2023-Oct-30, you need to update your project and remove deprecated calls
    or your builds will no longer be supported.

    See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
    ****************************************************************************************************************************************************************

    !!
      parsed = self.parsers.get(option_name, lambda x: x)(value)
    running egg_info
    writing lib3/PyYAML.egg-info/PKG-INFO
    writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
    writing top-level names to lib3/PyYAML.egg-info/top_level.txt
    Traceback (most recent call last):
      File "/home/obr_erikal/projects/OpenLineage/integration/airflow/.tox/py3-airflow-2.1.4/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
        main()
      File "/home/obr_erikal/projects/OpenLineage/integration/airflow/.tox/py3-airflow-2.1.4/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
        json_out['return_val'] = hook(**hook_input['kwargs'])
      File "/home/obr_erikal/projects/OpenLineage/integration/airflow/.tox/py3-airflow-2.1.4/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
        return hook(config_settings)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
        return self._get_build_requires(config_settings, requirements=['wheel'])
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
        self.run_setup()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in run_setup
        exec(code, locals())
      File "<string>", line 271, in <module>
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
        return distutils.core.setup(**attrs)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
        return run_commands(dist)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
        dist.run_commands()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
        self.run_command(cmd)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
        super().run_command(command)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 318, in run
        self.find_sources()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 326, in find_sources
        mm.run()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 548, in run
        self.add_defaults()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 586, in add_defaults
        sdist.add_defaults(self)
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 113, in add_defaults
        super().add_defaults()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
        self._add_defaults_ext()
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
        self.filelist.extend(build_ext.get_source_files())
      File "<string>", line 201, in get_source_files
      File "/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
        raise AttributeError(attr)
    AttributeError: cython_sources
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
py3-airflow-2.1.4: exit 1 (7.85 seconds) /home/obr_erikal/projects/OpenLineage/integration/airflow> python -m pip install --find-links target/wheels/ --find-links ../sql/iface-py/target/wheels --use-deprecated=legacy-resolver --constraint=https://raw.githubusercontent.com/apache/airflow/constraints-2.1.4/constraints-3.8.txt apache-airflow==2.1.4 'mypy>=0.9.6' pytest pytest-mock -r dev-requirements.txt pid=368621
py3-airflow-2.1.4: FAIL ✖ in 7.92 seconds```

Erik Alfthan (slack@alfthan.eu)
2023-09-28 06:53:54

*Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 06:58:15

*Thread Reply:* for the error - I think there’s a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?

👀 Erik Alfthan
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 06:58:57

*Thread Reply:* we’re using ruff , tox runs it as one of commands

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:00:37

*Thread Reply:* Not in the airflow folder?
```
OpenLineage/integration/airflow$ maturin build --out target/wheels
💥 maturin failed
  Caused by: pyproject.toml at /home/obr_erikal/projects/OpenLineage/integration/airflow/pyproject.toml is invalid
  Caused by: TOML parse error at line 1, column 1
    |
  1 | [tool.ruff]
    | ^
  missing field `build-system`
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:02:32

*Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md

so
```
cd iface-py
python -m pip install maturin
maturin build --out ../target/wheels
```
becomes
```
cd iface-py
python -m pip install maturin
maturin build --out target/wheels
```
tox runs
```
install_command = python -m pip install {opts} --find-links target/wheels/ \
    --find-links ../sql/iface-py/target/wheels
```
but it should be
```
install_command = python -m pip install {opts} --find-links target/wheels/ \
    --find-links ../sql/target/wheels
```
actually, and I’m posting a PR to fix that

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:05:12

*Thread Reply:* yes, that part I actually worked out myself, but the cython_sources error I fail to understand the cause of. I have python3-dev installed on WSL Ubuntu with python version 3.10.12 in a virtualenv. Anything in that that could cause issues?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:12:20

*Thread Reply:* looks like it has something to do with the latest release of Cython? pip install "Cython<3" maybe solves the issue?

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:15:06

*Thread Reply:* I didn't have any cython before the install. Also no change. Could it be some update to setuptools itself? seems like the deprecation notice and the error are coming from inside setuptools

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:16:59

*Thread Reply:* (I.e. I tried the pip install "Cython<3" command without any change in the output)

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:20:30

*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR, though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)

If the issue persists on my own computer, I'll dig a bit further

If the issue persists on my own computer, I'll dig a bit further

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:21:03

*Thread Reply:* It’s a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:22:41

*Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.

👍 Jakub Dardziński
Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:25:10

*Thread Reply:* Is there an official release cycle?

or more specifically, given that the PRs are approved, how soon can they reach openlineage-dbt and apache-airflow-providers-openlineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:28:58

*Thread Reply:* we need to differentiate some things:

1. OpenLineage repository:
   a. dbt integration - this is the only place where it is maintained
   b. Airflow integration - here we only keep backwards compatibility, but generally speaking, starting from Airflow 2.7+ we would like to do all the job in the Airflow repo as the OL Airflow provider
2. Airflow repository - there’s only the Airflow OpenLineage provider, compatible (and works best) with Airflow 2.7+

we have control over releases (obviously) in the OL repo - it’s a monthly cycle, so beginning next week that should happen. There’s also a possibility to ask for an ad-hoc release in the #general slack channel, and with approvals of committers the new version is also released

For Airflow providers - the cycle is monthly as well

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:31:30

*Thread Reply:* it’s a bit complex for this split but needed temporarily

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:31:47

*Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:32:28

*Thread Reply:* it’s not, two separate places ~and we haven’t even added the whole thing with converting old lineage objects to OL specific~

editing, that’s not true

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:34:40

*Thread Reply:* the code’s here: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/extractors/manager.py#L154

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:35:17

*Thread Reply:* sorry I did not mention this earlier. we definitely need to add some guidance how to proceed with contributions to OL and Airflow OL provider

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:36:10

*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our parquet file ingest.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:37:12

*Thread Reply:* may I ask if you use some custom operator / python operator there?

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:37:33

*Thread Reply:* yeah, taskflow with inlets/outlets

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:38:38

*Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-09-28 07:39:54

*Thread Reply:* awesome 👍 we have plans to integrate more with Python operator as well but not earlier than in Airflow 2.8

Erik Alfthan (slack@alfthan.eu)
2023-09-28 07:43:41

*Thread Reply:* I guess writing a generic extractor for the python operator is quite hard, but if you could support some inlet/outlet type for tabular fileformat / their python libraries like pyarrow or maybe even pandas and document it, I think a lot of people would understand how to use them

➕ Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-09-28 16:16:24

Are you located in the Brussels area or within commutable distance? Interested in attending a meetup between October 16-20? If so, please DM @Sheeri Cabral (Collibra) or myself. TIA

❤️ Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2023-10-02 11:58:32

@channel Hello all, I’d like to open a vote to release OpenLineage 1.3.0, including:
• support for Spark 3.5 in the Spark integration
• scheme preservation bug fix in the Spark integration
• fix for the find-links path in tox bug in the Airflow integration
• more graceful logging when no OL provider is installed in the Airflow integration
• addition of columns as schema facet for airflow.lineage.Table
• addition of SQLSERVER to supported dbt profile types
Three +1s from committers will authorize. Thanks in advance.

🙌 Harel Shein, Paweł Leszczyński, Rodrigo Maia
👍 Jason Yip, Paweł Leszczyński
➕ Willy Lulciuc, Jakub Dardziński, Erik Alfthan, Julien Le Dem
Michael Robinson (michael.robinson@astronomer.io)
2023-10-02 17:00:08

*Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.

Jason Yip (jasonyip@gmail.com)
2023-10-02 17:11:46

*Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+, sometimes there's no inputs / outputs, hope that is fixed?

Harel Shein (harel.shein@gmail.com)
2023-10-03 09:59:24

*Thread Reply:* @Jason Yip if it isn’t fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix

👍 Jason Yip
Jason Yip (jasonyip@gmail.com)
2023-10-03 20:23:40

*Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks

Jason Yip (jasonyip@gmail.com)
2023-10-03 20:30:15

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

Jason Yip (jasonyip@gmail.com)
2023-10-03 20:30:21

*Thread Reply:* and of course this issue still exists

Harel Shein (harel.shein@gmail.com)
2023-10-03 21:45:09

*Thread Reply:* thanks for posting, we’ll continue looking into this.. if you find any clues that might help, please let us know.

Jason Yip (jasonyip@gmail.com)
2023-10-03 21:46:27

*Thread Reply:* are there any instructions on how to hook up a debugger to OL?

Harel Shein (harel.shein@gmail.com)
2023-10-04 09:04:16

*Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!

Harel Shein (harel.shein@gmail.com)
2023-10-04 09:05:58

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147

👀 Paweł Leszczyński
👍 Jason Yip
Jason Yip (jasonyip@gmail.com)
2023-10-05 03:20:11

*Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!

Harel Shein (harel.shein@gmail.com)
2023-10-05 15:05:08

*Thread Reply:* we’ll ask for a release once it’s reviewed and merged

Michael Robinson (michael.robinson@astronomer.io)
2023-10-02 12:28:28

@channel The September issue of OpenLineage News is here! This issue covers the big news about OpenLineage coming out of Airflow Summit, progress on the Airflow Provider, highlights from our meetup in Toronto, and much more. To get the newsletter directly in your inbox each month, sign up here.

🦆 Harel Shein, Paweł Leszczyński
🔥 Willy Lulciuc, Jakub Dardziński, Paweł Leszczyński
Damien Hawes (damien.hawes@booking.com)
2023-10-03 03:44:36

Hi folks - I'm wondering if it's just me, but does io.openlineage:openlineage-sql-java:1.2.2 ship with the arm64.dylib binary? When I try to run code that uses the Java package on an Apple M1, the binary isn't found. The workaround is to checkout 1.2.2 and then build and publish it locally.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-03 09:01:38

*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script, new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - being run that modifies the codebase.

So, if you pull the code, it contains an OL version that has not been released yet, and in the case of dependencies, one needs to build them on their own.

For example, here https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#preparation the Preparation section describes how to build openlineage-java and openlineage-sql in order to build openlineage-spark.

Damien Hawes (damien.hawes@booking.com)
2023-10-04 05:27:26

*Thread Reply:* Hmm. Let's elaborate my use case a bit.

We run Apache Hive on-premise. Hive provides query execution hooks for pre-query, post-query, and I think failed query.

Anyway, as part of the hook, you're given the query string.

So I, naturally, tried to pass the query string into OpenLineageSql.parse(Collections.singletonList(hookContext.getQueryPlan().getQueryStr()), "hive") in order to test this out.

I was using openlineage-sql-java:1.2.2 at that time, and no matter what query string I gave it, nothing was returned.

I then stepped through the code and saw that it was looking for the arm64 lib, and that the package (downloaded from Maven Central) lacked that particular native binary.

Damien Hawes (damien.hawes@booking.com)
2023-10-04 05:27:36

*Thread Reply:* I hope that helps.

👍 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-04 09:03:02

*Thread Reply:* I get it now. In CircleCI we do have 3 build steps:
- build-integration-sql-x86
- build-integration-sql-arm
- build-integration-sql-macos
but none for Mac M1. I think at that time CircleCI did not have a proper resource class in the free plan. Additionally, @Maciej Obuchowski would prefer to migrate this to GitHub Actions, as he claims this can be achieved there in a cleaner way (https://github.com/OpenLineage/OpenLineage/issues/1624).

Feel free to create an issue for this. Others will be able to upvote it in case they have a similar experience.

Assignees
<a href="https://github.com/mobuchowski">@mobuchowski</a>
Labels
ci, integration/sql
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-23 11:56:12

*Thread Reply:* It still doesn't have the free resource class 😞 We're blocked on that, unfortunately. The other solution would be to migrate to GH Actions, where most of our setup could be replaced by something like https://github.com/PyO3/maturin-action

Stars
98
Language
TypeScript
Michael Robinson (michael.robinson@astronomer.io)
2023-10-03 10:56:03

@channel We released OpenLineage 1.3.1!
Added:
• Airflow: add some basic stats to the Airflow integration #1845 @harels
• Airflow: add columns as schema facet for airflow.lineage.Table (if defined) #2138 @erikalfthan
• DBT: add SQLSERVER to supported dbt profile types #2136 @erikalfthan
• Spark: support for latest 3.5 #2118 @pawel-big-lebowski
Fixed:
• Airflow: fix find-links path in tox #2139 @JDarDagran
• Airflow: add more graceful logging when no OpenLineage provider installed #2141 @JDarDagran
• Spark: fix bug in PathUtils’ prepareDatasetIdentifierFromDefaultTablePath (CatalogTable) to correctly preserve scheme from CatalogTable’s location #2142 @d-m-h
Thanks to all the contributors, including new contributor @Erik Alfthan!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.3.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.2.2...1.3.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👍 Jason Yip, Peter Hicks, Peter Huang, Mars Lan (Metaphor)
🎉 Sheeri Cabral (Collibra)
Mars Lan (Metaphor) (mars@metaphor.io)
2023-10-04 07:42:59

*Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?

Labels
documentation, client/python
Comments
4
Matthew Paras (matthewparas2020@u.northwestern.edu)
2023-10-03 12:34:57

Hey everyone - does anyone have a good mechanism for alerting on issues with OpenLineage? For example, maybe alerting when an event times out - perhaps to Prometheus or some other kind of generic endpoint? Not sure of the best approach here (or whether the META-INF extension mechanism would be able to achieve it)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-04 03:01:02

*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.

I would try using the FluentD proxy we have (https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd) to copy the event stream (alerting is just one of the use cases for lineage events) and write a fluentd plugin to send it asynchronously on to an alerting service like PagerDuty.

It looks cool to me, but I never had enough time to test this approach.
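For illustration, a minimal sketch of what such an alerting consumer could look like (this is not an official OL component; Flask, the endpoint path, and the PagerDuty routing key are all assumptions) - it accepts OL events over HTTP, e.g. behind the fluentd copy output, and triggers an alert on FAIL events:
```
from flask import Flask, request
import requests

app = Flask(__name__)
PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"

@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    event = request.get_json(force=True)
    # Alert only on failed runs; timeouts could be detected by tracking
    # runs that START but never COMPLETE/FAIL within some window.
    if event.get("eventType") == "FAIL":
        requests.post(PAGERDUTY_URL, json={
            "routing_key": "YOUR_ROUTING_KEY",  # placeholder
            "event_action": "trigger",
            "payload": {
                "summary": f"OpenLineage FAIL: {event['job']['namespace']}.{event['job']['name']}",
                "source": "openlineage",
                "severity": "error",
            },
        })
    return "", 200
```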

👍 Matthew Paras
Michael Robinson (michael.robinson@astronomer.io)
2023-10-05 14:44:14

@channel This month’s TSC meeting is next Thursday the 12th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• Airflow Summit recap
• tutorial: migrating to the Airflow Provider
• discussion topic: observability for OpenLineage/Marquez
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

openlineage.io
👀 Sheeri Cabral (Collibra), Julian LaNeve, Peter Hicks
Julien Le Dem (julien@apache.org)
2023-10-05 20:40:40

The Marquez meetup in San Francisco is happening right now! https://www.meetup.com/meetup-group-bnfqymxe/events/295444209/

Meetup
🎉 Paweł Leszczyński, Rodrigo Maia
Mars Lan (Metaphor) (mars@metaphor.io)
2023-10-06 07:19:01

@Michael Robinson can we cut a new release to include this change? • https://github.com/OpenLineage/OpenLineage/pull/2151

Labels
documentation, client/python
Comments
6
➕ Harel Shein, Jakub Dardziński, Julien Le Dem, Michael Robinson, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-10-06 19:16:02

*Thread Reply:* Thanks for requesting a release, @Mars Lan (Metaphor). It has been approved and will be initiated within 2 business days of next Monday.

🙏 Mars Lan (Metaphor)
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-08 23:59:36

@here I am trying out the OpenLineage Spark integration on Databricks. No events are getting emitted from OpenLineage; I see logs saying "OpenLineage Event Skipped". I am attaching the notebook that I am trying to run and the cluster logs. Can someone kindly help me with this?

Jason Yip (jasonyip@gmail.com)
2023-10-09 00:02:10

*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above that, and the events will only show up once in a blue moon

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 00:04:38

*Thread Reply:* ohh, thanks for the information @Jason Yip. I am trying with Databricks 13.3 and Spark 3.4.1; will try a lower version as you suggested. Is there any issue tracking this bug, @Jason Yip?

Jason Yip (jasonyip@gmail.com)
2023-10-09 00:06:06

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

Labels
integration/spark, integration/databricks
Comments
2
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 00:11:54

*Thread Reply:* tried with Databricks 12.2 --> Spark 3.3.2, still the same behaviour: no events getting emitted

Jason Yip (jasonyip@gmail.com)
2023-10-09 00:12:35

*Thread Reply:* you can do 11.3, its the most stable one I know

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 00:12:46

*Thread Reply:* sure, let me try that out

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 00:31:51

*Thread Reply:* still the same problem…the jar that I am using is the latest openlineage-spark-1.3.1.jar, do you think that could be the problem?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 00:43:59

*Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue; it seems like some events are being skipped

Jason Yip (jasonyip@gmail.com)
2023-10-09 01:47:20

*Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-09 04:31:12

*Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure OpenLineage and what is your job doing?

We do have a bunch of integration tests on the Databricks platform available here, and they're passing on Databricks runtime 13.0.x-scala2.12.

Could you also try running the same code as our test does (this one)? If you run it and see OL events, this will assure us that your config is OK and we can continue further debugging.

Looking at your Spark script: could you save your dataset and see if you still don't see any events?

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 05:06:41

*Thread Reply:*
```
babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row: row[0]).collect()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
```

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 05:08:09

*Thread Reply:* this is the script that i am running @Paweł Leszczyński…kindly let me know if i’m doing any mistake. I have added the init script at the cluster level and from the logs i could see that openlineage is configured as i see a log statement

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-09 05:10:30

*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-09 05:11:02

*Thread Reply:* this is also a potential reason why you can't see any events

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-09 05:14:33

*Thread Reply:* ohh…okk, will try out the test script that you have mentioned above. Kindly correct me if my understanding is right: if there are a few transformations and the data is finally written somewhere, that is when the OL events are expected to be emitted?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-09 05:16:54

*Thread Reply:* yes. The main purpose of lineage is to track dependencies between datasets, where a job reads from dataset A and writes to dataset B. In the case of a Databricks notebook that does show or collect and prints some query result on the screen, there may be no reason to track it in the sense of lineage.
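For example, a variant of the earlier snippet that writes its result out (the output path is hypothetical) should produce START/COMPLETE events with both inputs and outputs:
```
babynames = (spark.read.format("csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load("dbfs:/FileStore/babynames.csv"))

# Writing to a target dataset gives OpenLineage an output to report.
(babynames.filter(babynames.Year == 2014)
    .write.mode("overwrite")
    .format("delta")
    .save("dbfs:/FileStore/babynames_2014"))  # hypothetical output path
```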

Michael Robinson (michael.robinson@astronomer.io)
2023-10-09 15:25:14

@channel We released OpenLineage 1.4.1!
Additions:
• Client: allow setting client’s endpoint via environment variable 2151 @Mars Lan (Metaphor)
• Flink: expand Iceberg source types 2149 @Peter Huang
• Spark: add debug facet 2147 @Paweł Leszczyński
• Spark: enable Nessie REST catalog 2165 @julwin
Thanks to all the contributors, especially new contributors @Peter Huang and @julwin!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.4.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.3.1...1.4.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

👍 Jason Yip, Ross Turk, Mars Lan (Metaphor), Harel Shein, Rodrigo Maia
Drew Bittenbender (drew@salt.io)
2023-10-09 16:55:35

Hello. I am getting started with OL and Marquez with dbt. I am using dbt-ol. The namespace of the dataset showing up in Marquez is not the namespace I provide using OPENLINEAGE_NAMESPACE. It happens to be the same as the source in Marquez, which is the Snowflake account URI. It's obviously picking up the other env variable, OPENLINEAGE_URL, so I am pretty sure it's not the environment. Is this expected?

Michael Robinson (michael.robinson@astronomer.io)
2023-10-09 18:56:13

*Thread Reply:* Hi Drew, thank you for using OpenLineage! I don’t know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.
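To illustrate with a made-up event fragment: the job namespace comes from your OPENLINEAGE_NAMESPACE, while each dataset's namespace comes from its datasource (here, a Snowflake account URI):
```
event = {
    "job": {"namespace": "my-dbt-namespace", "name": "dbt-run-my_model"},
    "outputs": [{
        "namespace": "snowflake://xy12345.us-east-1",  # from the datasource, not OPENLINEAGE_NAMESPACE
        "name": "ANALYTICS.PUBLIC.ORDERS",
    }],
}
```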

👍 Drew Bittenbender
Jason Yip (jasonyip@gmail.com)
2023-10-10 01:05:11

Any idea why "environment-properties" is gone in Spark 3.4+ in StartEvent?

Jason Yip (jasonyip@gmail.com)
2023-10-10 20:53:59

example:

{"environment_properties":{"spark.databricks.clusterUsageTags.clusterName":"<a href="mailto:jason.yip@tredence.com">jason.yip@tredence.com</a>'s Cluster","spark.databricks.job.runId":"","spark.databricks.job.type":"","spark.databricks.clusterUsageTags.azureSubscriptionId":"a4f54399_8db8_4849_adcc_a42aed1fb97f","spark.databricks.notebook.path":"/Repos/jason.yip@tredence.com/segmentation/01_Data Prep","spark.databricks.clusterUsageTags.clusterOwnerOrgId":"4679476628690204","MountPoints":[{"MountPoint":"/databricks-datasets","Source":"databricks_datasets"},{"MountPoint":"/Volumes","Source":"UnityCatalogVolumes"},{"MountPoint":"/databricks/mlflow-tracking","Source":"databricks/mlflow-tracking"},{"MountPoint":"/databricks-results","Source":"databricks_results"},{"MountPoint":"/databricks/mlflow-registry","Source":"databricks/mlflow-registry"},{"MountPoint":"/Volume","Source":"DbfsReserved"},{"MountPoint":"/volumes","Source":"DbfsReserved"},{"MountPoint":"/","Source":"DatabricksRoot"},{"MountPoint":"/volume","Source":"DbfsReserved"}],"User":"<a href="mailto:jason.yip@tredence.com">jason.yip@tredence.com</a>","UserId":"4768657035718622","OrgId":"4679476628690204"}}

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-11 03:46:13

*Thread Reply:* Is this related to any OL version? In OL 1.2.2 we've added an extra variable, spark.databricks.clusterUsageTags.clusterAllTags, to be captured, but this should not break things.

I think we're facing some issues on recent databricks runtime versions. Here is an issue for this: https://github.com/OpenLineage/OpenLineage/issues/2131

Is the problem you describe specific to some databricks runtime versions?

Labels
integration/spark, integration/databricks
Jason Yip (jasonyip@gmail.com)
2023-10-11 11:17:06

*Thread Reply:* yes, exactly Spark 3.4+

Jason Yip (jasonyip@gmail.com)
2023-10-11 21:12:27

*Thread Reply:* Btw, I don't understand the code flow entirely. If we are talking about a different classpath only: I see there's a Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.

I am happy to jump on a call to show you if needed

Jason Yip (jasonyip@gmail.com)
2023-10-16 02:58:56

*Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?

```
/**
 * We get exact copies of OL events for org.apache.spark.scheduler.SparkListenerJobStart and
 * org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart. The same happens for end
 * events.
 *
 * @return
 */
private boolean isOnJobStartOrEnd(SparkListenerEvent event) {
  return event instanceof SparkListenerJobStart || event instanceof SparkListenerJobEnd;
}
```

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-10 23:43:39

@here I am trying out the Databricks Spark integration, and in one of the events I am getting an OpenLineage event where the output dataset has a facet called symlinks. The statement that generated this event is this SQL:
```
CREATE TABLE IF NOT EXISTS covid_research.covid_data
USING CSV
LOCATION 'abfss://oltptestdata@jeevanacceldata.dfs.core.windows.net/testdata/johns-hopkins-covid-19-daily-dashboard-cases-by-states.csv'
OPTIONS (header "true", inferSchema "true");
```
Can someone kindly let me know what this symlinks facet is? I tried reading the spec but did not get it completely.

Jason Yip (jasonyip@gmail.com)
2023-10-10 23:44:53

*Thread Reply:* I use it to get the table with the database name
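Something along these lines (a sketch; event is an OL run event parsed into a dict):
```
def qualified_table_names(event):
    """Return database-qualified table names from the symlinks facet."""
    names = []
    for ds in event.get("outputs", []):
        symlinks = ds.get("facets", {}).get("symlinks", {})
        for ident in symlinks.get("identifiers", []):
            if ident.get("type") == "TABLE":
                names.append(ident["name"])  # e.g. "covid_research.uscoviddata"
    return names
```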

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-10 23:47:15

*Thread Reply:* so can I think of it like this: if there is a symlink, then that table is kind of a reference to the original dataset?

Jason Yip (jasonyip@gmail.com)
2023-10-11 01:25:44

*Thread Reply:* yes

🙌 Paweł Leszczyński
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-11 06:55:58

@here When I am running this SQL as part of a Databricks notebook, I am receiving an OL event where I see only an output dataset: there is no input dataset or a symlink facet inside the dataset to map it to the underlying Azure storage object. Can anyone kindly help with this?
```
spark.sql(f"CREATE TABLE IF NOT EXISTS covid_research.uscoviddata USING delta LOCATION 'abfss://oltptestdata@jeevanacceldata.dfs.core.windows.net/testdata/modified-delta'")
```
```
{
  "eventTime": "2023-10-11T10:47:36.296Z",
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
  "eventType": "COMPLETE",
  "run": {
    "runId": "d0f40be9-b921-4c84-ac9f-f14a86c29ff7",
    "facets": {
      "spark.logicalPlan": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet",
        "plan": [
          {
            "class": "org.apache.spark.sql.catalyst.plans.logical.CreateTable",
            "num-children": 1,
            "name": 0,
            "tableSchema": [],
            "partitioning": [],
            "tableSpec": null,
            "ignoreIfExists": true
          },
          {
            "class": "org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier",
            "num-children": 0,
            "catalog": null,
            "identifier": null
          }
        ]
      },
      "spark_version": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet",
        "spark-version": "3.3.0",
        "openlineage-spark-version": "1.2.2"
      },
      "processing_engine": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/facets/1-1-0/ProcessingEngineRunFacet.json#/$defs/ProcessingEngineRunFacet",
        "version": "3.3.0",
        "name": "spark",
        "openlineageAdapterVersion": "1.2.2"
      }
    }
  },
  "job": {
    "namespace": "default",
    "name": "adb-3942203504488904.4.azuredatabricks.net.create_table.covid_research_db_uscoviddata",
    "facets": {}
  },
  "inputs": [],
  "outputs": [
    {
      "namespace": "dbfs",
      "name": "/user/hive/warehouse/covid_research.db/uscoviddata",
      "facets": {
        "dataSource": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet",
          "name": "dbfs",
          "uri": "dbfs"
        },
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet",
          "fields": []
        },
        "storage": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/StorageDatasetFacet.json#/$defs/StorageDatasetFacet",
          "storageLayer": "unity",
          "fileFormat": "parquet"
        },
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet",
          "identifiers": [
            {
              "namespace": "/user/hive/warehouse/covid_research.db",
              "name": "covid_research.uscoviddata",
              "type": "TABLE"
            }
          ]
        },
        "lifecycleStateChange": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.2.2/integration/spark",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet",
          "lifecycleStateChange": "CREATE"
        }
      },
      "outputFacets": {}
    }
  ]
}
```

Damien Hawes (damien.hawes@booking.com)
2023-10-11 06:57:46

*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. Its incredibly distracting to have Slack notify me of a message that isn't pertinent to me.

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-11 06:58:50

*Thread Reply:* sure noted @Damien Hawes

Damien Hawes (damien.hawes@booking.com)
2023-10-11 06:59:34

*Thread Reply:* Thank you!

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-10-11 12:04:24

Hi there, I am trying to make an API call to get column-lineage information. Could you please let me know the URL construct to retrieve it? As per the API documentation I am passing the following URL to GET column-lineage: http://localhost:5000/api/v1/column-lineage but getting error code 400. Thanks

Willy Lulciuc (willy@datakin.com)
2023-10-12 13:55:26

*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you’ve seeded Marquez with test metadata, you can use: curl -XGET "<http://localhost:5002/api/v1/column-lineage?nodeId=datasetField%3Afood_delivery%3Apublic.delivery_7_days%3Acustomer_email>" You can view the API docs for column lineage here!

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-10-17 05:57:36

*Thread Reply:* Thanks Willy. The documentation says 'namespace', so I constructed the API call like this: 'http://marquez-web:3000/api/v1/column-lineage/nodeId=datasetField:file:/home/jovyan/Downloads/event_attribute.csv:eventType' but it is still not working 😞

Madhav Kakumani (madhav.kakumani@6point6.co.uk)
2023-10-17 06:07:06

*Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>
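e.g., a sketch of building the request in Python (the namespace/dataset/field values are placeholders; note the nodeId goes into the nodeId query parameter and should be URL-encoded):
```
from urllib.parse import quote

namespace = "food_delivery"          # placeholder
dataset = "public.delivery_7_days"   # placeholder
field = "customer_email"             # placeholder

node_id = f"datasetField:{namespace}:{dataset}:{field}"
url = f"http://localhost:5000/api/v1/column-lineage?nodeId={quote(node_id, safe='')}"
```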

Michael Robinson (michael.robinson@astronomer.io)
2023-10-11 13:00:01

@channel Friendly reminder: this month’s TSC meeting, open to all, is tomorrow at 10 am PT: https://openlineage.slack.com/archives/C01CK9T7HKR/p1696531454431629

Michael Robinson (https://openlineage.slack.com/team/U02LXF3HUN7)
Michael Robinson (michael.robinson@astronomer.io)
2023-10-11 14:26:45

*Thread Reply:* Newly added discussion topics:
• a proposal to add a Registry of Consumers and Producers
• a dbt issue to add OpenLineage Dataset names to the Manifest
• a proposal to add Dataset support in Spark LogicalPlan Nodes
• a proposal to institute a certification process for new integrations

Jason Yip (jasonyip@gmail.com)
2023-10-12 15:08:34

This might be a dumb question: I guess I need to set up local Spark in order for the Spark tests to run successfully?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-13 01:56:19
Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-13 06:41:56

*Thread Reply:* when trying to install openlineage-java locally via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, I am receiving this error:
```
> Task :signMavenJavaPublication FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':signMavenJavaPublication'.
> Cannot perform signing task ':signMavenJavaPublication' because it has no configured signatory
```

Jason Yip (jasonyip@gmail.com)
2023-10-13 13:35:06

*Thread Reply:* @Paweł Leszczyński this is what I am getting

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-16 03:02:13

*Thread Reply:* which java are you using? what is your operation system (is it windows?)?

Jason Yip (jasonyip@gmail.com)
2023-10-16 03:35:18

*Thread Reply:* yes it is Windows, i downloaded java 8 but I can try to build it with Linux subsystem or Mac

Guntaka Jeevan Paul (jeevan@acceldata.io)
2023-10-16 03:35:51

*Thread Reply:* In my case it is Mac

Jason Yip (jasonyip@gmail.com)
2023-10-16 03:56:09

*Thread Reply:*
```
* Where:
Build file '/mnt/c/Users/jason/Downloads/github/OpenLineage/integration/spark/build.gradle' line: 9

* What went wrong:
An exception occurred applying plugin request [id: 'com.adarshr.test-logger', version: '3.2.0']
> Failed to apply plugin [id 'com.adarshr.test-logger']
   > Could not generate a proxy class for class com.adarshr.gradle.testlogger.TestLoggerExtension.

* Try:
```

Jason Yip (jasonyip@gmail.com)
2023-10-16 03:56:23

*Thread Reply:* tried with Linux subsystem

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-16 04:04:29

*Thread Reply:* we don't have any restrictions for windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on circle CI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea

Jason Yip (jasonyip@gmail.com)
2023-10-16 04:13:04

*Thread Reply:*
```
... 111 more
Caused by: java.lang.ClassNotFoundException: org.gradle.api.provider.HasMultipleValues
... 117 more
```

Jason Yip (jasonyip@gmail.com)
2023-10-17 00:26:07

*Thread Reply:* @Paweł Leszczyński now I am using gradlew instead of gradle on Windows because the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop, and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

Jason Yip (jasonyip@gmail.com)
2023-10-21 23:33:48

*Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem

Jason Yip (jasonyip@gmail.com)
2023-10-22 13:08:40

*Thread Reply:* Now getting class not found despite build and test succeeding

Jason Yip (jasonyip@gmail.com)
2023-10-22 21:46:23

*Thread Reply:* I uploaded the wrong jar.. there are so many jars; only the jar in the spark folder works, not the ones in the subfolders

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-13 02:48:40

Hi team, I am running the following PySpark code in a cell (the table name and paths are blanked out, as in the original):
```
print("SELECTING 100 RECORDS FROM METADATA TABLE")
df = spark.sql("""select * from

limit 100""")

print("WRITING (1) 100 RECORDS FROM METADATA TABLE")
df.write.mode("overwrite").format("delta").save("")
df.createOrReplaceTempView("temp_metadata")

print("WRITING (2) 100 RECORDS FROM METADATA TABLE")
df.write.mode("overwrite").format("delta").save("")

print("READING (1) 100 RECORDS FROM METADATA TABLE")
df_read = spark.read.format("delta").load("")
df_read.createOrReplaceTempView("metadata_1")

print("DOING THE MERGE INTO SQL STEP!")
df_new = spark.sql("""
    MERGE INTO metadata_1
    USING temp_metadata
    ON metadata_1.id = temp_metadata.id
    WHEN MATCHED THEN UPDATE SET
        metadata_1.id = temp_metadata.id,
        metadata_1.aspect = temp_metadata.aspect
    WHEN NOT MATCHED THEN INSERT (id, aspect) VALUES (temp_metadata.id, temp_metadata.aspect)
""")
```
I am running with debug log levels. I actually don't see any of the events being logged for SaveIntoDataSourceCommand or the MergeIntoCommand, but OL is in fact emitting events to the backend. It seems like the events are just not being logged... I actually observe this for all Delta-table-related Spark SQL queries...

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-16 00:01:42

*Thread Reply:* Hi @Paweł Leszczyński, is this expected? CMIIW, but we should expect to see the events being logged when running with debug log level, right?

Damien Hawes (damien.hawes@booking.com)
2023-10-16 04:17:30

*Thread Reply:* It's impossible to know without seeing how you've configured the listener.

Can you show this configuration?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-17 03:15:20

*Thread Reply:*
```
spark.openlineage.transport.url <url>
spark.openlineage.transport.endpoint /<endpoint>
spark.openlineage.transport.type http
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.facets.custom_environment_variables [BUNCH_OF_VARIABLES;]
spark.openlineage.facets.disabled [spark_unknown\;spark.logicalPlan]
```
These are my Spark configs... I'm setting the log level to debug with sc.setLogLevel("DEBUG")

Damien Hawes (damien.hawes@booking.com)
2023-10-17 04:40:03

*Thread Reply:* Two things:

  1. If you want debug logs, you're going to have to provide a log4j.properties file or log4j2.properties file, depending on the version of Spark you're running. In that file, you will need to configure the logging levels. If I am not mistaken, sc.setLogLevel controls ONLY the log levels of Spark-namespaced components (i.e., org.apache.spark).
  2. You're telling the listener to emit to a URL. If you want to see the events emitted to the console, then set spark.openlineage.transport.type=console and remove the other spark.openlineage.transport.* configurations.

Do either (1) or (2).
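For (2), a minimal sketch of the session config (assumes the openlineage-spark jar is already on the classpath):
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ol-console-debug")
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .config("spark.openlineage.transport.type", "console")  # events go to the driver log
         .getOrCreate())
```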
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-20 00:49:45

*Thread Reply:* @Damien Hawes Hi, sorry for the late reply.

  1. Enabling sc.setLogLevel does actually enable debug logs from OpenLineage. I can see the events and everything being logged if I save in parquet format instead of delta.
  2. I do want to emit events to the URL. But I would like to see exactly which events are being emitted for some specific jobs, since I see that the lineage is incorrect for some MergeInto cases.
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-26 04:56:50

*Thread Reply:* Hi @Damien Hawes, I would like to check again whether you have any thoughts about this... Thanks! 🙂

Rodrigo Maia (rodrigo.maia@manta.io)
2023-10-17 03:17:57

Hello All 👋! We are currently trying to get the Spark integration for OpenLineage working in our Databricks instance. The general setup is done and working, with a few hiccups here and there. But one thing we are still struggling with is how to link all Spark job events to a Databricks job or a notebook run. We've recently noticed that some of the events produced by OL have the "environment-properties" attribute with information (for our context) regarding the notebook path (if it is a notebook run), or the job run ID (if it's a Databricks job run). But the thing is that these attributes are not always present. I ran some samples yesterday for a job with 4 notebook tasks. Of all 20 JSON payloads sent by the OL listener, only 3 presented the "environment-properties" attribute. It's not only happening with Databricks jobs. When I run single notebooks and each cell has its own set of Spark jobs, not all JSON events presented that property either.

So my question is: what is the criteria for these attributes to be present or not in the event JSON? Or maybe this is an issue? @Jason Yip did you find out anything about this?

⚙️ Spark 3.4 / OL-Spark 1.4.1

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-17 06:55:47

*Thread Reply:* In general, we assume that OL events per run are cumulative. So, if you have 20 events with the same runId, then even if only a single event contains some facet, we consider this OK and let the backend combine them together. That's what we do in the Marquez project (a reference backend architecture for OL), and that's why it is worth using Marquez as a REST API backend.

Are you able to use the job namespace to aggregate all the Spark actions run within the Databricks notebook? This is something that should serve this purpose.

Jason Yip (jasonyip@gmail.com)
2023-10-17 12:48:33

*Thread Reply:* @Rodrigo Maia for Spark 3.4 I don't see the environment-properties showing up at all, but if you run the code as-is, register a listener on SparkListenerJobStart and get the properties, all of those properties will show up. There's an event filter that filters out the SparkListenerJobStart; I suspect that filters out the "unnecessary" events.. I was trying to do a custom build to check that, but I'm still trying to set up Hadoop and Spark on my local machine

Rodrigo Maia (rodrigo.maia@manta.io)
2023-10-18 05:23:16

*Thread Reply:* @Paweł Leszczyński you are right. This is what we are doing as well, combining events with the same runId to process the information on our backend. But even so, there are several runIds without this information. I went through these events to get a better view of what was happening. As you can see, of 7 runIds, only 3 were showing the "environment-properties" attribute. Some condition is not being met here, or maybe it is what @Jason Yip suspects and there's some sort of filtering of unnecessary events

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-19 02:28:03

*Thread Reply:* @Rodrigo Maia, If you are able to provide a small Spark script such that none of the OL events contain the environment-properties, but at least one should, please raise an issue for this.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-19 02:29:11

*Thread Reply:* It's extremely helpful when the community opens issues that are not only described well, but also contain the small piece of code needed to reproduce them.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-10-19 02:59:39

*Thread Reply:* I know. that's the goal. that is why I wanted to understand in the first place if there was any condition preventing this from happening, but now i get that this is not expected behaviour.

👍 Paweł Leszczyński
Jason Yip (jasonyip@gmail.com)
2023-10-19 13:44:00
Jason Yip (jasonyip@gmail.com)
2023-10-19 14:49:03

*Thread Reply:* Please note that I am getting the same behavior: no code is needed, Spark 3.4+ won't generate the attribute no matter what. I have been testing the same code for 2 months from this issue: https://github.com/OpenLineage/OpenLineage/issues/2124

I tried the code without OL and it worked perfectly, so it is OL filtering out the event for sure. I will try posting the code I use to collect the properties.

Labels
integration/spark, integration/databricks
Comments
3
Jason Yip (jasonyip@gmail.com)
2023-10-19 23:46:17

*Thread Reply:* this code proves that the properties are still there; somehow they got filtered out by OL:

```
%scala
import org.apache.spark.scheduler._

class JobStartListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Extract properties here
    val jobId = jobStart.jobId
    val stageInfos = jobStart.stageInfos
    val properties = jobStart.properties

    // You can print properties or save them somewhere
    println(s"JobId: $jobId, Stages: ${stageInfos.size}, Properties: $properties")
  }
}

val listener = new JobStartListener()
spark.sparkContext.addSparkListener(listener)

val df = spark.range(1000).repartition(10)
df.count()
```

Jason Yip (jasonyip@gmail.com)
2023-10-19 23:55:05

*Thread Reply:* of course feel free to test this logic as well; it still works -- if not for the filtering:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

Rodrigo Maia (rodrigo.maia@manta.io)
2023-10-30 04:46:16

*Thread Reply:* Any ideas on how could i test it?

ankit jain (ankit.goods10@gmail.com)
2023-10-17 22:57:03

Hello All, I am completely new to OpenLineage. I have to set up a lab to conduct a POC on various aspects like lineage, metadata management, etc. As per the OpenLineage site, I tried downloading Ubuntu, Docker, and the binary files for Marquez, but I am lost somewhere and unable to configure the whole setup. Can someone please assist with the steps to start from scratch so that I can delve into the OpenLineage capabilities? Many thanks

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-18 01:32:01

*Thread Reply:* hey, did you try to follow one of these guides? https://openlineage.io/docs/guides/about

Michael Robinson (michael.robinson@astronomer.io)
2023-10-18 09:14:08

*Thread Reply:* Which guide were you using, and what errors/issues are you encountering?

ankit jain (ankit.goods10@gmail.com)
2023-10-21 15:43:14

*Thread Reply:* Thanks Jakub for the response.

ankit jain (ankit.goods10@gmail.com)
2023-10-21 15:45:42

*Thread Reply:* In Docker, the marquez-api image is not running and is exiting with exit code 127.

Michael Robinson (michael.robinson@astronomer.io)
2023-10-22 09:34:53

*Thread Reply:* @ankit jain thanks. I don't recognize 127, but 9 times out of 10 if the API or DB container fails the reason is a port conflict. Have you checked if port 5000 is available?
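One quick way to check (a Python sketch; assumes the default localhost:5000):
```
import socket

with socket.socket() as s:
    in_use = s.connect_ex(("localhost", 5000)) == 0
print("port 5000 already in use:", in_use)
```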

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-22 09:54:10

*Thread Reply:* could you please check what’s the output of git config --get core.autocrlf or git config --global --get core.autocrlf ?

ankit jain (ankit.goods10@gmail.com)
2023-10-24 08:09:14

*Thread Reply:* @Michael Robinson thanks, I checked and port 5000 is not available. I tried deleting the Docker images and recreating them, but the same issue persists, stating /usr/bin/env bash\r not found. The Gradle build is successful.

ankit jain (ankit.goods10@gmail.com)
2023-10-24 08:09:54

*Thread Reply:* @Jakub Dardziński thanks, the first command returned true and the second command returned nothing

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-24 08:15:57

*Thread Reply:* are you running docker and git in Windows or Mac OS before 10.0?

Matthew Paras (matthewparas2020@u.northwestern.edu)
2023-10-19 15:00:42

Hey all - we've been noticing that some events go unreported by OpenLineage (Spark) when the AsyncEventQueue fills up and starts dropping events. Wondering if anyone has experienced this before and knows why it is happening? We've expanded the event queue capacity and thrown more hardware at the problem, but no dice.

Also, as a note, the query plans from this job are pretty big - could the listener just be choking up? Happy to open a GitHub issue as well if we suspect that it could be the listener itself having issues.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-10-20 02:57:50

*Thread Reply:* Hi, just checking, are you excluding the sparkPlan from the events? Or is it sending the spark plan too

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-23 11:59:40

*Thread Reply:* yeah - setting spark.openlineage.facets.disabled to [spark_unknown;spark.logicalPlan] should help
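For example, as a session config (a sketch; the rest of the listener setup is omitted):
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
         .getOrCreate())
```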

Matthew Paras (matthewparas2020@u.northwestern.edu)
2023-10-24 17:50:26

*Thread Reply:* sorry for the late reply - turns out this job is just whack 😄 we were going in circles trying to figure it out; we end up dropping events without OpenLineage enabled at all. But good to know that disabling the logical plan should speed us up if we run into this again

praveen kanamarlapudi (kpraveen420@gmail.com)
2023-10-20 18:18:37

Hi,

We are using the OpenLineage Spark connector. We have used Spark 3.2 and Scala 2.12 so far. We triggered a new job with Spark 3.4 and Scala 2.13 and faced the below exception:

```
java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.map(scala.Function1)'
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildInputDatasets$6(OpenLineageRunEventBuilder.java:341)
    at java.base/java.util.Optional.map(Optional.java:265)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildInputDatasets(OpenLineageRunEventBuilder.java:339)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:295)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:279)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:222)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:72)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91)
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-23 04:56:25

*Thread Reply:* Hmm, that is interesting. Did it occur on a Databricks runtime? Could you give it a try with Scala 2.12? I think we don't test Scala 2.13.

praveen kanamarlapudi (kpraveen420@gmail.com)
2023-10-23 12:02:13

*Thread Reply:* I believe our Scala 2.12 jobs are working fine. It's not databricks runtime. We run Spark on Kube.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-24 06:47:14

*Thread Reply:* Ok. I think you can raise an issue to support Scala 2.13 for the latest Spark versions.

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:57:55

*Thread Reply:* Yeah - this just hit me yesterday.

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:58:29

*Thread Reply:* I've created a ticket for it, it wasn't a fun surprise, that's for sure.

priya narayana (n.priya88@gmail.com)
2023-10-26 06:13:40

Hi, I want to customise the events which come from the OpenLineage Spark integration. Can someone give me some information?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-26 07:45:41

*Thread Reply:* Hi @priya narayana, please get familiar with Extending section on our docs: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

priya narayana (n.priya88@gmail.com)
2023-10-26 09:53:07

*Thread Reply:* Okay, thank you. Just checking if there are any other docs or git code which could also help me

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:11:17

Hello Team

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:12:38

I'm upgrading from openlineage-airflow==0.24.0 to openlineage-airflow 1.4.1 but I'm seeing the following error; any help is appreciated

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:14:02

*Thread Reply:* @Jakub Dardziński any thoughts?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:14:24

*Thread Reply:* what version of Airflow are you using?

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:14:52

*Thread Reply:* 2.6.3 that satisfies the requirement

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:16:38

*Thread Reply:* is it possible you have some custom operator?

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:17:15

*Thread Reply:* I think it's the BaseOperator causing the issue

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:17:36

*Thread Reply:* so no i believe

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:18:43

*Thread Reply:* BaseOperator is parent class for any other operators, it defines how to do deepcopy

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:19:11

*Thread Reply:* yeah, so it's controlled by Airflow itself; I didn't customize it

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:19:49

*Thread Reply:* uhm, maybe it's possible you could share dag code? you may hide sensitive data

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:21:23

*Thread Reply:* let me try with lower versions of openlineage, what do you say?

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:21:39

*Thread Reply:* it's a big jump from 0.24.0 to 1.4.1

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:22:25

*Thread Reply:* but i will help here to investigate this issue

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:24:03

*Thread Reply:* for me it seems that within dag or task you're defining some object that is not easy to copy
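To illustrate the kind of attribute that trips this up (a hypothetical example, not your DAG) - a module object stored on a task or DAG cannot be deep-copied:
```
import copy
import json  # stands in for any module object

class Task:  # stand-in for an operator holding arbitrary attributes
    pass

t = Task()
t.helper = json       # a module stored on the task/DAG
copy.deepcopy(t)      # raises TypeError: cannot pickle 'module' object
```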

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:26:05

*Thread Reply:* possible, but with 0.24.0 that issue is not occurring, so the worry is that the version upgrade could potentially break things

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 13:39:34

*Thread Reply:* 0.24.0 is not that old 🤔

harsh loomba (hloomba@upgrade.com)
2023-10-26 13:45:07

*Thread Reply:* I see the issue with 0.24.0 too, but there it surfaces only as a WARNING. The relevant part of the traceback:
```
File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
  self.run()
File "/usr/lib64/python3.8/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File "/home/upgrade/.local/lib/python3.8/site-packages/openlineage/airflow/listener.py", line 89, in on_running
  task_instance_copy = copy.deepcopy(task_instance)
...
File "/home/upgrade/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 2162, in __deepcopy__
  setattr(result, k, copy.deepcopy(v, memo))
...
File "/home/upgrade/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 1224, in __deepcopy__
  setattr(result, k, copy.deepcopy(v, memo))
...
File "/usr/lib64/python3.8/copy.py", line 161, in deepcopy
  rv = reductor(4)
TypeError: cannot pickle 'module' object
```
but with 1.4.1 it stops processing any further and throws an error

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:18:08

*Thread Reply:* I see the difference in calling between these 2 versions: the current version checks if Airflow is >2.6 and then runs on_running directly, but the earlier version ran it on a separate thread. Is this what's raising this exception?

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:24:49

*Thread Reply:* this is the issue - https://github.com/OpenLineage/OpenLineage/blob/c343835c1664eda94d5c315897ae6702854c81bd/integration/airflow/openlineage/airflow/listener.py#L89 while copying the task

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:25:21

*Thread Reply:* since we are running it directly if version>2.6.0, it's throwing the error in the main processing

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:28:02

*Thread Reply:* may I know which Airflow versions this process was tested on?

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:28:39

*Thread Reply:* im on 2.6.3

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 14:30:53

*Thread Reply:* 2.1.4, 2.2.4, 2.3.4, 2.4.3, 2.5.2, 2.6.1 - usually there are not too many changes between minor versions

I still believe it might be some code you could improve; it is probably also an antipattern in Airflow

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:34:26

*Thread Reply:* hummm...that's a valid observation, but I don't write the DAGs, other teams do, so imagine if many people wrote such DAGs; I can't ask everyone to change their patterns, right? If something is running on the current OpenLineage version with a warning, it should still run on the upgraded version, shouldn't it?

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:38:04

*Thread Reply:* however, I see your point

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:49:52

*Thread Reply:* So that specific task has 570 lines of query, a pretty bulky query; let me split it into smaller units

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:50:15

*Thread Reply:* that should help right? @Jakub Dardziński

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 14:51:27

*Thread Reply:* query length shouldn’t be the issue, rather any python code

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 14:51:50

*Thread Reply:* I get your point too, we might figure out some mechanism to skip irrelevant parts of task instance so that it doesn’t fail then

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:52:12

*Thread Reply:* actually it's failing on that task itself

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:52:33

*Thread Reply:* let me try it will be pretty quick

harsh loomba (hloomba@upgrade.com)
2023-10-26 14:58:58

*Thread Reply:* @Jakub Dardziński but you're right, we have to fix this on the OpenLineage side as well, because ideally OpenLineage shouldn't cause any issues in the main DAG processing

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-10-26 17:51:05

*Thread Reply:* it doesn’t break any airflow functionality, execution is wrapped into try/except block, only exception traceback is logged as you can see

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-27 05:25:54

*Thread Reply:* Can you migrate to Airflow 2.7 and use apache-airflow-providers-openlineage? Ideally we wouldn't make meaningful changes to openlineage-airflow

harsh loomba (hloomba@upgrade.com)
2023-10-27 11:35:44

*Thread Reply:* yup thats what im planning to do

harsh loomba (hloomba@upgrade.com)
2023-10-27 13:59:03

*Thread Reply:* referencing this conversation: https://openlineage.slack.com/archives/C01CK9T7HKR/p1698398754823079?thread_ts=1698340358.557159&cid=C01CK9T7HKR - what does it take to move from openlineage-airflow to the openlineage provider package? I'm updating Airflow to 2.7.2 and moving off openlineage-airflow to the provider package. I'm trying to estimate the amount of work it takes; any thoughts? Reading the changelogs I don't think it's too much of a change, but please share your thoughts, and if it's drafted somewhere please share that as well

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-30 08:21:10

*Thread Reply:* Generally not much - I would maybe think of a operator coverage. For example, for BigQuery old openlineage-airflow supports BigQueryExecuteQueryOperator. However, new apache-airflow-providers-openlineage supports BigQueryInsertJobOperator - because it's intended replacement for BigQueryExecuteQueryOperator and Airflow community does not want to accept contributions to deprecated operators.

🙏 harsh loomba
harsh loomba (hloomba@upgrade.com)
2023-10-31 15:00:38

*Thread Reply:* one question if someone is around - when I keep both openlineage-airflow and apache-airflow-providers-openlineage in my requirements file, I see the following error:
```
from openlineage.airflow.extractors import Extractors
ModuleNotFoundError: No module named 'openlineage.airflow'
```
any thoughts?

John Lukenoff (john@jlukenoff.com)
2023-10-31 15:37:07

*Thread Reply:* I would usually do a pip freeze | grep openlineage as a sanity check to validate that the module is actually installed. Not sure how the provider and the module play together though

harsh loomba (hloomba@upgrade.com)
2023-10-31 17:07:41

*Thread Reply:* yeah, so @John Lukenoff I'm not getting how I can use a specific extractor when I run my operator. Say, for example, I have a custom DataWarehouseOperator and I want to override get_openlineage_facets_on_start and get_openlineage_facets_on_complete using the Redshift extractor; how would I do that?
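For context, this is roughly the shape I'm after (a sketch assuming the provider's OperatorLineage API; the operator and dataset names are placeholders):
```
from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset

class DataWarehouseOperator(BaseOperator):  # hypothetical custom operator
    def __init__(self, sql, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql

    def execute(self, context):
        ...  # run the query against the warehouse

    def get_openlineage_facets_on_complete(self, task_instance):
        # placeholder datasets; a real implementation would derive these from self.sql
        return OperatorLineage(
            inputs=[Dataset(namespace="redshift://cluster:5439", name="db.schema.source_table")],
            outputs=[Dataset(namespace="redshift://cluster:5439", name="db.schema.target_table")],
        )
```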

Rodrigo Maia (rodrigo.maia@manta.io)
2023-10-27 05:49:25

Spark Integration Logs - Hey there! Are these events skipped because they're not supported, or is this configured somewhere?
```
23/10/27 08:25:58 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/10/27 08:25:58 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
```

Hitesh (splicer9904@gmail.com)
2023-10-27 08:12:32

Hi People, I actually want to intercept the OpenLineage Spark events right after the job ends and before they are emitted, so that I can add some extra information to the events or remove some information that I don't want. Is there any way of doing this? Can someone please help me?

Michael Robinson (michael.robinson@astronomer.io)
2023-10-30 09:03:57

*Thread Reply:* In general, I think this kind of use case is probably best served by facets, but what do you think @Paweł Leszczyński?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:01:12

Hello, has anyone run into a similar error as posted in this GitHub open issue [https://github.com/MarquezProject/marquez/issues/2468] while setting up Marquez on an EC2 instance? Would appreciate any help to get past the errors

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:04:30

*Thread Reply:* Hmm, have you looked over our Running on AWS docs?

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:06:08

*Thread Reply:* More specifically, the AWS RDS section. How are you deploying Marquez on Ec2?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:08:05

*Thread Reply:* we were primarily referencing this document on git - https://github.com/MarquezProject/marquez

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:09:05

*Thread Reply:* leveraged docker and docker-compose

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:13:10

*Thread Reply:* hmm so you’re running docker-compose up on an Ec2 instance you’ve ssh’d into? (just trying to understand your setup better)

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:13:26

*Thread Reply:* yes, thats correct

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:16:39

*Thread Reply:* I've only used docker compose for local dev or integration tests. But, ok, you're probably in the PoC phase. Can you run the docker cmd on your local machine successfully? What OS is installed on the Ec2 instance?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:18:00

*Thread Reply:* yes, i can run and the OS is Ubuntu 20.04.6 LTS

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:19:27

*Thread Reply:* we initially ran into a permission denied error related to the postgresql.conf file and we had to update the file permissions to 777, after which we started to see the below errors

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:19:36

*Thread Reply:*
```
marquez-db | 2023-10-27 20:35:52.512 GMT [35] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption
marquez-db | 2023-10-27 20:35:52.529 GMT [36] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption
```

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:20:12

*Thread Reply:* we then manually updated pg_hba.conf file to include host user and db details

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:20:42

*Thread Reply:* Did you also update the marquez.yml with the db user / password?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:20:48

*Thread Reply:* after which we started to see the errors posted in the github open issues page

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:21:33

*Thread Reply:* hmm are you using an external database or are you spinning up the entire Marquez stack with docker compose?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:21:56

*Thread Reply:* we are spinning up the entire Marquez stack with docker compose

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:23:24

*Thread Reply:* we did not change anything in the marquez.yml, i think we did not find that file in the github repo that we cloned into our local instance

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:26:31

*Thread Reply:* It’s important that the init-db.sh script runs, but I don’t think it is

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:26:56

*Thread Reply:* can you grab all the docker compose logs and share them? it’s hard to debug otherwise

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:29:59

*Thread Reply:*

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:33:15

*Thread Reply:* I would first suggest to remove the --build flag since you are specifying a version of Marquez to use via --tag

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:33:49

*Thread Reply:* not the issue per se, but it will help clear up some of the logs

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:35:06

*Thread Reply:* for sure thanks. we could get the logs without the --build portion, we tried with that option just once

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:35:40

*Thread Reply:* the errors were the same with/without --build option

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 17:36:02

*Thread Reply:*
```
marquez-api | ERROR [2023-10-27 21:34:58,019] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
marquez-api | ! at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:253)
marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:434)
marquez-api | ! at org.postgresql.Driver.connect(Driver.java:291)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346)
marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:768)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:696)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:495)
marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:153)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48)
marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.<init>(JdbcConnectionFactory.java:75)
marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147)
marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190)
marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:73)
marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:27)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:105)
marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:48)
marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67)
marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78)
marquez-api | ! at io.dropwizard.Application.run(Application.java:94)
marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:60)
marquez-api | INFO [2023-10-27 21:34:58,024] marquez.MarquezApp: Stopping app...
```

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:38:52

*Thread Reply:* debugging docker issues like this is so difficult

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:40:44

*Thread Reply:* it could be a number of things, but you are connected to the database it’s just that the marquez user hasn’t been created

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:41:59

*Thread Reply:* the /init-db.sh is what manages user creation

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:42:17

*Thread Reply:* so it’s possible that the script isn’t running for whatever reason on your Ec2 instance

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:44:20

*Thread Reply:* do you have other services running on that Ec2 instance? Like, other than Marquez

Willy Lulciuc (willy@datakin.com)
2023-10-27 17:44:52

*Thread Reply:* is there a postgres process running outside of docker?

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 20:34:50

*Thread Reply:* no other services except marquez on this EC2 instance

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 20:35:49

*Thread Reply:* this was a new Ec2 instance that was spun up to install and use marquez

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-27 20:36:09

*Thread Reply:* and we can confirm that no postgres process runs outside of docker

Jason Yip (jasonyip@gmail.com)
2023-10-29 03:06:28

I realize that in Spark 3.4+, some job ids don't have a start event. What part of the code is responsible for triggering the START and COMPLETE events?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-10-30 09:59:53

*Thread Reply:* hi @Jason Yip could you provide an example of such a job?

Jason Yip (jasonyip@gmail.com)
2023-10-30 16:51:55

*Thread Reply:* @Paweł Leszczyński same old:

```
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS transactions')

# expected structure of the file
transactions_schema = StructType([
    StructField('household_id', IntegerType()),
    StructField('basket_id', LongType()),
    StructField('day', IntegerType()),
    StructField('product_id', IntegerType()),
    StructField('quantity', IntegerType()),
    StructField('sales_amount', FloatType()),
    StructField('store_id', IntegerType()),
    StructField('discount_amount', FloatType()),
    StructField('transaction_time', IntegerType()),
    StructField('week_no', IntegerType()),
    StructField('coupon_discount', FloatType()),
    StructField('coupon_discount_match', FloatType())
])

# read data to dataframe
df = (spark
    .read
    .csv(
        adlsRootPath + '/examples/data/csv/completejourney/transaction_data.csv',
        header=True,
        schema=transactions_schema))

df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('overwriteSchema', 'true') \
    .option('path', adlsRootPath + '/examples/data/csv/completejourney/silver/transactions') \
    .saveAsTable('transactions')

df.count()

# create table object to make delta lake queryable
_ = spark.sql(f'''
    CREATE TABLE transactions
    USING DELTA
    LOCATION '{adlsRootPath}/examples/data/csv/completejourney/silver/transactions'
''')

# show data
display(spark.table('transactions'))
```

John Lukenoff (john@jlukenoff.com)
2023-10-30 18:51:43

👋 Hi team, cross-posting from the Marquez Channel in case anyone here has a better idea of the spec

> For most of our lineage extractors in airflow, we are using the rust sql parser from openlineage-sql to extract table lineage via sql statements. When errors occur we are adding an extractionError run facet similar to what is being done here. I’m finding in the case that multiple statements were extracted but one failed to parse while many others were successful, the lineage for these runs doesn’t appear as expected in Marquez. Is there any logic around the extractionError run facet that could be causing this? It seems reasonable to assume that we might take this to mean the entire run event is invalid if we have any extraction errors. > > I would still expect to see the other lineage we sent for the run but am instead just seeing the extractionError in the marquez UI, in the database, runs with an extractionError facet don’t seem to make it to the job_versions_io_mapping table

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-31 06:34:05

*Thread Reply:* Can you show the actual event? Should be in the events tab in Marquez

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-31 11:59:07

*Thread Reply:* @John Lukenoff, would you mind posting the link to the Marquez team's Slack channel?

John Lukenoff (john@jlukenoff.com)
2023-10-31 12:15:37

*Thread Reply:* yep here is the link: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1698702140709439

This is the full event, sanitized of internal info:
```
{
  "job": {
    "name": "some_dag.some_task",
    "facets": {},
    "namespace": "default"
  },
  "run": {
    "runId": "a9565df2-f1a1-3ee3-b202-7626f8c4b92d",
    "facets": {
      "extractionError": {
        "errors": [
          {
            "task": "ALTER SESSION UNSET QUERY_TAG;",
            "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.24.0/client/python",
            "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/BaseFacet",
            "taskNumber": 0,
            "errorMessage": "Expected one of TABLE or INDEX, found: SESSION"
          }
        ],
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.24.0/client/python",
        "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/ExtractionErrorRunFacet",
        "totalTasks": 1,
        "failedTasks": 1
      }
    }
  },
  "inputs": [
    { "name": "foo.bar", "facets": {}, "namespace": "snowflake" },
    { "name": "fizz.buzz", "facets": {}, "namespace": "snowflake" }
  ],
  "outputs": [
    { "name": "foo1.bar2", "facets": {}, "namespace": "snowflake" },
    { "name": "fizz1.buzz2", "facets": {}, "namespace": "snowflake" }
  ],
  "producer": "https://github.com/MyCompany/repo/blob/next-master/company/data/pipelines/airflow_utils/openlineage_utils/client.py",
  "eventTime": "2023-10-30T02:46:13.367274Z",
  "eventType": "COMPLETE"
}
```

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-31 12:43:07

*Thread Reply:* thank you!

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-31 13:14:29

*Thread Reply:* @John Lukenoff, sorry to trouble you again - is the Slack channel still active? For whatever reason I can't get to this workspace

John Lukenoff (john@jlukenoff.com)
2023-10-31 13:15:26

*Thread Reply:* yep it’s still active, maybe you need to join the workspace first? https://join.slack.com/t/marquezproject/shared_invite/zt-266fdhg9g-TE7e0p~EHK50GJMMqNH4tg

Kavitha (kkandaswamy@cardinalcommerce.com)
2023-10-31 13:25:51

*Thread Reply:* that was a good call. the link you just shared worked! thank you!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-31 13:27:55

*Thread Reply:* yeah from OL perspective this looks good - the inputs and outputs are there, the extraction error facet looks like it should

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-10-31 13:28:05

*Thread Reply:* must be some Marquez hiccup 🙂

👍 John Lukenoff
John Lukenoff (john@jlukenoff.com)
2023-10-31 13:28:45

*Thread Reply:* Makes sense, I’ll tail my marquez logs today to see if I can find anything

John Lukenoff (john@jlukenoff.com)
2023-11-01 19:37:06

*Thread Reply:* Somehow this started working after we switched from our beta to prod infrastructure. I suspect something was failing due to constraints on the size of our db and the load of poor quality data it was under after months of testing against it

Michael Robinson (michael.robinson@astronomer.io)
2023-11-01 11:34:43

@channel I’m opening a vote to release OpenLineage 1.5.0, including: • support for Cassandra Connectors lineage in the Flink integration • support for Databricks Runtime 13.3 in the Spark integration • support for rdd and toDF operations from the Spark Scala API in Spark • lowered requirements for attrs and requests packages in the Airflow integration • lazy rendering of yaml configs in the dbt integration • bug fixes, tests, infra fixes, doc changes, and more. Three +1s from committers will authorize an immediate release.

➕ Jakub Dardziński, William Angel, Abdallah, Willy Lulciuc, Paweł Leszczyński, Julien Le Dem
👍 Jason Yip
🚀 Luca Soato, tati
Michael Robinson (michael.robinson@astronomer.io)
2023-11-02 05:11:58

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

Michael Robinson (michael.robinson@astronomer.io)
2023-11-01 13:29:09

@channel The October 2023 issue of OpenLineage News is available now! Sign up to get it directly in your inbox each month.

👍 Mars Lan (Metaphor), harsh loomba
🎉 tati
John Lukenoff (john@jlukenoff.com)
2023-11-01 19:40:39

Hi team 👋 , we’re finding that for our Spark jobs we are almost always getting some junk characters in our dataset names. We’ve pushed the regex filter to its limits and would like to extend the logic of deriving the dataset name in openlineage-spark (currently on 1.4.1). I seem to recall hearing we could do this by implementing our own LogicalPlanVisitor or something along those lines? Is that still the recommended approach and if so would this be possible to implement in Scala vs. Java (scala noob here 🙂)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-02 03:34:15

*Thread Reply:* Hi John, we're always happy to help with the contribution.

One of the possible solutions to this would be to do that just in the openlineage-java client:
• introduce a config entry like normalizeDatasetNameToAscii: enabled/disabled
• modify the DatasetIdentifier class to contain a static boolean member normalizeDatasetNameToAscii and normalize the dataset name according to this setting
• additionally, you would need to add the config entry in io.openlineage.client.OpenLineageYaml and make sure both loadOpenLineageYaml methods set DatasetIdentifier.normalizeDatasetNameToAscii based on the config
• document this in the doc
So, no Scala nor custom logical plan visitors required.
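
The normalization step itself could look like the following sketch - shown in Python for illustration only, since the actual change would live in the Java client's DatasetIdentifier:

```
# Illustration of the normalizeDatasetNameToAscii idea; the real
# implementation would be in openlineage-java, not Python.
import unicodedata


def normalize_dataset_name_to_ascii(name: str) -> str:
    # Decompose accented characters, then drop anything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


assert normalize_dataset_name_to_ascii("café_sales") == "cafe_sales"
```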

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-02 03:34:47

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/utils/DatasetIdentifier.java

🙌 John Lukenoff
Mike Fang (fangmik@amazon.com)
2023-11-01 20:30:38

I am looking to send OpenLineage events to an AWS API Gateway endpoint from an AWS MWAA instance. The problem is that all requests to AWS services need to be signed with SigV4, and using API Gateway with IAM authentication would require requests to API Gateway be signed with SigV4. Would the best way to do so be to just modify the python client HTTP transport to include a new config option for signing emitted OpenLineage events with SigV4? Are there any alternatives?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-02 02:41:50

*Thread Reply:* there’s actually an issue for that: https://github.com/OpenLineage/OpenLineage/issues/2189

but the way to do this is imho to create new custom transport (it might inherit from HTTP transport) and register it in transport factory

Mike Fang (fangmik@amazon.com)
2023-11-02 13:05:05

*Thread Reply:* I am thinking of just modifying the HTTP transport and using requests.auth.AuthBase to create different auth methods instead of a TokenProvider class

Classes which subclass requests.auth.AuthBase can also just directly be given to the requests call in the auth parameter

👍 Jakub Dardziński
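
A minimal sketch of that idea, assuming boto3/botocore are available - the region and service names are placeholders:

```
# Hedged sketch: a requests.auth.AuthBase that SigV4-signs outgoing requests;
# an HTTP-based transport could pass an instance via the `auth` parameter.
import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest


class SigV4Signer(requests.auth.AuthBase):
    def __init__(self, region: str = "us-east-1", service: str = "execute-api"):
        self.credentials = boto3.Session().get_credentials()
        self.region = region
        self.service = service

    def __call__(self, r: requests.PreparedRequest) -> requests.PreparedRequest:
        # Sign a copy of the request and merge the auth headers back in.
        aws_request = AWSRequest(method=r.method, url=r.url, data=r.body, headers=dict(r.headers))
        SigV4Auth(self.credentials, self.service, self.region).add_auth(aws_request)
        r.headers.update(dict(aws_request.headers))
        return r


# e.g.: requests.post(endpoint_url, json=event_dict, auth=SigV4Signer(region="us-west-2"))
```
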
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-02 14:40:24

*Thread Reply:* would you like to contribute? 🙂

Mike Fang (fangmik@amazon.com)
2023-11-02 14:43:05

*Thread Reply:* I was about to contribute, but I actually just realized that there is an existing way to provide a custom transport that would solve for my use case. My only question is how do I register this custom transport in my MWAA environment? Can I provide the custom transport as an Airflow plugin and then specify the class in the openlineage.yml config? Will it automatically pick it up?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-02 15:45:56

*Thread Reply:* although I did not test this in MWAA but locally only: I've created an Airflow plugin that in __init__.py has defined (or imported) the following code:
```
from openlineage.client.transport import register_transport, Transport, Config


@register_transport
class FakeTransport(Transport):
    kind = "fake"
    config = Config

    def __init__(self, config: Config) -> None:
        print(config)

    def emit(self, event) -> None:
        print(event)
```

setting AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "fake"}' does take effect and I can see output in Airflow logs

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-02 15:47:45

*Thread Reply:* in setup.py it's:
```
...,
    entry_points={
        'airflow.plugins': [
            'custom_transport = custom_transport:CustomTransportPlugin',
        ],
    },
    install_requires=["openlineage-python"]
)
```

Mike Fang (fangmik@amazon.com)
2023-11-03 12:52:55

*Thread Reply:* ok great thanks for following up on this, super helpful

Michael Robinson (michael.robinson@astronomer.io)
2023-11-02 12:00:00

@channel We released OpenLineage 1.5.0, including: • support for Cassandra Connectors lineage in the Flink integration by @Peter Huang • support for Databricks Runtime 13.3 in the Spark integration by @Paweł Leszczyński • support for rdd and toDF operations from the Spark Scala API in Spark by @Paweł Leszczyński • lowered requirements for attrs and requests packages in the Airflow integration by @Jakub Dardziński • lazy rendering of yaml configs in the dbt integration by @Jakub Dardziński • bug fixes, tests, infra fixes, doc changes, and more. Thanks to all the contributors, including new contributor @Sophie LY! Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.5.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.4.1...1.5.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

👍 Jason Yip, Sophie LY, Tristan GUEZENNEC -CROIX-, Mars Lan (Metaphor), Sangeeta Mishra
🚀 tati
Jason Yip (jasonyip@gmail.com)
2023-11-02 14:49:18

@Paweł Leszczyński I tested 1.5.0, it works great now, but the environment facet is gone in START... which I very much want... any thoughts?

Jason Yip (jasonyip@gmail.com)
2023-11-03 04:18:11

actually, it shows up in one of the RUNNING events now... behavior is consistent between 11.3 and 13.3, thanks for fixing this issue

👍 Paweł Leszczyński
Jason Yip (jasonyip@gmail.com)
2023-11-04 15:44:22

*Thread Reply:* @Paweł Leszczyński looks like I need to bring bad news.. 13.3 is fixed for specific scenarios, but 11.3 is still reading output as dbfs.. there are scenarios where it's not producing input and output, like:

create table table using delta as location 'abfss://....'
Select * from parquet.`abfss://....`

Jason Yip (jasonyip@gmail.com)
2023-11-04 15:44:31

*Thread Reply:* Will test more and open issues

Rodrigo Maia (rodrigo.maia@manta.io)
2023-11-06 05:34:33

*Thread Reply:* @Jason Yip how did you manage to get the environment attribute? It's not showing up for me at all. I've tried Databricks but also tried a local instance of Spark.

Jason Yip (jasonyip@gmail.com)
2023-11-07 18:32:02

*Thread Reply:* @Rodrigo Maia its showing up in one of the RUNNING events, not in the START event anymore

Rodrigo Maia (rodrigo.maia@manta.io)
2023-11-08 03:04:32

*Thread Reply:* I never had a running event 🫠 Am I filtering something?

Jason Yip (jasonyip@gmail.com)
2023-11-08 13:03:26

*Thread Reply:* Umm.. ok show me your code, will try on my end

Jason Yip (jasonyip@gmail.com)
2023-11-08 14:26:06

*Thread Reply:* @Paweł Leszczyński @Rodrigo Maia actually if you are using UC-enabled cluster, you won't get any RUNNING events

Michael Robinson (michael.robinson@astronomer.io)
2023-11-03 12:00:07

@channel This month’s TSC meeting (open to all) is next Thursday the 9th at 10am PT. On the agenda: • announcements • recent releases • recent additions to the Flink integration by @Peter Huang • recent additions to the Spark integration by @Paweł Leszczyński • updates on proposals by @Julien Le Dem • discussion topics • open discussion More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

👍 harsh loomba
priya narayana (n.priya88@gmail.com)
2023-11-04 07:08:10

Hi Team, we are trying to customize the events by writing a custom lineage listener extending OpenLineageSparkListener, but would need some direction on how to capture the events

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-04 07:11:46

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1698315220142929 Do you need some more guidance than that?

priya narayana (n.priya88@gmail.com)
2023-11-04 07:13:47

*Thread Reply:* yes

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-04 07:15:21

*Thread Reply:* It seems pretty extensively described, what kind of help do you need?

priya narayana (n.priya88@gmail.com)
2023-11-04 07:16:13

*Thread Reply:* io.openlineage.spark.api.OpenLineageEventHandlerFactory - if I use this, how will I pass the custom listener to my spark-submit?

priya narayana (n.priya88@gmail.com)
2023-11-04 07:17:25

*Thread Reply:* I would like to know how I can customize my events using this. For example: in the "input" facet I want only the symlinks name; I am not interested in anything else

priya narayana (n.priya88@gmail.com)
2023-11-04 07:17:32

*Thread Reply:* can you please provide some guidance

priya narayana (n.priya88@gmail.com)
2023-11-04 07:18:36

*Thread Reply:* @Jakub Dardziński this is the doubt i have

priya narayana (n.priya88@gmail.com)
2023-11-04 08:17:25

*Thread Reply:* Someone who did the Spark integration, please throw some light

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-04 08:21:22

*Thread Reply:* it's weekend for most of us so you probably need to wait until Monday for precise answers

David Goss (david.goss@matillion.com)
2023-11-06 04:03:42

👋 I raised a PR https://github.com/OpenLineage/OpenLineage/pull/2223 off the back of some Marquez conversations a while back to try and clarify how names of Snowflake objects should be expressed in OL events. I used Snowflake’s OL view as a guide, but also I appreciate there are other OL producers that involve Snowflake too (Airflow? dbt?). Any feedback on this would be appreciated!

David Goss (david.goss@matillion.com)
2023-11-08 10:42:35

*Thread Reply:* Thanks for merging this @Maciej Obuchowski!

👍 Maciej Obuchowski
Athitya Kumar (athityakumar@gmail.com)
2023-11-06 05:22:03

Hey team! 👋

We're trying to use openlineage-flink, and would like to provide the openlineage.transport.type=http and configure other transport configs, but we're not able to find sufficient docs (tried this doc) on where/how these configs can be provided.

For example, in Spark, the changes were mostly delegated to the spark-submit command, like:
```
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
    --packages "io.openlineage:openlineage-spark:<spark-openlineage-version>" \
    --conf "spark.openlineage.transport.url=http://{openlineage.client.host}/api/v1/namespaces/spark_integration/" \
    --class com.mycompany.MySparkApp my_application.jar
```
And the OpenLineageSparkListener has a method to retrieve the provided Spark confs as an object in the ArgumentParser. Similarly, we're looking for some pointers on how the openlineage.transport configs can be provided to OpenLineageFlinkJobListener & how the Flink listener parses/uses these configs

TIA! 😄

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-07 05:56:09

*Thread Reply:* similarly to spark config, you can use flink config

Athitya Kumar (athityakumar@gmail.com)
2023-11-07 22:36:53

*Thread Reply:* @Maciej Obuchowski - Got it. Our use-case is that we're trying to build a wrapper on top of openlineage-flink for productionising for our flink jobs.

We're trying to have a wrapper class that extends the OpenLineageFlinkJobListener class and overrides the HTTP transport endpoint/url to a constant value (say, example.com and /api/v1/flink). But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private. If it was just a default scope, can we contribute a PR to make it public, to make it friendly for teams trying to adopt & extend openlineage?

And also, we wanted to understand better on where we're reading the HTTP transport endpoint/url configs in OpenLineageFlinkJobListener and what'd be the best place to override it to the constant endpoint/url for our use-case

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-08 05:55:43

*Thread Reply:* We parse flink conf to get that information: https://github.com/OpenLineage/OpenLineage/blob/26494b596e9669d2ada164066a73c44e04[…]ink/src/main/java/io/openlineage/flink/client/EventEmitter.java

> But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private.
The way to construct it is the public builder in the same class

I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen?

Athitya Kumar (athityakumar@gmail.com)
2023-11-09 12:41:02

*Thread Reply:* > I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen? @Maciej Obuchowski - The reasoning behind going with a wrapper class is that we can abstract out the nitty-gritty like how/where we're publishing openlineage events etc - especially for companies that have a lot of teams that may be adopting openlineage.

For example, if we wanna move away from http transport to kafka transport - we'd be changing only this wrapper class and ask folks to update their wrapper class dependency version. If we went without the wrapper class, then the exact config changes would need to be synced and done by many different teams, who may not have enough context.

Similarly, if we wanna enable some other default best-practise configs, or inject any company-specific configs etc, the wrapper would be useful in abstracting out the details and be the 1 place that handles all openlineage related integrations for any future changes.

That's why we wanna extend openlineage's listener class & leverage most of the OSS code as-is; and at the same time, have the ability to extend & inject customisations. I think that's where some things like having getters for the class object attributes, or having public constructors would be really helpful 😄

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-09 13:03:56

*Thread Reply:* @Athitya Kumar that makes sense. Feel free to provide PR adding getters and stuff.

🎉 Athitya Kumar
Athitya Kumar (athityakumar@gmail.com)
2023-11-22 10:47:32

*Thread Reply:* Created an issue for the same: https://github.com/OpenLineage/OpenLineage/issues/2273

@Maciej Obuchowski - Can you assign this issue to me, if the proposal looks good?

👍 Maciej Obuchowski
Yannick Libert (yannick.libert.partner@decathlon.com)
2023-11-07 06:03:49

Hi all, we (I work with @Sophie LY and @Abdallah) have a quick question regarding the spark integration: if a spark app contains several jobs, they will be named "my_spark_app_name.job1" and "my_spark_app_name.job2", e.g.: spark_job.collect_limit, spark_job.map_partitions_parallel_collection

If I understood correctly, the spark integration maps one Spark job to a single OpenLineage Job, and the application itself should be assigned a Run id at startup and each job that executes will report the application's Run id as its parent job run (taken from: https://openlineage.io/docs/integrations/spark/).

In our case, the app Run Id is never created, and the job runs don't contain any parent facets. We tested it with a recent integration version, 1.4.1, and also an older one (0.26.0). Did we miss something in the OL spark integration config?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-07 06:07:51

*Thread Reply:* hey, a name of the output dataset should be put at the end of the job name. This was introduced to help with jobs that call multiple spark actions

Yannick Libert (yannick.libert.partner@decathlon.com)
2023-11-07 07:05:52

*Thread Reply:* Hi Paweł, Thanks for your answer, yes indeed with the newer version of OL, we automatically have the name of the output dataset at the end of the job name, but no App run id, nor any parent run facet.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-07 08:16:44

*Thread Reply:* yes, you're right. I mean you can set spark.openlineage.parentJobName in the config, which will be shared through the whole app run, but this needs to be set manually
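
For reference, a sketch of setting that config when building the session - the URL and names are placeholders:

```
# Sketch: sharing one parent job name across all jobs in the app run.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my_namespace")
    .config("spark.openlineage.parentJobName", "my_spark_app")
    .getOrCreate())
```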

Yannick Libert (yannick.libert.partner@decathlon.com)
2023-11-07 08:36:58

*Thread Reply:* I see, thanks a lot for your reply we'll try that

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-07 10:49:25

if I have a dataset on adls gen2 which synapse connects to as an external delta table, is that the use case of a symlink dataset? the delta table is connected to by PBI and by Synapse, but the underlying data is exactly the same

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-08 10:49:04

*Thread Reply:* Sounds like it, yes - if the logical dataset names are different but the physical one is the same
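
A sketch of how that can be expressed with the symlinks facet in the Python client - the namespaces and names below are made up:

```
# Hedged sketch: one physical Delta location on ADLS exposed under a second
# logical name via the symlinks facet.
from openlineage.client.run import Dataset
from openlineage.client.facet import (
    SymlinksDatasetFacet,
    SymlinksDatasetFacetIdentifiers,
)

dataset = Dataset(
    namespace="abfss://container@account.dfs.core.windows.net",
    name="/delta/my_table",
    facets={
        "symlinks": SymlinksDatasetFacet(
            identifiers=[
                SymlinksDatasetFacetIdentifiers(
                    namespace="synapse://my-workspace",
                    name="dbo.my_external_table",
                    type="TABLE",
                )
            ]
        )
    },
)
```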

Rodrigo Maia (rodrigo.maia@manta.io)
2023-11-08 12:38:52

Has anyone here tried OpenLineage with Spark on Amazon EMR?

Jason Yip (jasonyip@gmail.com)
2023-11-08 13:01:16

*Thread Reply:* No, but it should work the same; I tried on AWS, Google Colab, and Azure

👍 Jakub Dardziński
Tristan GUEZENNEC -CROIX- (tristan.guezennec@decathlon.com)
2023-11-09 03:10:54

*Thread Reply:* Yes. @Abdallah could provide some details if needed.

👍 Abdallah
🔥 Maciej Obuchowski
Rodrigo Maia (rodrigo.maia@manta.io)
2023-11-20 11:29:26

*Thread Reply:* Thanks @Tristan GUEZENNEC -CROIX- Hi @Abdallah, I was able to set up a Spark cluster on AWS EMR but I'm struggling to configure the OL listener. I've tried with steps and bootstrap actions for the jar and it didn't work out. How did you manage to include the jar? Besides, what about the Spark configuration? Could you send me a sample of these configs?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-11-28 03:52:44

*Thread Reply:* Hi @Abdallah. I've sent you a message with some more information. If you could provide some more details, that would be awesome. Thank you 😄

Abdallah (abdallah@terrab.me)
2023-11-28 04:14:14

*Thread Reply:* Hi, sorry for the late reply. I didn't see your message at the right time.

Abdallah (abdallah@terrab.me)
2023-11-28 04:16:02

*Thread Reply:* If you want to test OL quickly without a bootstrap action, you can use the following submit.json:

```
[
  {
    "Name": "MyJob",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--packages", "io.openlineage:openlineage-spark:1.2.2",
      "--conf", "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener",
      "--conf", "spark.openlineage.transport.type=http",
      "--conf", "spark.openlineage.transport.url=<OPENLINEAGE-URL>",
      "--conf", "spark.openlineage.version=v1",
      "path/to/your/job.py"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "CUSTOM_JAR"
  }
]
```

with

```
aws emr add-steps --cluster-id <cluster-id> --steps file://<path-to-your-json>/submit.json
```

Abdallah (abdallah@terrab.me)
2023-11-28 04:19:20

*Thread Reply:* "--packages","io.openlineage:openlineage_spark:1.2.2" Need to be mentioned before the creation of the spark session

Michael Robinson (michael.robinson@astronomer.io)
2023-11-08 12:44:54

@channel Friendly reminder: this month’s TSC meeting, open to all, is tomorrow at 10 am PT: https://openlineage.slack.com/archives/C01CK9T7HKR/p1699027207361229

👍 Jakub Dardziński
Jason Yip (jasonyip@gmail.com)
2023-11-10 15:25:45

@Paweł Leszczyński regarding https://github.com/OpenLineage/OpenLineage/issues/2124, OL is parsing out the table location in the Hive metastore; it is the location of the table in the catalog and not the physical location of the data. It is both right and wrong, because it is a table, just an external table.

https://docs.databricks.com/en/sql/language-manual/sql-ref-external-tables.html

Jason Yip (jasonyip@gmail.com)
2023-11-10 15:32:28

*Thread Reply:* Here's for more reference: https://dilorom.medium.com/finding-the-path-to-a-table-in-databricks-2c74c6009dbb

Jason Yip (jasonyip@gmail.com)
2023-11-11 03:29:33

@Paweł Leszczyński this is why, if you create a table with an ADLS location, it won't show input and output:

https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark35/src[…]k35/agent/lifecycle/plan/CreateReplaceOutputDatasetBuilder.java

Because the catalog object is not there.

Jason Yip (jasonyip@gmail.com)
2023-11-11 03:30:44

It seems like the integration needs to be re-written in a way that supports Databricks

Jason Yip (jasonyip@gmail.com)
2023-11-13 03:00:42

@Paweł Leszczyński I went back to 1.4.1, output does show adls location. But environment facet is gone in 1.4.1. It shows up in 1.5.0 but namespace is back to dbfs....

Jason Yip (jasonyip@gmail.com)
2023-11-13 03:18:37

@Paweł Leszczyński I diffed CreateReplaceDatasetBuilder.java and CreateReplaceOutputDatasetBuilder.java and they are the same except for the class name, so I am not sure what is causing the change. I also realize you don't have a test case for ADLS

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-13 04:52:07

*Thread Reply:* Thanks @Jason Yip for your engagement in finding the cause and solution to this issue.

Among the technical problems, another problem here is that our databricks integration tests are run on AWS and the issue you describe occurs in Azure. I would consider this a primary issue as it is difficult for me to verify the behaviour you describe and fix it with a failing integration test at the start.

Are you able to reproduce the issue on an AWS Databricks environment so that we could include it in our integration tests and make sure the behaviour will not change later on in the future?

Jason Yip (jasonyip@gmail.com)
2023-11-13 18:06:44

*Thread Reply:* I didn't know Azure and AWS Databricks are different. Let me try it on AWS as well

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:51:38

*Thread Reply:* @Paweł Leszczyński finally got a chance to run it, but it's a different script; it's pretty interesting

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:51:59

*Thread Reply:* "inputs": [ { "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", "name": "AdultCensusIncome.parquet" } ], "outputs": [ { "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", "name": "AdultCensusIncome.parquet" }

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:52:19

*Thread Reply:* df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:52:41

*Thread Reply:* it somehow got the input path as output path 😲

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:53:53

*Thread Reply:* here's the full script:

```
df = spark.read.parquet("")

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "")

df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")
```

Jason Yip (jasonyip@gmail.com)
2023-11-23 04:54:23

*Thread Reply:* yep, just 3 lines, will try it on Azure as well

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-23 08:14:23

*Thread Reply:* just to clarify, were you able to reproduce this issue just on AWS Databricks using S3? Asking this again because this is how our integration test environment looks.

Jason Yip (jasonyip@gmail.com)
2023-11-23 18:28:36

*Thread Reply:* @Paweł Leszczyński the issue is different. The issue before on Azure was that it says it's dbfs despite it being wasbs. This time around the destination is s3, but it says wasbs... it somehow took the path from the inputs

Jason Yip (jasonyip@gmail.com)
2023-11-23 18:28:52

*Thread Reply:* I have not tested this new script on Azure yet

Jason Yip (jasonyip@gmail.com)
2023-11-24 03:53:21

*Thread Reply:* @Paweł Leszczyński tried on Azure, results is the same

Jason Yip (jasonyip@gmail.com)
2023-11-24 03:53:24

*Thread Reply:* df = spark.read.parquet("")

```
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('overwriteSchema', 'true') \
    .option('path', adlsRootPath + '/examples/data/parquet/AdultCensusIncome/silver/AdultCensusIncome') \
    .saveAsTable('AdultCensusIncome')
```

Jason Yip (jasonyip@gmail.com)
2023-11-24 03:54:05

*Thread Reply:* "outputs": [ { "namespace": "dbfs", "name": "/user/hive/warehouse/adultcensusincome",

Jason Yip (jasonyip@gmail.com)
2023-11-24 03:54:24

*Thread Reply:* Please note that 1.4.1 correctly identified the output as wasbs

Jason Yip (jasonyip@gmail.com)
2023-11-24 04:14:39

*Thread Reply:* so to summarize, on Azure it'd become dbfs

Jason Yip (jasonyip@gmail.com)
2023-11-24 04:15:01

*Thread Reply:* on AWS, it somehow becomes the same as input

Jason Yip (jasonyip@gmail.com)
2023-11-24 04:15:34

*Thread Reply:* 1.4.1 Azure is fine, I have not tested 1.4.1 on AWS

Naresh reddy (naresh.naresh36@gmail.com)
2023-11-15 07:17:24

Hi, can anyone point me to the deck on how Airflow can be integrated using OpenLineage?

Naresh reddy (naresh.naresh36@gmail.com)
2023-11-15 07:27:55

*Thread Reply:* thank you @Maciej Obuchowski

Naresh reddy (naresh.naresh36@gmail.com)
2023-11-15 11:09:24

Can anyone tell me why OL is better than other competitors? If you can provide an analysis, that would be great

Harel Shein (harel.shein@gmail.com)
2023-11-16 11:46:16

*Thread Reply:* Hey @Naresh reddy can you help me understand what you mean by competitors? OL is a specification that can be used to solve various problems, so if you have a clear problem statement, maybe I can help with pros/cons for that problem

Naresh reddy (naresh.naresh36@gmail.com)
2023-11-22 23:49:05

*Thread Reply:* I wanted to integrate Airflow using OL but wanted to understand the pros and cons of OL; if you can shed light on that, it would be great

Harel Shein (harel.shein@gmail.com)
2023-11-27 19:11:54

*Thread Reply:* Airflow supports OL natively via a provider since 2.7. But it’s hard for me to tell you pros/cons without understanding your use case

👀 Naresh reddy
Naresh reddy (naresh.naresh36@gmail.com)
2023-11-15 11:10:58

What are the pros and cons of OL? We often talk about positives to market it, but what are the pain points of using OL, and how is it addressing user issues?

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 13:38:42

*Thread Reply:* Hi @Naresh reddy, thanks for your question. We’ve heard that OpenLineage is attractive because of its desirable integrations, including a best-in-class Spark integration, its extensibility, the fact that it’s not destructive, and the fact that it’s open source. I’m not aware of pain points per se, but there are certainly features and integrations that we wish we could focus on but can’t at the moment — like the Dagster integration, which needs a new maintainer. OpenLineage is like any other open standard in that ecosystem coverage is a constant process rather than a journey, and it requires contributions in order to get close to 100%. Thankfully, we are gaining users and contributors all the time, and integrations are being added or improved upon daily. See the Ecosystem page on the website for a list of consumers and producers and links to more resources, and check out the GitHub repo for the codebase, commit history, contributors, governance procedures, and more. We’re quick to respond to messages here and issues on GitHub — usually within one day.

🙌 Naresh reddy
karthik nandagiri (karthik.nandagiri@gmail.com)
2023-11-19 23:57:38

Hi, so we can use OpenLineage to identify column-level lineage with Airflow and Spark? Will it also allow connecting to Power BI and deriving the downstream column lineage?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-20 06:07:36

*Thread Reply:* Yes, it works with Airflow and Spark - there is caveat that amount of operators that support it on Airflow side is fairly small and limited generally to most popular SQL operators. > will it also allow to connect to Power BI and derive the downstream column lineage ? No, there is no such feature yet 🙂 However, there's nothing preventing this - if you wish to work on such implementation, we'd be happy to help.

karthik nandagiri (karthik.nandagiri@gmail.com)
2023-11-21 00:20:11

*Thread Reply:* Thank you, Maciej Obuchowski, for the update. Currently we are looking for a tool which can support connecting to Power BI and pulling column-level lineage information for reports and dashboards. How can this be achieved with OL? Can you give some idea?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 07:59:10

*Thread Reply:* I don't think I can help you with that now, unless you want to work on your own integration with PowerBI 🙁

Rafał Wójcik (rwojcik@griddynamics.com)
2023-11-21 07:02:08

Hi Everyone, first of all - a big shout-out to all contributors - you do an amazing job here. I want to use OpenLineage in our project - to do so, I want to set up a POC and experiment with the possibilities the library provides - I started working on the sample from the conference talk: https://github.com/getindata/openlineage-bbuzz2023-column-lineage but when I go into the Spark transformation after starting the context with OpenLineage, I have issues with SessionHiveMetaStoreClient in section 3 - does anyone have another plain sample to play with, to avoid setting everything up from scratch?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 07:37:00

*Thread Reply:* Can you provide details about those issues? Like exceptions, logs, details of the jobs and how do you run them?

Rafał Wójcik (rwojcik@griddynamics.com)
2023-11-21 07:45:37

*Thread Reply:* Hi @Maciej Obuchowski - I reran the docker container after deleting the metadata_db folder possibly created by another local test, and fixed that one, but got a problem with the OpenLineageListener during initialization of Spark. While I execute:
```
spark = (SparkSession.builder.master('local')
    .appName('Food Delivery')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.jars', '<local-path>/openlineage-spark-0.27.2.jar,<local-path>/postgresql-42.6.0.jar')
    .config('spark.openlineage.transport.type', 'http')
    .config('spark.openlineage.transport.url', 'http://api:5000')
    .config('spark.openlineage.facets.disabled', '[spark_unknown;spark.logicalPlan]')
    .config('spark.openlineage.namespace', 'food-delivery')
    .config('spark.sql.warehouse.dir', '/tmp/spark-warehouse/')
    .config("spark.sql.repl.eagerEval.enabled", True)
    .enableHiveSupport()
    .getOrCreate())
```
I got:
```
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Exception when registering SparkListener
	at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2563)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:643)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:467)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:218)
	at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2921)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2919)
	at org.apache.spark.SparkContext.$anonfun$setupAndStartListenerBus$1(SparkContext.scala:2552)
	at org.apache.spark.SparkContext.$anonfun$setupAndStartListenerBus$1$adapted(SparkContext.scala:2551)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2551)
	... 15 more
```
looks like for some reason the jars are not loaded - need to look into it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 07:58:09
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 07:58:28

*Thread Reply:* are you sure &lt;local-path&gt; is right?

Rafał Wójcik (rwojcik@griddynamics.com)
2023-11-21 08:00:49

*Thread Reply:* yes, it's the same as in the sample - wondering why it's not getting added:
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('Food Delivery')
    .config('spark.jars', '/home/jovyan/jars/openlineage-spark-0.27.2.jar,/home/jovyan/jars/postgresql-42.6.0.jar')
    .config('spark.sql.warehouse.dir', '/tmp/spark-warehouse/')
    .config("spark.sql.repl.eagerEval.enabled", True)
    .enableHiveSupport()
    .getOrCreate())

print(spark.sparkContext._jsc.sc().listJars())
```
which prints Vector()

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 08:04:31

*Thread Reply:* can you make sure jars are in this directory? just by docker run --entrypoint /usr/local/bin/bash IMAGE_NAME "ls /home/jovyan/jars"

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 08:06:27

*Thread Reply:* another option to try is to replace spark.jars with spark.jars.packages set to io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-21 08:16:54

*Thread Reply:* I think this was done for the purpose of the presentation, to make sure the demo would work without internet access. That can be the reason the jar was added manually to the docker image. openlineage-spark can be added to Spark via spark.jars.packages, like we do here: https://openlineage.io/docs/integrations/spark/quickstart_local

Rafał Wójcik (rwojcik@griddynamics.com)
2023-11-21 09:21:59

*Thread Reply:* got it guys - thanks a lot for the help - it turns out that the spark contexts from notebooks 2 and 3 had some kind of metadata conflict - when I combined those 2 and recreated the image to clean up the old metadata, it worked. One more note: sometimes kernels return weird results, but that may be caused by some local nuances - anyway, thanks!

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-22 05:18:53

Hi Everyone, I created a custom operator in Airflow to extract metadata of a file (size, creation time, modification time, and the like) and used it in my DAG. It is running fine in Airflow, saving the file's metadata to a CSV file, but I want that metadata info, which we are saving in the CSV file, to be shown in the Marquez UI as an extra facet. How can we add that extra facet so it appears in the Marquez UI? Thank you

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-22 05:20:34

*Thread Reply:* are you adding job or run facet?

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-22 05:21:32

*Thread Reply:* job facet

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-22 05:22:24

*Thread Reply:* that should be run facet, right? that’s dynamic value dependent on individual run

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-22 05:24:42

*Thread Reply:* yes yes sorry run facet

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-22 05:25:12

*Thread Reply:* it should appear then in Marquez as additional facet

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-22 05:31:31

*Thread Reply:* I can see these things already for the custom operator, but I want to add extra info under the root, like the metadata of the file

like ( file_name, size, modification time, creation time )

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-22 05:33:20

*Thread Reply:* yeah, for that you need to provide additional run facets under these keys you mentioned

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-23 01:31:01

*Thread Reply:* can you please tell me precisely where we can add an additional facet so it's visible on the UI?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-23 04:16:53

*Thread Reply:* custom extractors should return, from their extract methods, TaskMetadata which takes run_facets as an argument: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/base.py#L28
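
A sketch of what that could look like for the file-metadata case - the operator name, the facet key, and the collect_metadata() helper are made up:

```
# Hedged sketch for openlineage-airflow: a custom extractor attaching file
# metadata as an extra run facet. FileMetadataOperator and collect_metadata()
# are hypothetical.
import attr
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import BaseFacet


@attr.s
class FileMetadataRunFacet(BaseFacet):
    file_name = attr.ib()
    size = attr.ib()
    created_at = attr.ib()
    modified_at = attr.ib()


class FileMetadataExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        return ["FileMetadataOperator"]

    def extract(self) -> TaskMetadata:
        meta = self.operator.collect_metadata()  # hypothetical helper on the operator
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            run_facets={
                "fileMetadata": FileMetadataRunFacet(
                    file_name=meta["name"],
                    size=meta["size"],
                    created_at=meta["created"],
                    modified_at=meta["modified"],
                )
            },
        )
```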

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-30 06:34:55

*Thread Reply:* Hi @Jakub Dardziński, finally today I am able to get the extra facet in the Marquez UI using a custom operator and a custom extractor. Thanks for the help. It is a really nice community.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-30 06:43:55

*Thread Reply:* Hey, that’s great to hear! Sorry I didn’t answer yesterday but you managed on your own 🙂

Shahid Shaikh (ssshahidwin@gmail.com)
2023-11-30 10:51:09

*Thread Reply:* Yes, no problem, I tried and explored more by myself. Thanks 😊.

Rafał Wójcik (rwojcik@griddynamics.com)
2023-11-22 05:32:40

Hi Guys, one more question - as we fixed the sample from the previous thread, I started playing with the example - when I execute spark.sql('''SELECT * FROM public.restaurants r where r.name = "BBUZZ DONER"''').show() in a notebook, I get all raw events in Marquez, but all job.facets.sql fields are null - is there any way to capture the SQL query that we use in Spark? I know that we can pull this out from spark.logicalPlan, but plain SQL would be much more convenient

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-22 06:53:37

*Thread Reply:* I was looking into it some time ago and was not able to extract SQL from logical plan. It seemed to me that SQL string is translated into LogicalPlan before Openlineage code gets called and I wasn't able to find SQL anywhere

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-22 06:54:31

*Thread Reply:* Optional<String> query = ScalaConversionUtils.asJavaOptional(relation.jdbcOptions().parameters().get(JDBCOptions$.MODULE$.JDBC_QUERY_STRING())); 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-22 06:54:54

*Thread Reply:* ah, not only from JDBCRelation

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-22 08:03:52

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1250#issuecomment-1306865798

Michael Robinson (michael.robinson@astronomer.io)
2023-11-22 13:57:29

@general The next OpenLineage meetup is happening one week from today on November 29th in Warsaw/remote (it’s hybrid) at 5:30 pm CET (8:30 am PT). Sign-up and more details can be found here: https://www.meetup.com/warsaw-openlineage-meetup-group/events/296705558/. Hoping to see many of you there!

🙌 Harel Shein, Maciej Obuchowski, Sangeeta Mishra
:flag_pl: Harel Shein, Maciej Obuchowski, Paweł Leszczyński, Stefan Krawczyk
Ryhan Sunny (rsunny@altinity.com)
2023-11-23 12:26:31

Hi all, don’t miss out on the upcoming Open Source Analytics (virtual) Conference OSA Con 2023 - the place to be for cutting-edge insights into open-source analytics and AI! Learn and share development tips on the hottest open source projects, like Kafka, Airflow, Grafana, ClickHouse, and DuckDB. See who’s speaking and save your spot today at https://osacon.io/

❤️ Jarek Potiuk, Jakub Dardziński, Julien Le Dem, Willy Lulciuc
ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 10:47:55

with Airflow, I have operators and define input datasets and output datasets. If Task B uses the output dataset of Task A as its input dataset, does it overwrite metadata such as the documentation facet etc.?

should I ensure that the input dataset info is exactly the same between tasks, or do I move certain logic into the run facet?

for example, this input_facet is the output dataset from my previous task.

```
input_facet = {
    "dataSource": input_source,
    "schema": SchemaDatasetFacet(fields=fields),
    "deltaTable": DeltaTableFacet(
        path=self.source_model.dataset_uri,
        name=self.source_model.name,
        description=self.source_model.description,
        partition_columns=json.dumps(self.source_model.partition_columns or []),
        unique_constraint=json.dumps(self.source_model.unique_constraint or []),
        rows=self.source_delta_table.rows,
        file_count=self.source_delta_table.file_count,
        size=self.source_delta_table.size,
    ),
}
input_dataset = Dataset(
    namespace=input_namespace, name=input_name, facets=input_facet
)
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 10:51:55

*Thread Reply:* First question, how do you do that? Define get_openlineage_facets methods on your custom operators?

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 10:59:09

*Thread Reply:* yeah

I have this method:

```
def get_openlineage_facets_on_complete(self, task_instance: Any) -> Any:
    """Returns the OpenLineage facets for the task instance"""
    from airflow.providers.openlineage.extractors import OperatorLineage

    _ = task_instance
    inputs = self._create_input_datasets()
    outputs = self._create_output_datasets()
    run_facet = self._create_run_facet()
    job_facet = self._create_job_facet()
    return OperatorLineage(
        inputs=inputs, outputs=outputs, run_facets=run_facet, job_facets=job_facet
    )
```

and then I define each facet

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:00:33

*Thread Reply:* the input and output change

```
def _create_job_facet(self) -> dict[str, Any]:
    """Creates the Job facet for the OpenLineage Job"""
    from openlineage.client.facet import (
        DocumentationJobFacet,
        OwnershipJobFacet,
        OwnershipJobFacetOwners,
        SourceCodeJobFacet,
    )

    return {
        "documentation": DocumentationJobFacet(
            description=f"""Filters data from {self.source_model.dataset_uri} using
            Polars and writes the data to the path:
            {self.destination_model.dataset_uri}.
            """
        ),
        "ownership": OwnershipJobFacet(
            owners=[OwnershipJobFacetOwners(name=self.owner)]
        ),
        "sourceCode": SourceCodeJobFacet(
            language="python", source=self.transform_model_name
        ),
    }

def _create_run_facet(self) -> dict[str, Any]:
    """Creates the Run facet for the OpenLineage Run"""
    return {}
```
ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:02:07

*Thread Reply:* but a lot of the time I am reading a dataset and filtering it or selecting a subset of columns and saving a new dataset. I just want to make sure my input_dataset remains consistent, basically, since a lot of different airflow tasks might be using it

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:02:36

*Thread Reply:* and these are all custom operators

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:03:29

*Thread Reply:* Yeah, you should be okay sending those facets on output dataset only

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:07:19

*Thread Reply:* so just ignore the input_facet completely? or pass an empty list or something?

```
return OperatorLineage(
    inputs=None, outputs=outputs, run_facets=run_facet, job_facets=job_facet
)
```

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:09:53

*Thread Reply:* cool I'll try that, makes things cleaner for sure

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:10:20

*Thread Reply:* pass input datasets, just don't include redundant facets

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:10:39

*Thread Reply:* if you won't pass datasets, you won't get lineage
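E.g. something like this (reusing the names from the snippets above):
```
# Keep the dataset reference so lineage is built, but leave out facets
# that the upstream task already attached to this dataset as its output.
input_dataset = Dataset(namespace=input_namespace, name=input_name)

return OperatorLineage(
    inputs=[input_dataset],
    outputs=outputs,
    run_facets=run_facet,
    job_facets=job_facet,
)
```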

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:26:12

*Thread Reply:* got it, thanks

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 01:12:47

I am trying to run this spark script but my spark context is stopping on its own. Without the OpenLineage configuration (listener) the spark script is working fine. I need the configuration to integrate with OpenLineage.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.+')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:3000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

Damien Hawes (damien.hawes@booking.com)
2023-11-28 06:19:40

*Thread Reply:* Hello @Haneefa tasneem - can you confirm which version of Spark you're running?

Damien Hawes (damien.hawes@booking.com)
2023-11-28 06:20:44

*Thread Reply:* Additionally, I noticed this line:

`.config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.+')`
Could you try changing the version to 1.5.0 instead of 0.3.+?
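i.e. (note the artifact is published on Maven Central as openlineage-spark, with a hyphen):
```
.config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
```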

👍 Paweł Leszczyński
Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:04:24

*Thread Reply:* Hello. My Spark version is 3.5.0.
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.3.1/openlineage-spark-0.3.1.jar']
    files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             .config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.1')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:04:45

*Thread Reply:* above is the modified code

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:05:20

*Thread Reply:* It seems the issue is with the listener in spark config

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:05:55

*Thread Reply:* please do let me know if i should make any changes

Damien Hawes (damien.hawes@booking.com)
2023-11-28 08:20:46

*Thread Reply:* Yeah - I suspect its because of the version of the connector that you're using.

You're using 0.3.1, please try it with 1.5.0.

🚀 Paweł Leszczyński
Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 12:27:02

*Thread Reply:* Yes, it seems the issue was with the version. Thank you, it's resolved now

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 03:29:09

*Thread Reply:* Hi. I was able to see some metadata information on Marquez. I wanted to know if there is a way I can see the lineage of the data as it's getting transformed? As in, we are running a query here; I wanted to know if we can see the lineage of the dataset. I tried modifying the code like this:
```
# Save Query 1 result to the desired location
query_df.write.mode("overwrite").csv(output_path + "query_df")

# Register the DataFrame as a temporary SQL table
query_df.write.mode("overwrite").saveAsTable("temp_table")

# Query 1: Add a new column derived from existing columns
query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

# Register the DataFrame as a temporary SQL table
query1_df.write.mode("overwrite").saveAsTable("temp_table2")

# Save Query 1 result to the desired location
query1_df.write.mode("overwrite").csv(output_path + "query1_df")
```
Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 03:32:25

*Thread Reply:* But I'm getting an error. Is there a way I can see how we are deriving a new column from the previous dataset?

Michael Robinson (michael.robinson@astronomer.io)
2023-11-28 13:06:08

@channel Friendly reminder: our first Warsaw meetup is happening tomorrow at 5:30 PM CET (8:30 AM PT) — and it’s hybrid https://openlineage.slack.com/archives/C01CK9T7HKR/p1700679449568039

slackbot
2023-11-29 06:30:57

This message was deleted.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-29 07:03:57

*Thread Reply:* Hey Shahid, I think we already discussed it here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1700648333528029?thread_ts=1700648333.528029&cid=C01CK9T7HKR

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 09:09:00

Hi. This code is running on my local Ubuntu (both in and out of the virtual environment). I have installed Airflow in a virtual environment in Ubuntu. It's not getting executed on Airflow. I'm getting the following error:
```
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://10.0.2.15:4041 --jars /home/haneefa/Downloads/openlineage-spark-1.5.0.jar --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py 1. Error code is: 1.
[2023-11-29, 13:51:21 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=spark_dagf, task_id=spark_submit_query1, execution_date=20231129T134548, start_date=20231129T134723, end_date=20231129T135121
```
The spark-submit is working fine on my Ubuntu.
```
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    # ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    # files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             # .config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config("spark.sql.warehouse.dir", warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save DataFrame 2 to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.mode("overwrite").saveAsTable("temp_table")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.mode("overwrite").saveAsTable("temp_table2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

```
spark-submit --master spark://10.0.2.15:4041 --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars
```

This is my DAG:
```
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'admin',
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'spark_dagf',
    default_args=default_args,
    schedule_interval=None,
)

# Set up SparkSubmitOperator for each query
queries = ['query1']
previous_task = None

for query in queries:
    task_id = f'spark_submit_{query}'
    script_path = '/home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py'
    query_num = queries.index(query) + 1

    spark_task = SparkSubmitOperator(
        task_id=task_id,
        application=script_path,
        name=f"SparkScript_{query}",
        conn_id='spark_2',
        jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
        application_args=[str(query_num)],
        dag=dag,
    )

    if previous_task:
        previous_task >> spark_task

    previous_task = spark_task

if __name__ == "__main__":
    dag.cli()
```

Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 10:36:45

@channel Today’s hybrid meetup with Google starts in about one hour! DM me for the link. https://openlineage.slack.com/archives/C01CK9T7HKR/p1701194768476699

David Lauzon (davidonlaptop@gmail.com)
2023-11-29 16:35:03

*Thread Reply:* Thanks for organizing this meetup and sharing it online. Very good quality of talks btw !

Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 16:37:26

*Thread Reply:* Thanks for coming! @Jens Pfau did most of the heavy lifting

🙌 Paweł Leszczyński
Jens Pfau (jenspfau@google.com)
2023-11-30 04:52:17

*Thread Reply:* Thank you to everyone who turned up! I noticed there was a question from @Sheeri Cabral (Collibra) that we didn't answer: Can someone point me to information on the collector that sends to S3? I'm curious as to the implementation (e.g. does each api push result in one S3 file? so the bucket ends up having millions of files? Does it append to a file? are there rules (configurable or otherwise) about rotating a file or directory? etc

I didn't quite catch the context of that. @Paweł Leszczyński did you?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-30 06:46:27

*Thread Reply:* There was a question about querying lineage events in warehouses. Julien confirmed that, in the case of hundreds of thousands or even millions of events, this could still be accomplished within Marquez, as PostgreSQL, its backend, should handle this.

I was referring to the fluentd openlineage proxy, which lets users copy the event and send it to multiple backends. Fluentd has a list of out-of-the-box output plugins including BigQuery, S3, Redshift and others (https://www.fluentd.org/dataoutputs)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-30 06:47:11

*Thread Reply:* Some more info about fluentd proxy: https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd
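On the file-rotation part of the question: with the fluentd proxy, the S3 behavior comes from fluentd's buffering rather than one file per event - each flushed buffer chunk becomes one S3 object. A rough sketch of a fluent-plugin-s3 match section (bucket, path and tag are placeholders, not from the proxy's actual config):
```
<match openlineage.**>
  @type s3
  s3_bucket my-lineage-bucket   # placeholder bucket name
  s3_region us-east-1
  path lineage/
  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600            # rotate: one chunk (one S3 object) per hour...
    timekey_wait 10m
    chunk_limit_size 256m   # ...or sooner if the chunk reaches this size
  </buffer>
</match>
```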

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-05 12:35:44

*Thread Reply:* This is extremely helpful, thanks @Paweł Leszczyński!!!

Stefan Krawczyk (stefan@dagworks.io)
2023-11-29 15:00:00

Hi all. I’m Stefan, and I’m after some directional advice:

Context:
• I drive an open source project called Hamilton. TL;DR: it’s an opinionated way to write python code to express dataflows, e.g. great for feature engineering, data processing, doing LLM workflows, etc. in python.
• One of the features is that you get “lineage as code”, e.g. you can get “column/dataframe level” lineage for pandas (or pyspark) code. So given a dataflow definition and an execution of a Hamilton dataflow, we can emit this code provenance for any artifacts generated, and this works wherever python runs.
Ask:
• As I understand it, OpenLineage was built more for artifact-to-artifact lineage, e.g. this table -> table -> ML model, etc. Question, for people who use/consume lineage: would this extra level of granularity (e.g. the code that ran to create an artifact) that we can provide with Hamilton be interesting to emit as part of an OpenLineage event? (e.g. see inside your python airflow task, or spark job). I’m trying to determine how to prioritize an OpenLineage integration, and whether someone would pick up Hamilton because of it.
• If you would find this extra level of granularity useful, could I schedule a call with you so I can learn more about your use case please? CC @Jakub Dardziński since I saw at the meetup that you deal with the airflow python operator & open lineage.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 09:57:11

*Thread Reply:* We try to include SourceCodeFacet when possible when emitting OpenLineage events; however, it's purely optional, as facets are, since we can't guarantee it will always be there - for example, it's not possible for us to get the actual SQL from spark-sql jobs.

✅ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 09:57:29

*Thread Reply:* Not sure if that answers your question 🙂

Stefan Krawczyk (stefan@dagworks.io)
2023-11-30 19:15:24

*Thread Reply:* @Maciej Obuchowski kind of. That’s tactical implementation advice and definitely useful.

My question is more around is that kind of detail actually useful for someone/what use case would it power/enable? e.g. if I integrate openlineage and emit that level of detail, would someone use it? If so, why?

Mandy Chessell (mandy.e.chessell@gmail.com)
2023-12-02 12:45:25

*Thread Reply:* @Stefan Krawczyk the types of use cases I have seen of this style is a business user/auditor that is not interested in how many jobs or intermediate stores are used in the end-to-end pipeline. They want to understand the transformations that the data underwent. For this type of user we extract details such as the source code facet, or facets that describe a rule or decision from the lineage graph and just display these transformations/decisions. If hamilton was providing the content for these types of facets, would they be consumable by such a user? Or perhaps, I should say, "could" they be consumed by such a user if the pipeline developer was careful to use meaningful variable names and comments.

Stefan Krawczyk (stefan@dagworks.io)
2023-12-02 14:07:42

*Thread Reply:* Awesome thanks @Mandy Chessell. That's useful context. What domain would this person be operating in? Yes I would think Hamilton could be useful here. It'll help standardize how the code is structured, and makes it easier to link an output with the code that created it.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:01:26

*Thread Reply:* @Stefan Krawczyk speaking as a vendor of lineage technology, our customers want to see the transformations - SQL if it’s SQL-based, or as much description as possible if it’s an ETL tool without direct SQL. (e.g. IBM DataStage might show a few stages like “group” and “aggregation” and show what field was used for grouping, etc).

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:03:50

*Thread Reply:* i have seen 5 general use cases for lineage:

  1. Data Provenance - “where the data comes from”. This is often used when explaining what data means - by showing where it came from. e.g. salary data - did it come from a Google form, or from an HR department? The most visceral example I have is when a VP sees 2 similar reports with different numbers. e.g. Total Sales but the reports have different numbers, and then the VP wants to know why, and how come we can’t trust the data, etc?

With lineage it’s easy to see that the data from report A had test data taken out and that’s why the sales $$ is less, but also that’s the accurate one.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:05:18

*Thread Reply:* 2. Impact Analysis - if data provenance is “what’s upstream”, impact analysis is “what’s downstream”. You want to change something and want to know what it might affect. Perhaps you want to decommission a table or a whole server. Perhaps you are expanding, e.g. you started off in US Dollars and now are adding Japanese Yen…you have a field called “Amount” and now you want to add another field called “currency” and update every report that uses “Amount”…..Impact analysis is for that use case.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:05:41

*Thread Reply:* 3. Compliance - you can show that your sensitive data stays in the sensitive places, because you can show where your data is flowing.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:06:31

*Thread Reply:* 4. Migration verification - Compare the lineage from legacy system A with the lineage from new system B. When they have parity, your migration is complete.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:07:03

*Thread Reply:* 5. Onboarding - lineage diagrams can be used like architecture diagrams to easily acquaint a user with what data you have, how it flows, what reports it goes to, etc.

Stefan Krawczyk (stefan@dagworks.io)
2023-12-05 00:42:27

*Thread Reply:* Thanks, that context is helpful @Sheeri Cabral (Collibra) !

❤️ Sheeri Cabral (Collibra), Jakub Dardziński
Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-12-11 05:08:31

*Thread Reply:* loved this thread, thanks everyone!

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 17:03:32

@channel The November issue of OpenLineage News is here! This issue covers the latest updates to the OpenLineage Airflow Provider, a recap of the meetup in Warsaw, recent releases, and much more. To get the newsletter directly in your inbox each month, sign up here.

👍 Paweł Leszczyński, Jakub Dardziński
Michael Robinson (michael.robinson@astronomer.io)
2023-12-04 15:34:26

@channel I’m opening a vote to release OpenLineage 1.6.0, including:
• a new JobTypeFacet containing additional job-related information to improve support for Flink and streaming in general
• an option for the Flink job listener to read from Flink conf
• in the dbt integration, a new command to send metadata of the last run without running the job
• bug fixes in the Spark and Flink integrations
• more.
Three +1s from committers will authorize. Thanks in advance.

➕ Jakub Dardziński, Harel Shein, Damien Hawes, Mandy Chessell
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 08:16:25

*Thread Reply:* Can we hold on for a day? I do have some doubts about this one https://github.com/OpenLineage/OpenLineage/pull/2293

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 08:50:44

*Thread Reply:* Our policy allows for 2 days, so there’s no problem as far as I’m concerned

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:32:29

*Thread Reply:* Thanks, all, the release is authorized and will be initiated within 2 business days.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 11:33:35

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2297 - this should resolve the problem. we don't have to wait anymore. Thank You @Michael Robinson

Sathish Kumar J (sathish.jeganathan@walmart.com)
2023-12-06 21:51:29

@Harel Shein thanks for the invite!

Harel Shein (harel.shein@gmail.com)
2023-12-06 21:52:39

*Thread Reply:* Welcome :)

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:24:52

Hi

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:27:51

*Thread Reply:* Hey Zacay!

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:28:51

Does anyone know where OpenLineage has support for column-level lineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:30:12

*Thread Reply:* it’s currently supported in Spark integration and for SQL-based Airflow operators

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:30:48

*Thread Reply:* and DBT?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:31:11

*Thread Reply:* DBT has only table-level lineage at the moment

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:36:28

*Thread Reply:* would you be interested in contributing/helping to add this feature? 🙂

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:38:12

*Thread Reply:* look at Gudo Soft

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:38:18

*Thread Reply:* it parses SQL

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:40:05

*Thread Reply:* I’m sorry but I didn’t get it. What does the parser provide that would be helpful in terms of dbt?

Sumeet Gyanchandani (sumeet.gyanchandani@gmail.com)
2023-12-07 07:11:47

*Thread Reply:* @Jakub Dardziński I played around with column-level lineage in Spark recently and also listening to OL events and converting them to Apache Atlas entities to upload in Azure Purview. Works like a charm 🙂

Flink not so much. Do you have column-level lineage in Flink yet or is it on the roadmap for future? Happy to contribute.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:12:29

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński know more about Flink 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:12:58

*Thread Reply:* it's awesome you got it working with Purview!

🙏 Sumeet Gyanchandani
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 07:19:58

*Thread Reply:* we're actively working with Flink team to have first-class integration for Flink - a lot of things like column-level lineage are unfortunately currently not available

🙌 Sumeet Gyanchandani
Sumeet Gyanchandani (sumeet.gyanchandani@gmail.com)
2023-12-07 09:09:33

*Thread Reply:* @Jakub Dardziński and @Maciej Obuchowski thank you for the prompt responses. I really appreciate it!

🙂 Jakub Dardziński
Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:21:13

Hi everyone, could someone help me with integrating OpenLineage with dbt? I'm particularly interested in sending events to a Kafka topic rather than using HTTP. Any guidance on this would be greatly appreciated.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:28:04

*Thread Reply:* Hey, it’s essentially described here: https://openlineage.io/docs/client/python

👍 Simran Suri
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:28:40

*Thread Reply:* in your case the best would be to set an environment variable to a file path: OPENLINEAGE_CONFIG=path/to/openlineage.yml.
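A minimal openlineage.yml for Kafka could look roughly like this (topic and broker are placeholders; it needs the Kafka client library installed - see the docs linked above for the exact options):
```
transport:
  type: kafka
  topic: openlineage.events             # placeholder topic name
  config:
    bootstrap.servers: localhost:9092   # passed through to the Kafka producer
  flush: true
```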

👍 Simran Suri
Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:29:43

*Thread Reply:* that will work with dbt, right? Would only setting up the environment variable be enough? Also, to get OL events I need to do dbt-ol run, correct?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:30:23

*Thread Reply:* setting env var + putting config file to pointed path

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:30:31

*Thread Reply:* and yes, you need to run it with dbt-ol wrapper script

Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:31:50

*Thread Reply:* Great, that's very helpful. I'll try the same and will definitely ask another question if I encounter any issues while trying this out.

👍 Jakub Dardziński
Damien Hawes (damien.hawes@booking.com)
2023-12-07 07:15:55

*Thread Reply:* @Jakub Dardziński - regarding the Python client, does it have a functionality similar to the Java client? For example, the Java client allows you to use the service provider interface to implement a custom ~client~ transport.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:17:37

*Thread Reply:* do you mean custom transport?

Damien Hawes (damien.hawes@booking.com)
2023-12-07 07:17:47

*Thread Reply:* Correct.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:21:23

*Thread Reply:* it sure does, there's mention of it in the link above
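For reference, a rough outline of a custom transport in Python - treat the attribute names as approximate, since they may differ slightly between client versions; check the docs linked above:
```
from openlineage.client.transport import Config, Transport

class PrintConfig(Config):
    @classmethod
    def from_dict(cls, params: dict) -> "PrintConfig":
        # build the config object from the `transport:` section of openlineage.yml
        return cls()

class PrintTransport(Transport):
    kind = "print"              # referenced as `type: print` in the config
    config_class = PrintConfig  # attribute name may vary by client version

    def __init__(self, config: PrintConfig) -> None:
        self.config = config

    def emit(self, event) -> None:
        # replace with whatever the custom backend needs to do with the event
        print(event)
```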

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:26:46

Hi All, im still struggling to get Lineage from Databricks Unity Catalog. Is anyone here extracting lineage from Databricks/UC successfully?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:27:09

*Thread Reply:* OL 1.5 is showing this error:

23/12/07 17:20:15 INFO PlanUtils: apply method failed with org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver at com.databricks.unity.UCSDriver$Manager.$anonfun$currentScopeId$3(UCSDriver.scala:131) at scala.Option.getOrElse(Option.scala:189) at com.databricks.unity.UCSDriver$Manager.currentScopeId(UCSDriver.scala:131) at com.databricks.unity.UCSDriver$Manager.currentScope(UCSDriver.scala:134) at com.databricks.unity.UnityCredentialScope$.currentScope(UnityCredentialScope.scala:100) at com.databricks.unity.UnityCredentialScope$.getSAMRegistry(UnityCredentialScope.scala:120) at com.databricks.unity.SAMRegistry$.registerSAM(SAMRegistry.scala:307) at com.databricks.unity.SAMRegistry$.registerDefaultSAM(SAMRegistry.scala:323) at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.defaultTablePath(SessionCatalog.scala:1200) at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.defaultTablePath(ManagedCatalogSessionCatalog.scala:991) at io.openlineage.spark3.agent.lifecycle.plan.catalog.AbstractDatabricksHandler.getDatasetIdentifier(AbstractDatabricksHandler.java:92) at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361) at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536) at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:63) at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:46) at io.openlineage.spark3.agent.utils.PlanUtils3.getDatasetIdentifier(PlanUtils3.java:79) at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:144) at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.lambda$apply$3(CreateReplaceOutputDatasetBuilder.java:116) at java.util.Optional.map(Optional.java:215) at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:114) at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:60) at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:39) at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:94) at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:85) at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279) at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:75) at 
java.util.Optional.map(Optional.java:215) at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:67) at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:39) at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$23(OpenLineageRunEventBuilder.java:451) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313) at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:410) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:298) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:281) at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:238) at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:126) at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98) at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:114) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:114) at 
scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:109) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:105) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1660) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:105)

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:27:41

*Thread Reply:* Event output is showing up empty.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-11 08:21:08

*Thread Reply:* Anyone with the same issue or no issues at all regarding Unity Catalog?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:58:55

@channel We released OpenLineage 1.6.2, including:
• Dagster: support Dagster 1.5.x #2220 @tsungchih
• Dbt: add a new command dbt-ol send-events to send metadata of the last run without running the job #2285 @sophiely
• Flink: add option for Flink job listener to read from Flink conf #2229 @ensctom
• Spark: get column-level lineage from JDBC dbtable option #2284 @mobuchowski
• Spec: introduce JobTypeJobFacet to contain additional job-related information #2241 @pawel-big-lebowski
• SQL: add quote information from sqlparser-rs #2259 @JDarDagran
• bug fixes, tests, and more.
Thanks to all the contributors, including new contributors @tsungchih and @ensctom!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.6.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.5.0...1.6.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

Jorge (jorge.varona@nike.com)
2023-12-07 13:01:27

👋 Hi everyone! Just lurking to get insights into how I might implement this at my company to solve our lineage challenge.

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:02:06

*Thread Reply:* 👋 Hi Jorge, welcome! Can you tell us a bit about your use case?

Jorge (jorge.varona@nike.com)
2023-12-07 13:03:51

*Thread Reply:* Large company with a big ball of mud data ecosystem. There is active debate on doubling down on a vendor (probably won't work) or doing the work to instrument our jobs/tooling in an agnostic way. I'm inclined to do the latter but want to understand what the implementation may look like.

Harel Shein (harel.shein@gmail.com)
2023-12-07 14:44:04

*Thread Reply:* welcome! please feel free to ask any questions as they arise

👍 Jorge
Simran Suri (mailsimransuri@gmail.com)
2023-12-08 05:59:40

Hello everyone, I'm currently utilizing the OpenLineage JAR version - openlineage_spark_1_1_0.jar to capture lineage information in my environment. I have a specific requirement to capture facets related to Input Datasets, especially focusing on Data Quality Metrics facets such as row count information for the input data, as well as Output Dataset facets encompassing output statistics, like row count information for the output data.

Although the tags "inputFacets":{}}] and "outputFacets":{}}] seem to be enabled in the event, the values within these tags are not reflecting the expected information; they always seem to be blank.

This Setup involves Databricks, and the cluster's Spark version is Apache Spark 3.3.2. and I've configured the OpenLineage setup in the Global Init scripts within the Databricks workspace.

Would greatly appreciate it if someone could provide guidance or insight into this issue.

Harel Shein (harel.shein@gmail.com)
2023-12-08 11:40:47

*Thread Reply:* can you turn on the debug facet and share an example event so we can try to help? spark.openlineage.debugFacet should be set to enabled. This is from https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:22:03

*Thread Reply:* Hi @Harel Shein, sure.

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:23:59

*Thread Reply:* This text file contains a total of 10-11 events, including the start and completion events of one of my notebook runs. The process is simply reading from a Hive location and performing a full load to another Hive location.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-11 03:32:40

*Thread Reply:* @Simran Suri do you get any cluster logs with an error? im running a newer version of OL jar and im getting inputs and outputs from hive (but not for Unity Catalog)

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:54:28

*Thread Reply:* No @Rodrigo Maia, I can't see any errors there

Harel Shein (harel.shein@gmail.com)
2023-12-12 16:05:18

*Thread Reply:* thanks for adding that! per the docs here, did you extend the Input/OutputDatasetFacetBuilder with anything to track data quality metrics?

Simran Suri (mailsimransuri@gmail.com)
2023-12-13 00:49:01

*Thread Reply:* Actually no, I didn't try this out. Can you give me a bit more detail on how it can be extended? Do I need to add some configs?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-08 15:04:43

@channel This month’s TSC meeting is next Thursday the 14th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• proposal updates
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

Joey Mukherjee (joey.mukherjee@swri.org)
2023-12-09 18:27:05

Hi! I'm interested in using OpenLineage to track data files in my pipeline. Basically, I receive a data file and run some other Python code + ancillary files to produce other data files which are then used to produce more data files and onward. Each level of files is versioned and I would like to track this lineage. I don't use Spark, Airflow, dbt, etc. I do use Prefect though. My wrapper script is Python. Is OpenLineage appropriate? Seems like it... is my "DataSet" every individual file that I produce? I think I have to write my own integration and facet. Is that also correct? Any other advice? Thanks!

Harel Shein (harel.shein@gmail.com)
2023-12-10 12:09:47

*Thread Reply:* Hi @Joey Mukherjee, welcome to the community! This use case would work using the OpenLineage spec. You are right, unfortunately we don’t currently have a Prefect integration, but we’d be happy to support you if you chose to write it! 🙂 I don’t know enough about Prefect to say what would be the right model to map to. Have you looked at the OpenLineage data model? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
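For a custom file-based pipeline like that, a minimal sketch with the Python client could look like this (namespaces, job and file names are placeholders):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez instance

run = Run(runId=str(uuid4()))
job = Job(namespace="my-pipeline", name="process_level1_files")
producer = "my-pipeline-wrapper"

# START when the step begins; inputs are whatever files it consumes
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="file", name="/data/in/l1_v3.dat")],
))

# COMPLETE once the produced files are known; events are cumulative
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    outputs=[Dataset(namespace="file", name="/data/out/l2_v3.dat")],
))
```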

👍 Jakub Dardziński
Athitya Kumar (athityakumar@gmail.com)
2023-12-11 08:12:40

Hey team.

In the spark integration: When we do a spark.jdbc() or spark.read() from a JDBC connection like mysql/postgres etc, does OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc) in the OL events or not?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-11 09:32:14

*Thread Reply:* > OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc) I think those are part of dataset namespace? Probably not username tho

Athitya Kumar (athityakumar@gmail.com)
2023-12-11 12:34:46

*Thread Reply:* Ack, and are these spark.jdbc inputs being captured for spark 3.x onwards or for spark 2.x as well?

Athitya Kumar (athityakumar@gmail.com)
2023-12-11 12:34:55

*Thread Reply:* @Maciej Obuchowski ^

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-11 12:53:25

*Thread Reply:* I would assume not for Spark 2

David Goss (david.goss@matillion.com)
2023-12-11 11:12:27

❓ Are there any particular standards/conventions around how to express data types in the dataset schema facet? The examples I’ve seen have been just types like integer etc. I think it would be useful for certain types to include modifiers for size, precision etc, so like varchar(50) and stuff like that. Would it just be a case of, stick to how the platform (mysql, snowflake, whatever) expresses it in DDL?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-11 12:54:12

*Thread Reply:* We do express what's in DDL basically. It's database specific anyway
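So for the varchar example, the schema facet would just carry the type string verbatim, roughly like this (illustrative snippet; the name/type field names come from the SchemaDatasetFacet spec):
```
"schema": {
  "fields": [
    {"name": "customer_name", "type": "VARCHAR(50)"},
    {"name": "amount", "type": "DECIMAL(18,2)"}
  ]
}
```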

👍 David Goss
harsh loomba (hloomba@upgrade.com)
2023-12-11 16:31:08

QQ - we have a multi-cluster Redshift architecture where there is a possibility that table meta-information could exist in a different cluster. The way I see it, the extractor requires table meta-information to create the lineage, right? Currently I don't see the tables that are outside of that cluster in my input datasets. Any thoughts?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-11 16:34:13

*Thread Reply:* I’m not entirely sure how multi-cluster Redshift works. AFAIK the solution would be to take advantage of SVV_REDSHIFT_COLUMNS
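For context, SVV_REDSHIFT_COLUMNS exposes column metadata across the databases reachable from a cluster (including datashares), so metadata for tables outside the current cluster can be looked up along these lines (schema/table names are placeholders):
```
SELECT database_name, schema_name, table_name, column_name, data_type, ordinal_position
FROM svv_redshift_columns
WHERE schema_name = 'analytics'   -- placeholder schema
  AND table_name = 'orders'       -- placeholder table
ORDER BY ordinal_position;
```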

👀 harsh loomba
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-11 16:34:31

*Thread Reply:* I’ve got opened PR for Airflow OL provider here: https://github.com/apache/airflow/pull/35794

👀 harsh loomba
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 04:56:15

*Thread Reply:* hey @harsh loomba, did you have a chance to check above out?

harsh loomba (hloomba@upgrade.com)
2023-12-14 11:50:53

*Thread Reply:* I did check, looks promising for the problem statement we have, right @Willy Lulciuc?

Willy Lulciuc (willy@datakin.com)
2023-12-14 14:30:44

*Thread Reply:* @harsh loomba yep, it does!

Willy Lulciuc (willy@datakin.com)
2023-12-14 14:31:20

*Thread Reply:* thanks @Jakub Dardziński for the quick fix!

harsh loomba (hloomba@upgrade.com)
2023-12-14 14:31:44

*Thread Reply:* ohhh great

harsh loomba (hloomba@upgrade.com)
2023-12-14 14:31:58

*Thread Reply:* thanks!

harsh loomba (hloomba@upgrade.com)
2023-12-14 14:32:07

*Thread Reply:* hopefully it will added in upcoming OL release

Joey Mukherjee (joey.mukherjee@swri.org)
2023-12-11 19:47:37

I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that? Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-12 05:37:04

*Thread Reply:* > I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that? Idea of OpenLineage is to handle this kind of events - the events are generally ment to be cumulative. As you said, inputs or outputs can be not known at start time, but there can be opposite situation. You're reading version of the dataset that has been changing in the meantime, and you know the particular version only on start.

That being said, the actual handling of those events depends on the consumer itself, so it depends on what you're using.

> Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations. It's probably consumer specific - hard to tell without knowledge of what you're using.

Athitya Kumar (athityakumar@gmail.com)
2023-12-13 03:06:36

Hey folks. Is there a way to disable column-level lineage / "schema" facet in inputs/outputs for spark integration?

Basically to have just table-level lineage and disable column-level lineage

➕ Anirudh Shrinivason
Athitya Kumar (athityakumar@gmail.com)
2023-12-13 03:22:17

*Thread Reply:* cc @Honey Thakuria

Damien Hawes (damien.hawes@booking.com)
2023-12-13 03:47:20

*Thread Reply:* The spark integration provides the ability to disable facets, via the spark.openlineage.facets.disabled configuration.

You provide values like this: spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan;<more>]

👍 Jakub Dardziński, Paweł Leszczyński
Athitya Kumar (athityakumar@gmail.com)
2023-12-13 06:25:24

*Thread Reply:* Right, but is there a specific facet we can disable here to get table-level lineage but skip column-level lineage in the events?

Damien Hawes (damien.hawes@booking.com)
2023-12-13 06:44:00

*Thread Reply:* The facet name that you want to disable is "columnLineage" in this case
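So, putting it together:
```
spark.openlineage.facets.disabled=[columnLineage]
```
That keeps table-level inputs/outputs while dropping the column-level lineage facet.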

rajeshree parmar (rajeshreedatavizz@gmail.com)
2023-12-13 05:05:07

Hi, I want to use OpenLineage on my platform - how can I use it? I already have metadata and profiler pipelines; how do I integrate?

Kacper Muda (kacper.muda@getindata.com)
2023-12-13 08:33:36

Hi, i'd like to request a release (patch 1.6.3) that will include this PR: #2305 . It would help people using OL with Airflow integration (with Airflow version 2.6).

👍 Harel Shein, Maciej Obuchowski
➕ Jakub Dardziński
Michael Robinson (michael.robinson@astronomer.io)
2023-12-14 08:40:03

*Thread Reply:* Thanks for requesting a release. It is authorized and will be initiated within 2 business days (not including Friday).

Kacper Muda (kacper.muda@getindata.com)
2023-12-14 08:57:34

*Thread Reply:* Perfect, thank you !

Joey Mukherjee (joey.mukherjee@swri.org)
2023-12-13 09:59:56

I have a question/misunderstanding about the output section from Marquez under I/O. For one of my Jobs, the outputs are the right output plus all of the inputs from the previous Job in the pipeline. I'm not sure why, but how do I find the cause of this? I feel like I did everything correctly. To be clear, I am using a custom pipeline using files and the example code from the Python section of the OpenLineage docs.

Joey Mukherjee (joey.mukherjee@swri.org)
2023-12-13 13:18:15

*Thread Reply:* I notice only one of my three jobs has Run information. I have only six events, and all three have START and COMPLETE states. From what I can tell, the information in the events list is correct, but the GUI is not right. Open to any ideas on how to debug!

Michael Robinson (michael.robinson@astronomer.io)
2023-12-13 11:34:15

@channel This month’s TSC meeting is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1702065883107479

} Michael Robinson (https://openlineage.slack.com/team/U02LXF3HUN7)
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:01:32

Hi team, I noticed this error in some pipelines on OL version 0.30.1:
```
ERROR ColumnLevelLineageUtils: Error when invoking static method 'buildColumnLineageDatasetFacet' for Spark3
java.lang.reflect.InvocationTargetException
    at jdk.internal.reflect.GeneratedMethodAccessor469.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at io.openlineage.spark.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:35)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildOutputDatasets$21(OpenLineageRunEventBuilder.java:434)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:447)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:306)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:232)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:70)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91)
    at java.base/java.util.Optional.ifPresent(Optional.java:183)
    ...
```
May I check if this has already been reported/been fixed in the later releases of OL? Thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 02:35:34

*Thread Reply:* No, it wasn't reported. Are you able to reproduce the same with a recent OpenLineage version?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:36:47

*Thread Reply:* I have not tried actually... let me try that and get back if it persists. Thanks

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 02:37:54

*Thread Reply:* just to make sure: this is just an error in the logs and should not prevent the OL event from being generated (except for column-level lineage), right?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:45:04

*Thread Reply:* Yeah, but column-level lineage is what we'd actually want to capture, so I'm wondering why this error was being thrown

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 02:46:03

*Thread Reply:* sure, could you also paste the end of the stacktrace?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:48:28

*Thread Reply:* Yup, let me get that

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:49:43

*Thread Reply:* ```
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
    at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61)
    at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61)
    ... (the six frames above repeat as InputFieldsCollector recurses through the plan)
    at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:72)
    at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$null$2(ColumnLevelLineageUtils.java:83)
    at java.base/java.util.Optional.ifPresent(Optional.java:183)
    at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$collectInputsAndExpressionDependencies$3(ColumnLevelLineageUtils.java:80)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:174)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175)
    ... (the TreeNode.foreach frames repeat)
    at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:76)
    at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:42)
    ... 35 more
```

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:45:58

Another question: I noticed that in a few cases lineage is not being captured when running df.toPandas() via PySpark, then doing some pandas operations on the result and writing it back to an S3 location. May I check if this is expected?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 02:47:18

*Thread Reply:* this is something we haven't tested nor included in our tests. not sure what happens when Spark data goes to pandas.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-14 02:48:18

*Thread Reply:* Got it. Thanks!

Damien Hawes (damien.hawes@booking.com)
2023-12-15 04:39:45

*Thread Reply:* I can speak from experience, because we had a similar issue in our custom Spark listener: toPandas breaks lineage because toPandas is like calling collect, which forces the data in the Spark DataFrame to the driver and pipes that data to the running Python sidecar process. Whatever you do afterwards, you're running code in a private memory space, and Spark has no way of knowing what you're doing.
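A minimal sketch of the failure mode (paths and column names here are made up): the read is the last thing the listener sees; everything after toPandas() happens in plain Python.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://bucket/input/")  # visible to Spark listeners via the logical plan
pdf = df.toPandas()                            # collects the data to the driver's Python process
pdf["total"] = pdf["a"] + pdf["b"]             # pandas work happens outside Spark's view
pdf.to_parquet("s3://bucket/output/")          # this write never reaches the listener: no output lineage
```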

👍 Anirudh Shrinivason
:gratitude_thank_you: Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-29 03:41:29

*Thread Reply:* Just wondering, is it even technically possible to get lineage in this case, where pipelines use toPandas?

Damien Hawes (damien.hawes@booking.com)
2024-01-08 04:46:36

*Thread Reply:* To an extent - yes. Though not necessarily with the OL connector. As hinted previously, we had to recently update our own listener that we wrote years ago in order to publish lineage for subquery alias / project operations from collect / toPandas operations.

Athitya Kumar (athityakumar@gmail.com)
2023-12-14 04:58:57

We're noticing significant delays in a few Spark jobs: the OpenLineage Spark listener seems to keep running even after the Spark/YARN application has completed and requested a graceful exit. Even after the application has ended, we see a couple of events still being processed, and each event takes around 5-6 mins (rarely we have seen 9 mins as well):

```
[2023-12-13 22:52:26.834 -0800] [INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerJobEnd(12,1702535631760,JobSucceeded) by listener OpenLineageSparkListener took 385.396979168s.
```

This kinda results in our actual jobs taking 30+ mins to be marked as completed, which impacts SLA.

Has anyone faced this issue, and any tips on how we can debug which event is causing this exact 5-6 min delay / which method in OpenLineageSparkListener is taking time?

Athitya Kumar (athityakumar@gmail.com)
2023-12-14 10:59:48

*Thread Reply:* Ping @Paweł Leszczyński @Maciej Obuchowski ^

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-14 11:56:17

*Thread Reply:* Can you try disabling the logical plan and spark_unknown facets? The only thing I can think of is serializing extremely large logical plans
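For reference, a minimal sketch of disabling those facets via Spark conf (property name as per the OL Spark integration docs; double-check it against your OL version):
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         # semicolon-separated facet names wrapped in square brackets
         .config("spark.openlineage.facets.disabled", "[spark.logicalPlan;spark_unknown]")
         .getOrCreate())
```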

👍 Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 12:32:33

*Thread Reply:* yes, omitting the plan and its serialization can be the first thing to try

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-14 12:33:27

*Thread Reply:* the other one would be to verify that the backend is responding in a reasonable time

Athitya Kumar (athityakumar@gmail.com)
2023-12-14 13:57:22

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - These 2 facets have already been disabled - but what we see is that the event is still too huge for the OpenLineage Spark listener to realistically process in time.

For example, we have a job that reads from an S3 directory with lots of dirs/files - resulting in 1041 inputs × 8 columns per output 😅

Is there a max-limit config we can set on OpenLineage to only parse events if the Spark event size is < x MB, or the # of inputs+outputs is < y, or something like that?

Athitya Kumar (athityakumar@gmail.com)
2023-12-14 14:06:34

*Thread Reply:* And also, when we disable facets like spark.logicalPlan / schema / columnLineage - does it mean that that part of the event is not read from Spark at all, or is it still read and just skipped while generating/emitting the OL event?

Basically, if we have a very huge Spark event, would disabling facets help, or would it still take a lot of time?

Athitya Kumar (athityakumar@gmail.com)
2023-12-15 12:48:57

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - WDYT? ^

In our use-case, we figured out the scenario that caused this huge-event issue. We had a job that did spark.read from s3://bucket/dir1/dir2/dir3_*/* (with a file-level wildcard) instead of s3://bucket/dir1/dir2/dir3_*/ - we were able to remove the wildcard and fix this issue.

But I think this is something that should be handled on the Spark listener side, so that we're not really dependent on the patterns of the Spark job code itself 😄

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-18 03:59:46

*Thread Reply:* That's a problem that pops up from time to time. So far, we were never able to come up with a general solution like a circuit breaker.

We rather solve the problems after they're identified. In the past we added an option to prevent sending the serialized LogicalPlan, as well as to trim it if it exceeds a certain number of kilobytes.

What caused the issue here? Was it the event being too large, or Spark OpenLineage internals trying to resolve the * wildcard and causing long-lasting backend calls to S3?

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-12-14 06:22:28

Hi. I am trying to run a Spark script through an Airflow DAG. In the Spark script I took a sample CSV file, created a DataFrame, made some transformations, and then saved the result as a CSV file. I am not able to see any lineage information. Please do let me know if there is any way I can see lineage information. Here is my Spark script for reference:
```
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    #ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    #files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder
             .master('local[*]')
             .appName('openlineage_spark_test')
             #.config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config("spark.sql.warehouse.dir", warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.saveAsTable("temp_tb1")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_tb1")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.saveAsTable("temp_tb2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    # Read the saved DataFrame in parquet format
    #parquet_df = spark.read.parquet(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```
The spark-submit commands:
```
spark-submit --master --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars
```

Abdallah (abdallah@terrab.me)
2023-12-14 09:28:27

*Thread Reply:* Do you run your script through a spark-submit?

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-12-16 14:12:08

*Thread Reply:* yes I do. Here is my Airflow DAG code:
```
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'abc_spark_dag_edit',
    default_args=default_args,
    description='A simple Airflow DAG to run the provided Spark script',
    schedule_interval='@once',
)

spark_task = SparkSubmitOperator(
    task_id='run_spark_script',
    application='/home/haneefa/airflow/dags/custom_operators/spark_edit.py',  # path to your Spark script
    name='example_spark_job',
    conn_id='spark2',  # Spark connection ID configured in Airflow
    jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
    verbose=False,
    dag=dag,
)

spark_task
```
Please do let me know if I can do anything.

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-12-18 02:30:50

*Thread Reply:* Hi. Just checking in. please do let me know if I can try anything.

Abdallah (abdallah@terrab.me)
2024-01-03 08:16:36

*Thread Reply:* Can you share the driver logs, please?

Abdallah (abdallah@terrab.me)
2024-01-03 08:17:06

*Thread Reply:* I see that you are mixing different types of transport:
```
...
.config('spark.openlineage.host', 'http://localhost:5000')
...
.config('spark.openlineage.transport.type', 'console')
...
```
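A minimal sketch of a consistent setup - pick one transport (property names assumed from the OL Spark transport config; verify for your version):
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         # either send events over HTTP...
         .config("spark.openlineage.transport.type", "http")
         .config("spark.openlineage.transport.url", "http://localhost:5000")
         # ...or print them to the driver logs instead (not both):
         # .config("spark.openlineage.transport.type", "console")
         .getOrCreate())
```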

Mariusz Górski (gorskimariusz13@gmail.com)
2023-12-15 03:38:06

hey, I have a question re the OpenLineage Spark integration. According to the docs, while configuring the Spark session the following parameters (amongst others) can be passed:
• spark.openlineage.parentJobName
• spark.openlineage.appName
• spark.openlineage.namespace
What I understand from the OL spec is that the first parameter (parentJobName) would go into the facets section of every lineage event, while spark.openlineage.appName would replace the spark.appName portion of the job.name property of the lineage event (and indeed that's what we are observing). spark.openlineage.namespace would materialize as job.namespace in lineage events (that also seems accurate). What I also found in the documentation is that the parent facet is used to materialize a job (like an Airflow DAG) under which a given event is emitted. That was also my expectation about parentJobName; however, after configuring this option, the value I set is nowhere to be found in the output lineage events. The question is then - is my understanding of this property wrong (and if so, what is the correct one), or is this value not being propagated properly to lineage events?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 04:19:51

*Thread Reply:* hi 😉 to sum up: you're setting the parentJobName property and you don't see it anywhere within the content of the OL event. ~Looking at the code, the facet should be attached to job.~

👍 Mariusz Górski
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 06:07:26

*Thread Reply:* Looking at the code, the facet should be ingested into the event within this method https://github.com/OpenLineage/OpenLineage/blob/203052d663c4cd461ec38787434fc53a57[…]enlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java

However, I don't see any integration test for this. Feel free to create an issue for this in case parent job name is still missing.

Athitya Kumar (athityakumar@gmail.com)
2023-12-15 13:53:47

*Thread Reply:* +1, I've noticed this too - the namespace & appName from spark conf reflects in the openlineage events, while the parentJobName doesn't.

But as a workaround, we kinda used namespace as a JSON string & provided the parentJobName as a key in this JSON

Mariusz Górski (gorskimariusz13@gmail.com)
2023-12-16 04:10:49

*Thread Reply:* indeed you can put anything into the namespace, but this is a workaround and we'd like to have a generic, OL-spec-compliant approach. So far I've checked the tests, and the parent facet is available in some of them (like the BQ write integration test) but not in others, so I'm still not sure why it's sometimes there and sometimes not. Will keep digging.

Mariusz Górski (gorskimariusz13@gmail.com)
2023-12-16 09:23:28

*Thread Reply:* ok I figured it out 🙂 tl;dr: when you use parentJobName you also need to define parentRunId (this implicit relationship could be better documented), and the tricky part is that parentRunId needs to be a proper UUID, not just a random string (which was a wrong assumption on my part; again, an improvement to the docs is also required). I will update the docs in the upcoming week based on this discovery. I tested this, and after making the changes described above the parent facet is visible in Spark OL events 🙂
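A minimal sketch of what that looks like in Spark conf (the job name is hypothetical; the point is that parentRunId must parse as a UUID):
```
import uuid

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .config("spark.openlineage.namespace", "my-namespace")
         .config("spark.openlineage.parentJobName", "my_dag.my_task")  # hypothetical parent job
         .config("spark.openlineage.parentRunId", str(uuid.uuid4()))   # must be a proper UUID
         .getOrCreate())
```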

🙌 Jakub Dardziński
Mariusz Górski (gorskimariusz13@gmail.com)
2023-12-27 04:11:27

*Thread Reply:* https://github.com/OpenLineage/docs/pull/268 took a little longer but there it is. I've also changed the proposal for the example Airflow task metadata so it's aligned with how the parent run facet is populated in Airflow since OL 1.7.0 🙂

Simran Suri (mailsimransuri@gmail.com)
2023-12-17 11:33:34

Hi everyone, I have a question about Airflow's job-to-job level lineage. I've successfully obtained OpenLineage events for Airflow DAGs and now I'm looking to create inter-DAG dependencies (task-to-task), such as mapping D1.t1->D1.t2->D1.t3->D2.t1->D2.t2->D2.t3. Could you please advise on which fields I should consider to achieve this level of mapping?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-17 15:34:24

*Thread Reply:* I’m not sure what you’re trying to achieve

would you like to set relationships both between Airflow task <-> task and DAG <-> DAG?

Simran Suri (mailsimransuri@gmail.com)
2023-12-17 15:43:12

*Thread Reply:* Yes, I'm trying to set a relationship between both of them. Suppose I've 2 DAGs, D1 and D2, with 2 tasks in it. So the lineage would be D1.t1 -> D1.t2 (TriggerDagRunOperator) ->D2.t1->D2.t2 So in this way, I'll be able to mark an inter DAG dependency and task dependency also within the same DAG

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-17 15:55:47

*Thread Reply:* in terms of Marquez, the relationship between jobs is established by common input/output datasets. There's also a parent/child relationship (e.g. between DAG and task). There's actually no facet that would point directly between two jobs; that would require some sort of customization.
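To illustrate, a sketch of two events trimmed to the relevant fields (all names made up): the shared dataset is what lets Marquez draw the job-to-job edge.
```
# event emitted by the producing job D1.t2
complete_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "airflow", "name": "D1.t2"},
    "outputs": [{"namespace": "s3://bucket", "name": "handoff_table"}],
}

# event emitted by the consuming job D2.t1
start_event = {
    "eventType": "START",
    "job": {"namespace": "airflow", "name": "D2.t1"},
    "inputs": [{"namespace": "s3://bucket", "name": "handoff_table"}],
}
# Marquez links D1.t2 -> D2.t1 because "handoff_table" is an output of one
# job and an input of the other.
```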

Simran Suri (mailsimransuri@gmail.com)
2023-12-17 16:12:28

*Thread Reply:* @Jakub Dardziński In terms of Airflow OpenLineage events, I can see task-level dependencies within the same DAG in the events. But where I have inter-DAG dependencies via TriggerDagRunOperator, I can't see that level of information in the lineage events, as mentioned here.

I'm not able to find the upstream dependencies of a DAG.

Parkash Pant (ppant@tucowsinc.com)
2023-12-17 21:24:16

Hi Everyone, need help! I am new to OpenLineage and Marquez and I am trying to test it with our local installation of Airflow 2.6.3. Both Airflow and Marquez are running in separate Docker containers, and I have installed the openlineage-airflow integration in Airflow and set OPENLINEAGE_URL and OPENLINEAGE_NAMESPACE. However, upon successfully running a DAG, Airflow is not able to emit the OpenLineage event. Below is the error msg. I have also cross-checked that the Marquez API is listening at localhost:5000.
```
[2023-12-17T20:20:18.489+0000] {{adapter.py:98}} ERROR - Failed to emit OpenLineage event of id f0d30e1a-30cc-3ce5-9cbb-1d3529d5e206
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/airflow/adapter.py", line 95, in emit
    return self.client.emit(event)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/client.py", line 102, in emit
    self.transport.emit(event)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/transport/http.py", line 159, in emit
    resp = session.post(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused'))
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-18 02:28:33

*Thread Reply:* try with host.docker.internal instead of localhost
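e.g., assuming the default Marquez API port, something like:
```
OPENLINEAGE_URL=http://host.docker.internal:5000
```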

Parkash Pant (ppant@tucowsinc.com)
2023-12-18 11:51:48

*Thread Reply:* It worked. Thanks!

Michael Ourch (michael.ourch@qonto.com)
2023-12-18 05:02:29

Hey everyone, I started experimenting with OpenLineage on my stack (dbt + Snowflake). The data lineage works great but I could not get the column lineage feature working. Is it only implemented for Spark at the moment? Thanks 🙏

✅ Michael Ourch
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-18 05:03:54

*Thread Reply:* it works for Spark (with the exception of JDBC) and Airflow SQL-based operators (except the BigQuery operator)

🙏 Michael Ourch
Michael Ourch (michael.ourch@qonto.com)
2023-12-18 08:20:18

*Thread Reply:* Thanks @Jakub Dardziński for the clarification 🙏

Daniel Henneberger (me@danielhenneberger.com)
2023-12-18 16:16:30

Hey y'all, I'm trying out the Flink OpenLineage integration with Marquez. I cloned Marquez, did a ./docker/up.sh, and configured Flink using the yaml. However, when it tries to emit an event, I get:

```
ERROR io.openlineage.flink.client.EventEmitter - Failed to emit OpenLineage event:
io.openlineage.client.OpenLineageClientException: code: 422, response: {"errors":["job.facets.jobType.integration must not be null"]}
    at io.openlineage.client.transports.HttpTransport.throwOnHttpError(HttpTransport.java:150) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:124) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:111) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:46) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
```

Is there something else I need to configure?

Daniel Henneberger (me@danielhenneberger.com)
2023-12-18 17:22:39

I opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2324

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-19 03:07:34

*Thread Reply:* great finding, sorry for this. this should help: https://github.com/OpenLineage/OpenLineage/pull/2325

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-19 00:19:59

Hi team, Noticed this OL error for spark 3.4.1:
```
23/12/15 09:51:35 ERROR PlanUtils: Apply failed:
java.lang.NoSuchMethodError: 'java.lang.String org.apache.spark.sql.execution.datasources.PartitionedFile.filePath()'
    at io.openlineage.spark.agent.util.PlanUtils.lambda$null$4(PlanUtils.java:241)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.openlineage.spark.agent.util.PlanUtils.findRDDPaths(PlanUtils.java:248)
    at io.openlineage.spark.agent.lifecycle.plan.AbstractRDDNodeVisitor.findInputDatasets(AbstractRDDNodeVisitor.java:42)
    at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:43)
    at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:22)
    at io.openlineage.spark.agent.util.PlanUtils$1.lambda$apply$2(PlanUtils.java:99)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:115)
    at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:79)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:30)
    at scala.PartialFunction$AndThen.applyOrElse(PartialFunction.scala:194)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$visitLogicalPlan$14(OpenLineageRunEventBuilder.java:400)
    at io.openlineage.spark.agent.util.ScalaConversionUtils$3.apply(ScalaConversionUtils.java:131)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1(TreeNode.scala:305)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1$adapted(TreeNode.scala:305)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.map(TreeNode.scala:305)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildInputDatasets$6(OpenLineageRunEventBuilder.java:351)
    at java.base/java.util.Optional.map(Optional.java:265)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildInputDatasets(OpenLineageRunEventBuilder.java:349)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:305)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:241)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:95)
    at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1471)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```
OL version 0.30.1. May I check if this has already been reported/been fixed in the later releases of OL? Thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-19 03:16:08
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-19 22:35:34

*Thread Reply:* Got it thanks!

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-19 00:22:03

Also, just checking, is there a way to set the log level for OL separately for spark? Or does it always use the underlying spark context log level?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-19 03:17:21

*Thread Reply:* this is how we set it in tests -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/log4j.properties
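If you only need to raise the OL logger level, the relevant entry from that kind of file looks roughly like this (assuming the log4j 1.x properties format that Spark distributions ship with; adjust for log4j2):
```
log4j.logger.io.openlineage=DEBUG
```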

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-19 22:35:16

*Thread Reply:* Ahh I see... it goes in via log4j properties. Is there any plan to make this configurable via simpler means, say an env variable or dedicated Spark configs?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-20 02:27:06

*Thread Reply:* We didn't plan anything so far. But feel free to create an issue and justify why this is important. Perhaps more people share the same feeling about it.

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-20 02:37:45

*Thread Reply:* Sure, I'll do that then. Thanks! 🙂

Zacay Daushin (zacayd@octopai.com)
2023-12-19 04:45:05

hi, does someone use the OpenLineage solution to get metadata from Airflow?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-19 05:06:24

*Thread Reply:* hey Zacay, do you have any issue with using Airflow integration?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:42:30

*Thread Reply:* Do I need to install openlineage or only configure it?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:43:03

*Thread Reply:* I installed Marquez and pointed the Airflow cfg to listen to marquez:5000

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 05:43:12

*Thread Reply:* what version of Airflow are you using?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:43:55

*Thread Reply:* Version: v2.8.0

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:44:19

*Thread Reply:* 2.8.0

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 05:46:09

*Thread Reply:* you should follow this guide then https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

as with any other Airflow provider you need to install it, e.g. with pip install apache-airflow-providers-openlineage

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:47:07

*Thread Reply:* and are these variables kept in airflow.cfg or in .env?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 05:50:12

*Thread Reply:* these are Airflow config variables, so you can either set them in airflow.cfg or use environment variables as described here https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#envvar-AIRFLOW__-SECTION-__-KEY
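for example, the [openlineage] section entries map to env vars like this (values here are illustrative):
```
AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "http://marquez:5000"}'
AIRFLOW__OPENLINEAGE__NAMESPACE=my-namespace
```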

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:54:21

*Thread Reply:* I created a DAG that creates a table and inserts one line into it, then in the airflow.cfg:
```
[openlineage]
transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
namespace='airflow'
```

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:54:38

*Thread Reply:* and on 10.0.19.7 there is a URL of Marquez on port 300

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:54:51

*Thread Reply:* i run the dag but see no lineage

Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:54:58

*Thread Reply:* is there any log to get?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 05:57:16

*Thread Reply:* there would be if you enable debug logs

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:00:28

*Thread Reply:* in the Airflow.cfg?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:14:07

*Thread Reply:* Yes. I’m assuming you don’t see anything yet in logs without enabling debug level?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:15:05

*Thread Reply:* I see the Airflow logs

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:15:17

*Thread Reply:* but nothing related to the lineage

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:17:04

*Thread Reply:* in Admin > Plugins can you see whether you have OpenLineageProviderPlugin and if so, are there listeners?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:19:02

*Thread Reply:* there is an OpenLineageProviderPlugin but no listeners

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:20:52

*Thread Reply:* listeners are disabled following this logic:
```
def _is_disabled() -> bool:
    return (
        conf.getboolean("openlineage", "disabled", fallback=False)
        or os.getenv("OPENLINEAGE_DISABLED", "false").lower() == "true"
        or (
            conf.get("openlineage", "transport", fallback="") == ""
            and conf.get("openlineage", "config_path", fallback="") == ""
            and os.getenv("OPENLINEAGE_URL", "") == ""
            and os.getenv("OPENLINEAGE_CONFIG", "") == ""
        )
    )
```
so maybe your config is not loaded properly?
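so, per that logic, at least one of these must be set for the listeners to register, e.g. (illustrative values):
```
# in airflow.cfg
[openlineage]
transport = {"type": "http", "url": "http://marquez:5000"}

# or via environment
OPENLINEAGE_URL=http://marquez:5000
```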

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:25:38

*Thread Reply:* Dont

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:25:44

*Thread Reply:* here is the config

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:25:56

*Thread Reply:*

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:25:59

*Thread Reply:* here is also a .env

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:27:25

*Thread Reply:* you have transport and namespace twice under openlineage section, second ones are empty

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:32:23

*Thread Reply:* apparently .env is not taken into account, not sure where the file is and what is your deployment

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:32:44

*Thread Reply:* also, AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend is needed only for Airflow <2.3

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:34:54

*Thread Reply:* ok so now I have in airflow.cfg:
```
[openlineage]
transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
namespace='my-namespace'
```
But still when I run I see no lineage

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:41:15

*Thread Reply:* can you verify and confirm that changes in your Airflow config are applied? I don’t see any other reason, I also can’t tell what’s your deployment

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:43:32

*Thread Reply:* Can you send me an example of airflow.cfg that works and i will try to compare

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:44:34

*Thread Reply:* the one you sent seems ok, it may be a matter of how you configure Airflow to read it, where you put changed config file

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:45:16

*Thread Reply:* it is in the same place as the logs and dags directories

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:47:22

*Thread Reply:* okay, let’s try this

please change temporarily expose_config = False to expose_config = True and check whether you can see config in the UI under Admin > Configuration

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:49:21

*Thread Reply:* I need to stop and start the docker?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:50:23

*Thread Reply:* yes

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:53:41

*Thread Reply:* i changed but see no change on the UI under Admin->Configuration

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:54:11

*Thread Reply:* I wonder if the cfg file affects Airflow at all

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:54:29

*Thread Reply:* maybe it is relevant to the docker-compose.yaml?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 07:56:44

*Thread Reply:* again, I don't know how you deploy your Airflow instance. If you need more help with that you might ask in the Airflow Slack or learn more from the Airflow docs (which are pretty good imho)

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:16:42

*Thread Reply:* It seems that the config is on one of the containers

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:16:47

*Thread Reply:* so i got inside

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:17:05

*Thread Reply:* but I think that if I stop and start it doesn't save the change

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 09:21:53

*Thread Reply:* I would use env vars to override Airflow config entries (I passed the link above) and set them in the compose file for all Airflow containers. I'm assuming you're using the Airflow-provided docker compose yaml

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:22:25

*Thread Reply:* right i use docker-compose yaml

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:28:23

*Thread Reply:* so you mean to create an .env file in the location of the yaml and there put AIRFLOW__OPENLINEAGE__DISABLED=False

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 09:30:20

*Thread Reply:* no, to put this into yaml file

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:30:37

*Thread Reply:* OK

Zacay Daushin (zacayd@octopai.com)
2023-12-20 09:32:51

*Thread Reply:* I put in the yaml:
```
OPENLINEAGE_DISABLED=false
OPENLINEAGE_URL=http://10.0.19.7:5000
AIRFLOW__OPENLINEAGE__NAMESPACE=food_delivery
```
and I will stop and start and let's see how it goes

harsh loomba (hloomba@upgrade.com)
2023-12-20 14:04:37

*Thread Reply:* out of curiosity, do you use Astronomer bootstrap solution to spinup airflow with openlineage?

Zacay Daushin (zacayd@octopai.com)
2023-12-19 04:45:13

i used this post https://openlineage.io/docs/guides/airflow_proxy/

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-19 08:49:49

*Thread Reply:* For newest Airflow provider documentation, please look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-20 02:37:02

Hi team, got this error with OL 1.6.2 on dbr aws:
```
23/12/20 07:10:18 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.
java.lang.IllegalStateException: LiveListenerBus is stopped.
    at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:109)
    at org.apache.spark.scheduler.LiveListenerBus.addToSharedQueue(LiveListenerBus.scala:66)
    at org.apache.spark.sql.QueryProfileListener$.initialize(QueryProfileListener.scala:122)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$executeDependedOperations$1(DatabricksILoop.scala:652)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
    at com.databricks.unity.UCSUniverseHelper$.withNewScope(UCSUniverseHelper.scala:8)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.executeDependedOperations(DatabricksILoop.scala:580)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.initializeSharedDriverContext(DatabricksILoop.scala:448)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.getOrCreateSharedDriverContext(DatabricksILoop.scala:294)
    at com.databricks.backend.daemon.driver.DriverCorral.driverContext(DriverCorral.scala:292)
    at com.databricks.backend.daemon.driver.DriverCorral.<init>(DriverCorral.scala:159)
    at com.databricks.backend.daemon.driver.DriverDaemon.<init>(DriverDaemon.scala:71)
    at com.databricks.backend.daemon.driver.DriverDaemon$.create(DriverDaemon.scala:452)
    at com.databricks.backend.daemon.driver.DriverDaemon$.initialize(DriverDaemon.scala:546)
    at com.databricks.backend.daemon.driver.DriverDaemon$.wrappedMain(DriverDaemon.scala:511)
    at com.databricks.DatabricksMain.$anonfun$main$1(DatabricksMain.scala:149)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.DatabricksMain.$anonfun$withStartupProfilingData$1(DatabricksMain.scala:498)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:571)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:666)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:684)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:426)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:196)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:424)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:418)
    at com.databricks.DatabricksMain.withAttributionContext(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:470)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:455)
    at com.databricks.DatabricksMain.withAttributionTags(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:661)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:580)
    at com.databricks.DatabricksMain.recordOperationWithResultTags(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:571)
    at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:540)
    at com.databricks.DatabricksMain.recordOperation(DatabricksMain.scala:91)
    at com.databricks.DatabricksMain.withStartupProfilingData(DatabricksMain.scala:498)
    at com.databricks.DatabricksMain.main(DatabricksMain.scala:148)
    at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala)
```
DBR Spark 3.4, JRE 1.8. Is Spark 3.4 not supported for OL as of yet?

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-20 02:37:29

*Thread Reply:* OL 1.3.1 works fine btw... This error only pops up with OL 1.6.2

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-20 03:00:26

*Thread Reply:* Error happens when starting the cluster itself

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-20 04:56:09

*Thread Reply:* Spark 3.4 yes, but Databricks runtime 14.x was not tested so far (edited)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-20 09:00:12

*Thread Reply:* I've just tested the existing Databricks integration on the latest DBR 14.2 and Spark 3.5. The Databricks integration tests we have are passing, so please share more details on how you end up with logs like the above.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-21 01:59:54

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2328 -> link to tests being run

👀 Anirudh Shrinivason
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-29 03:12:16

*Thread Reply:* Hi @Paweł Leszczyński sorry missed out on this earlier... Actually, I got this while trying to simply start the dbr cluster with the OL spark configs. Nothing else done from my end for this. Works with OL 1.3.1 but not with 1.6.2

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-29 03:14:03

*Thread Reply:* there was an issue in 1.6.2 related to a logging class on the classpath which may be responsible for this, but it got solved in the most recent release.

👍 Anirudh Shrinivason
Zacay Daushin (zacayd@octopai.com)
2023-12-20 05:39:45

hi, does someone use Airflow lineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 05:40:58

*Thread Reply:* hey Zacay, I’ve already tried to help you here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1702980384927369?thread_ts=1702979105.626809&cid=C01CK9T7HKR

Zacay Daushin (zacayd@octopai.com)
2023-12-20 08:57:07

*Thread Reply:* thanks for the help

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-20 08:57:26

*Thread Reply:* did you manage to make it work?

harsh loomba (hloomba@upgrade.com)
2023-12-20 14:04:58

*Thread Reply:* out of curiosity, do you use the Astronomer bootstrap solution to spin up Airflow with OpenLineage?

Zacay Daushin (zacayd@octopai.com)
2023-12-20 07:25:53
Shahid Shaikh (ssshahidwin@gmail.com)
2023-12-20 14:05:55

Hi everyone, I'm trying to add an extra facet at the job level but am not able to. To explain more: assume d1 (input) and d2 (output) are databases, and a Python file with classes and functions converts data from d1 to d2. I'm extracting some info about that Python file with an external Python parser and want to attach it to the job that sits between d1 and d2. I tried adding:
```
custom_facets = {
    "customKey": "customValue",
    "anotherKey": "anotherValue"  # Add more custom facets as needed
}

# Creating a Job with custom facets
job = Job(
    namespace=file_name,
    name=single_fn_info.name,
    facets=custom_facets  # Include the custom facets here
)
```

but it is not working. I looked into this documentation https://openlineage.io/docs/spec/facets/custom-facets/ and tried it, but it's still not showing any facet in the UI. Currently, the facets section for every job only shows an empty root dict and nothing else. What am I missing, and how can we implement this?
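In the OpenLineage spec, each facet value must be an object carrying `_producer` and `_schemaURL`, keyed by a facet name; a bare string like "customValue" is not a valid facet, which would explain the empty root dict. A minimal sketch with the openlineage-python client (the facet class, its fields, and the schema URL are hypothetical, and the exact base-class hooks may differ between client versions):
```
import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job

@attr.s
class PythonParserJobFacet(BaseFacet):
    # Hypothetical fields produced by the external Python parser
    customKey: str = attr.ib()
    anotherKey: str = attr.ib()

    @staticmethod
    def _get_schema() -> str:
        # Hypothetical URL of a JSON schema you host for this facet
        return "https://example.com/schemas/PythonParserJobFacet.json"

# Facets are a dict of facet name -> facet object, not plain strings
job = Job(
    namespace="my_namespace",
    name="my_job",
    facets={"pythonParser": PythonParserJobFacet(customKey="customValue", anotherKey="anotherValue")},
)
```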

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-21 05:12:26

@Michael Robinson can we vote for the OL release? #2319 brings a significant fix to Spark logging, and I think it's worth releasing without waiting for the release cycle.

➕ Paweł Leszczyński, Jakub Dardziński, Maciej Obuchowski, Rodrigo Maia, Michael Robinson, harsh loomba
👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 09:55:59

*Thread Reply:* Thanks, @Paweł Leszczyński, the release is authorized and will be initiated as soon as possible within 2 business days.

🙌 Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 10:54:23
harsh loomba (hloomba@upgrade.com)
2023-12-21 13:00:32

*Thread Reply:* @Jakub Dardziński any progress on this one? https://github.com/apache/airflow/pull/35794 Wondering if this could have been part of the release.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 13:02:36

*Thread Reply:* I'm dependent on Airflow committers, pinging in the PR from time to time

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 13:02:57

*Thread Reply:* if you wish to comment that you need the PR it might be taken into consideration as well

harsh loomba (hloomba@upgrade.com)
2023-12-21 13:13:47

*Thread Reply:* wait, this will be supported by the openlineage-airflow package as well, right? I see you have made changes in the Airflow provider package but I don't see changes in the standalone openlineage repo 🤔

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 13:15:22

*Thread Reply:* We aim to make as few changes as possible to the openlineage-airflow package to encourage users to use the provider one. This change would be quite huge and probably won't be backported.

harsh loomba (hloomba@upgrade.com)
2023-12-21 13:17:54

*Thread Reply:* We haven't moved to the provider packages yet because our team is still making a decision. The move to the provider package would require a lot of changes on my end, so I would prefer this feature in the standalone package.

Hitesh (splicer9904@gmail.com)
2023-12-21 05:25:29

Hi team, I am trying to send OpenLineage events to a Kafka Event Hub, and for that I am passing spark.openlineage.transport.properties.sasl.jaas.config as org.apache.kafka.common.security.plain.PlainLoginModule required username=\"connection_string\" password=\"connection_string\";

when I run the job, I get the error Value not specified for key 'username' in JAAS config

I am getting the connection string from the Kafka Event Hub properties and I'm running openlineage-spark 1.4.1

Can someone please help me out?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:20:53

*Thread Reply:* I don't think we can parse those non-escaped spaces as regular properties

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:23:01

*Thread Reply:* can you try something like "org.apache.kafka.common.security.plain.PlainLoginModule required username='connection_string' password='connection_string'"
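For reference, a minimal sketch of what the full Kafka transport configuration could look like for an Event Hubs Kafka endpoint, with the JAAS value single-quoted to avoid the space-parsing problem (the broker address, topic, and connection string are placeholder assumptions; for Event Hubs the SASL username is typically the literal $ConnectionString):
```
from pyspark.sql import SparkSession

# Hypothetical connection string; single quotes inside the JAAS value
jaas = (
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    "username='$ConnectionString' "
    "password='Endpoint=sb://mynamespace.servicebus.windows.net/;...';"
)

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "kafka")
    .config("spark.openlineage.transport.topicName", "lineage-events")
    .config("spark.openlineage.transport.properties.bootstrap.servers",
            "mynamespace.servicebus.windows.net:9093")
    .config("spark.openlineage.transport.properties.security.protocol", "SASL_SSL")
    .config("spark.openlineage.transport.properties.sasl.mechanism", "PLAIN")
    .config("spark.openlineage.transport.properties.sasl.jaas.config", jaas)
    .getOrCreate()
)
```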

Abdallah (abdallah@terrab.me)
2023-12-21 11:14:44

Hello,

I hope you are doing well.

I am wondering if any of you had the same issue before.

Thank you.
```
23/12/21 16:01:26 WARN DatabricksEnvironmentFacetBuilder: Failed to load dbutils in OpenLineageListener:
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext$lzycompute(DbfsUtilsImpl.scala:27)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext(DbfsUtilsImpl.scala:26)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$driverContext$1(DbfsUtilsImpl.scala:29)
  at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.driverContext(DbfsUtilsImpl.scala:29)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.sc(DbfsUtilsImpl.scala:30)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core$lzycompute(DbfsUtilsImpl.scala:32)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core(DbfsUtilsImpl.scala:32)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$core$1(DbfsUtilsImpl.scala:34)
  at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.core(DbfsUtilsImpl.scala:34)
  at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.mounts(DbfsUtilsImpl.scala:166)
  at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksMountpoints(DatabricksEnvironmentFacetBuilder.java:142)
  at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksEnvironmentalAttributes(DatabricksEnvironmentFacetBuilder.java:98)
  at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:60)
  at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:32)
  at io.openlineage.spark.api.CustomFacetBuilder.accept(CustomFacetBuilder.java:40)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$27(OpenLineageRunEventBuilder.java:491)
  at java.lang.Iterable.forEach(Iterable.java:75)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildRunFacets$28(OpenLineageRunEventBuilder.java:491)
  at java.util.ArrayList.forEach(ArrayList.java:1259)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRunFacets(OpenLineageRunEventBuilder.java:491)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:313)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:250)
  at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:167)
  at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$10(OpenLineageSparkListener.java:151)
  at java.util.Optional.ifPresent(Optional.java:159)
  at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:147)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:107)
  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:107)
  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:98)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1639)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:98)
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 12:01:59

*Thread Reply:* Can you provide additional details: what is your version of OpenLineage integration, Spark, Databricks; how you're running your job, additional logs

Abdallah (abdallah@terrab.me)
2023-12-21 12:02:42

*Thread Reply:* Version of OL 1.2.2

Abdallah (abdallah@terrab.me)
2023-12-21 12:03:07

*Thread Reply:* spark_version: 11.3.x-scala2.12

Abdallah (abdallah@terrab.me)
2023-12-21 12:03:32

*Thread Reply:* Running job through spark-submit

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-22 06:03:06

*Thread Reply:* @Abdallah could you try to upgrade to 1.7.0, which was released recently?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 13:08:08

@channel We released OpenLineage 1.7.0! In this release, we turned off support for Airflow versions >=2.8.0 in the Airflow integration and added a parent run facet to COMPLETE and FAIL events, also in Airflow. If you’re on the most recent release of Airflow and wish to continue receiving events from Airflow after upgrading, use the OpenLineage Airflow Provider instead.

Added • Airflow: add parent run facet to COMPLETE and FAIL events in Airflow integration #2320 @kacpermuda Adds a parent run facet to all events in the Airflow integration.

Removed • Airflow: remove Airflow 2.8+ support #2330 @kacpermuda To encourage use of the Provider, this removes the listener from the plugin if the Airflow version is >=2.8.0.

A number of bug fixes were released as well, including:
• Airflow: repair up.sh for MacOS #2316 #2318 @kacpermuda Some scripts were not working well on MacOS. This adjusts them.
• Airflow: repair run_id for FAIL event in Airflow 2.6+ #2305 @kacpermuda The run_id in a FAIL event was different than in the START event for Airflow 2.6+.
• Flink: name Kafka datasets according to the naming convention #2321 @pawel-big-lebowski Adds a kafka:// prefix to Kafka topic datasets’ namespaces.
• Spec: fix inconsistency with Redshift authority format #2315 @davidjgoss Amends the Authority format for consistency with other references in the same section.

Thanks to all the contributors, including new contributor @Kacper Muda! Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.7.0 Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.6.2...1.7.0 Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage PyPI: https://pypi.org/project/openlineage-python/

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 13:10:40

*Thread Reply:* Shoutout to @Kacper Muda for huge contribution from the very start 🚀

:gratitude_thank_you: Kacper Muda, Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 13:11:22

*Thread Reply:* +1 on that! I think there were even more changes from Kacper than are listed here

:gratitude_thank_you: Kacper Muda
Kacper Muda (kacper.muda@getindata.com)
2023-12-21 14:43:38

*Thread Reply:* Thanks! Just a quick note: the listener API was introduced in Airflow 2.3, so it was already missing in the plugin for Airflow < 2.3 - I made no changes there. I just removed it from >=2.8.0, to encourage use of the provider, as You said 🙂 But the result is exactly as You said: there is no listener in <2.3 and >=2.8 😄

Kacper Muda (kacper.muda@getindata.com)
2023-12-21 14:45:57

*Thread Reply:* So in result: I don't think we turned off support for Airflow versions <2.3.0 🙂

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 14:46:48

*Thread Reply:* This is good to know — thanks. I’ve updated the notes here and will do so elsewhere, as well.

🙌 Kacper Muda, Jakub Dardziński
Shahid Shaikh (ssshahidwin@gmail.com)
2023-12-30 14:33:21

*Thread Reply:* Hi @Jakub Dardziński, before, for a custom operator, we used to write a custom extractor file and save it with the other extractors under the Airflow integration folder.

What is the procedure now with this new provider update?

Kacper Muda (kacper.muda@getindata.com)
2024-01-01 05:08:40

*Thread Reply:* Hey @Shahid Shaikh, as of my understanding nothing has changed regarding Custom Extractors, You can still use them if You wish to (see the bottom of this docs, to see how they can be registered in provider package). However, in my opinion, the best way to use the provider package is to implement OpenLineage methods directly in the operators, as described here. Let me know if this answers Your question.
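To make the second option concrete, a rough sketch of implementing an OpenLineage method directly on a custom operator with the provider package (table and connection names are made up; check the docs linked above for the exact OperatorLineage fields in your provider version):
```
from airflow.models import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset

class CopyTableOperator(BaseOperator):
    def execute(self, context):
        ...  # move rows from src_table to dst_table

    # Called by the OpenLineage provider listener after the task finishes
    def get_openlineage_facets_on_complete(self, task_instance) -> OperatorLineage:
        return OperatorLineage(
            inputs=[Dataset(namespace="postgres://db.example:5432", name="public.src_table")],
            outputs=[Dataset(namespace="postgres://db.example:5432", name="public.dst_table")],
        )
```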

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-04 04:46:34

*Thread Reply:* Yes, thanks @Kacper Muda. I referred to the docs and was able to do the work as you said, using the provider package directly.

Shahid Shaikh (ssshahidwin@gmail.com)
2023-12-28 00:46:57

Hi everyone, I was looking into the Airflow integration with Marquez. I ran a number of DAGs and observed that each task in every DAG creates a new job on Marquez, and we are not able to see any job-to-job linkage on the Marquez map. Why is it not showing? How can we add this feature?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 05:57:49

Hey all! I've tested merge operations for Spark (on Databricks) and the resulting OpenLineage events. So far I've failed to produce any inputs/outputs for the jobs, even though the data schema is present in the logical plan attribute of the JSON OL event.

My test consisted of:
• Reading from a parquet/CSV file in DBFS (Databricks file storage)
• Creating a temporary table
• Performing the merge with spark.sql("merge... target...source ") with the target being a table in Hive.
Test variables:
• Source: Parquet/CSV
• Target: Hive table
• OL x Spark versions:
◦ OL 1.6.2 -> Spark 3.3.2
◦ OL 1.6.2 -> Spark 3.2.1
◦ OL 1.0.0 -> Spark 3.2.1
I've created a PDF with some code samples and OL input and output attributes.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 05:59:53

*Thread Reply:* @Sai @Anirudh Shrinivason Did you have any luck with the merge events?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 06:00:51

*Thread Reply:* @Paweł Leszczyński I know you were working on this in the past. Am I missing something here?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 06:02:23

*Thread Reply:* Should i test with more recent versions of spark and the latest of OL?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-28 06:31:04

*Thread Reply:* Are you able to verify if the issue is databricks runtime specific or the same happens on vanilla Spark docker containers?

We do have an integration test for merge into and delta tables here -> https://github.com/OpenLineage/OpenLineage/blob/17fa874083850b711f364c4de656e14d79[…]/java/io/openlineage/spark/agent/SparkDeltaIntegrationTest.java

would this test fail on databricks as well?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 08:30:56

*Thread Reply:* the test on databricks fails to generate inputs and outputs for the merge operation

"job": { "namespace": "default", "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_merge_into_command_edge", "facets": {} }, "inputs": [], "outputs": []

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-28 08:32:11

*Thread Reply:* The create view also fails to generate inputs and outputs (but I don't know if this one is supposed to):
```
"job": {
  "namespace": "default",
  "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_create_view_command",
  "facets": {}
},
"inputs": [],
"outputs": []
```

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2023-12-28 23:00:10

*Thread Reply:* Hey @Rodrigo Maia, I'm using OL version 1.3.1, and it seems to work for most merge into cases, though not all...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-29 02:36:14

*Thread Reply:* @Rodrigo Maia will try to add the existing testMergeInto to be run on Databricks and see why it is failing

🙌 Rodrigo Maia
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-02 08:39:50

*Thread Reply:* looking into this and hopefully will have solution this week

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-03 06:27:09

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2348

❤️ Rodrigo Maia
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-01-02 10:47:11

Hello! My company has a job opening for a Senior Data Engineer in our Data Office, in Czech Republic - https://www.linkedin.com/feed/update/urn:li:activity:7147975999253610496/ DM me here or on LinkedIn if you have questions.

🔥 Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-01-02 11:00:58

*Thread Reply:* created a #jobs channel, as we’re continuing to build a strong community here 🙂

🚀 Jakub Dardziński
Michael Robinson (michael.robinson@astronomer.io)
2024-01-02 11:30:00

@channel This Friday, Jan. 4, is the last day to respond to the 2023 Ecosystem Survey, which will close at 5 pm ET that day. It’s an opportunity to tell us about your organization’s lineage needs and priorities for the purpose of updating the project roadmap. For example, you can tell us: • which data connectors you’d like OL to support • which cloud data platforms you’d like OL to support • which additional orchestrators you’d like OL to support • and more. Your feedback is very important. Thanks in advance!

Rodrigo Martins Cardoso (rodrigo.cardoso1@ibm.com)
2024-01-02 12:27:27

Hi everyone! Just passing by for a quick question: after reading SparkColumnLineage and ColumnLineageDatasetFacet.json, I can see in the future work section of the first URL that:

> Current version of the mechanism allows finding input fields that were used to produce the output field but does not determine how they were used
Does this mean that the transformationDescription and transformationType from ColumnLineageDatasetFacet are not sent in the events?

Thanks in advance for any inputs! Have a nice day.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-03 04:59:48

*Thread Reply:* Currently, Spark integration does not fill transformation fields. Additionally, there's a proposal to improve this: https://github.com/OpenLineage/OpenLineage/issues/2186

Rodrigo Martins Cardoso (rodrigo.cardoso1@ibm.com)
2024-01-03 05:17:09

*Thread Reply:* Thanks a lot for the clarification!

Vinay R (vinayrme58@gmail.com)
2024-01-03 11:33:45

Hi,

I'm currently working on extracting transformation logic from Redshift SQL insert statements. For instance, in the query 'SELECT A+B as Col1 FROM table1,' I'm trying to fetch the transformation logic 'A+B' for the column 'Col1.' I've been using regex, but it has limitations with subqueries, unions, and CTEs. Are there any specialized tools or suggestions for improving this process, especially when dealing with complex SQL structures like subqueries and unions?

Queries are in rsql format. Thanks in advance any help would be appreciated.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-03 14:04:27

*Thread Reply:* We don’t parse the actual transformations in the SQL parser
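As a workaround outside OpenLineage's sqlparser, a general-purpose SQL parser can recover per-column expressions more robustly than regex; a rough sketch with the third-party sqlglot library (not part of OpenLineage), assuming its Redshift dialect handles your rsql queries:
```
import sqlglot
from sqlglot import exp

sql = "SELECT A + B AS Col1 FROM table1"

# Walk every SELECT (including subqueries, unions and CTEs) and map each
# output column alias to the SQL of the expression that produced it
for select in sqlglot.parse_one(sql, read="redshift").find_all(exp.Select):
    for projection in select.expressions:
        name = projection.alias_or_name
        expression = projection.this if isinstance(projection, exp.Alias) else projection
        print(name, "<-", expression.sql())
# prints: Col1 <- A + B
```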

Mariusz Górski (gorskimariusz13@gmail.com)
2024-01-03 12:14:41

Hey, I’ve submitted an update to the OL docs re: Spark - anyone fancy conducting a review? 👀 🙏

https://github.com/OpenLineage/docs/pull/268

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-04 04:43:48

Hi everyone, is job-to-job linkage possible in OpenLineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 04:47:05

*Thread Reply:* OpenLineage focuses on data lineage which means we don’t explicitly track job to job lineage but rather job -> dataset -> job

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-04 04:53:24

*Thread Reply:* Thanks for the insight, Jakub! That makes sense. I'm curious, though, if there's a way within openlineage for achieving this kind of linkage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 04:54:53

*Thread Reply:* sure, you can create a custom facet to expose such relations. In terms of Airflow, there is an Airflow-specific run facet that contains upstream and downstream task IDs, but it's informative rather than a way to achieve explicitly what you're trying to do. A rough sketch of such a custom facet is below.
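A minimal sketch of a custom run facet exposing job-to-job relations with the openlineage-python client (the facet name, field, and schema URL are all made up for the example):
```
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class UpstreamJobsRunFacet(BaseFacet):
    # Hypothetical: fully-qualified names of jobs this run depends on
    upstreamJobs: list = attr.ib(factory=list)

    @staticmethod
    def _get_schema() -> str:
        return "https://example.com/schemas/UpstreamJobsRunFacet.json"

# Attach under run facets when emitting, e.g.:
# Run(runId=my_run_id, facets={"upstreamJobs": UpstreamJobsRunFacet(upstreamJobs=["ns.job1"])})
```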

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-04 05:02:37

*Thread Reply:* Thanks for clarifying that, Jakub. I'll look into that as you suggested.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 12:25:16

@channel UK-based members: a speaker is needed for a meetup with Confluent on January 31st in London. Please reply or DM me if you’re interested in speaking.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 13:21:25

@channel This month’s TSC meeting is next Thursday the 11th at 10am PT. On the tentative agenda: • announcements • recent releases • open discussion • more (TBA) More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

William Sia (willsia@amazon.com)
2024-01-08 03:35:03

Hi everyone, I am trying to use the datasetVersion facet but want to confirm that I am using it correctly; at the moment the behaviour seems a bit weird. Basically we have a pipeline like this:

sourceData1 -> job1 -> datasetA
datasetA v1, sourceData2 -> job2 -> datasetB

A job can be configured to use a specific version of an incoming dataset, so in the above example job2 has been configured to always use datasetA v1. So if I run job1 and it produces datasetA v2, then run job2, it should still pick up v1.

When integrating our pipeline by emitting events to OpenLineage (Marquez), we include:
```
"inputs": [
  {
    "namespace": "test.namespace",
    "name": "datasetA",
    "facets": {
      "version": {
        "_producer": "<https://some.producer.com/version/1.0>",
        "_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json>",
        "datasetVersion": "1"
      }
    }
  }
],
```
Here "datasetVersion": "1" refers to the datasetVersion of the output dataset of job1.

But what happens at the moment is that job2 always uses the latest version of datasetA, and it rewrites the latest version's datasetVersion to 1.

So I'm just wondering if I am emitting the events wrongly or if datasetVersion is not meant to be used for this purpose.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-08 04:57:43

*Thread Reply:* Hi @William Sia, OpenLineage-Spark integration fetches latest version of the dataset. If you're deliberately reading some older one, OpenLineage event will still point to the latest which is kind of a bug or missing feature. feel free to create an issue for this. If possible, please provide some code snippet to reproduce this. Is this happening on delta/iceberg?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 09:46:16

*Thread Reply:* > But what happened at the moment is, job2 always use the latest version of datasetA and it rewrites the latest version's version of datasetA to 1 You mean that event contains different version than you emit? Or do you mean you see different thing on the Marquez UI?

William Sia (willsia@amazon.com)
2024-01-08 21:56:19

*Thread Reply:* @Maciej Obuchowski, what I meant is I am expecting my job2 to have lineage to the input of datasetA v1 (because I specify the version facet of the dataset input as 1), but what happened is that job2 has lineage to the latest version of datasetA, and the datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run of job1 that updated the version to 2). So job2, which has datasetA as an input, modified the Version Facet of the latter.

William Sia (willsia@amazon.com)
2024-01-08 21:58:52

*Thread Reply:* Cheers @Paweł Leszczyński, will raise an issue with more details on GitHub. At the moment we're enabling the application that my team is working on to emit events, so I am emitting the events manually to mimic the data flow in our system.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-09 07:32:02

*Thread Reply:* > (because I specify the version facet of the dataset input to 1) So what I understand is that you emit event for job2 that has datasetA as input with version 1 as shown above

> but what happened is job2 has lineage latest version of datasetA on UI I understand? I think it's Marquez issue - properly display which dataset version you're reading. Can you post screenshot on Marquez issue showing exactly where you'd expect to see different data?

> datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run on job1 that updates the version to 2 ). Yeah, if job1 updated datasetA to next version, then it makse sense.

William Sia (willsia@amazon.com)
2024-01-11 22:08:49

*Thread Reply:* I had raised this issue with some sample payload here https://github.com/MarquezProject/marquez/issues/2733 , Let me know if you have more details

Michael Robinson (michael.robinson@astronomer.io)
2024-01-08 12:55:01

@channel Meetup alert: our first OpenLineage x Apache Kafka® meetup will be happening on January 31st (in-person) at 6:00 PM in Confluent’s London offices. Keep an eye out for more details and sign up here: https://www.meetup.com/london-openlineage-meetup-group/events/298420417/.

🚀 alexandre bergere, Maciej Obuchowski
❤️ Willy Lulciuc, Maciej Obuchowski, Eric Veleker
Zacay Daushin (zacayd@octopai.com)
2024-01-09 03:06:38

Hi, are you familiar with the Spline solution? Do you know if we can use it on Databricks without adding any code to the notebooks in order to get the notebook name?

Damien Hawes (damien.hawes@booking.com)
2024-01-09 03:42:24

*Thread Reply:* Spline is maintained by an entirely different group of people. Thus, we aren't familiar with the semantics of Spline. I suggest that you reach out to the maintainers of Spline to understand more about its capabilities.

➕ Maciej Obuchowski
Zacay Daushin (zacayd@octopai.com)
2024-01-09 03:42:57

*Thread Reply:* thanks

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-01-09 08:57:33

Hi folks, we are on MWAA-supported Airflow 2.7.2. For OpenLineage integration, do we have a transport type of kinesis to emit the events?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-09 09:08:01

*Thread Reply:* there’s no Kinesis transport available in Python at the moment

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-01-09 09:12:04

*Thread Reply:* Thank you @Jakub Dardziński. Can we expect support in the future? Any suggested alternative till then?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-09 09:13:36

*Thread Reply:* that’s open source project 🙂 if someone finds it useful to contribute or the community decides to implement this - that’ll probably be supported

I’m not sure what’s your use case but we have Kafka transport

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-01-09 09:20:48

*Thread Reply:* Thanks @Jakub Dardziński. We have limited support for Kafka and we are widely using Kinesis. The use case is that we are trying to build an in-house lineage store where we can emit these events. If Kafka is the only streaming option available, we can give it a try.

Damien Hawes (damien.hawes@booking.com)
2024-01-09 09:52:16

*Thread Reply:* I'd look to see if AWS offers something like KafkaProxy, but for Kinesis. That is, you emit via HTTP and it forwards to the Kinesis stream.

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-01-09 10:00:24

*Thread Reply:* Nice, I'll take a look into it.

Harel Shein (harel.shein@gmail.com)
2024-01-09 10:03:55

*Thread Reply:* agree with @Jakub Dardziński & @Damien Hawes. the transport interface is also pretty straightforward, so adding support for Kinesis might be trivial - in case you do decide to contribute

👍 Anand Thamothara Dass
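To illustrate how straightforward the interface is, a rough sketch of a custom Kinesis transport for the Python client using boto3; treat the base-class hooks (Config, Transport, kind) as assumptions to verify against the client version you run, since registration details have shifted between releases:
```
import boto3
from openlineage.client.serde import Serde
from openlineage.client.transport import Config, Transport

class KinesisConfig(Config):
    # Hypothetical config object; real Config subclasses are built from the
    # parsed transport config dict
    def __init__(self, stream_name: str, region: str):
        self.stream_name = stream_name
        self.region = region

class KinesisTransport(Transport):
    kind = "kinesis"
    config = KinesisConfig

    def __init__(self, config: KinesisConfig):
        self.config = config
        self.client = boto3.client("kinesis", region_name=config.region)

    def emit(self, event):
        # Partition by run id so events for one run stay ordered within a shard
        self.client.put_record(
            StreamName=self.config.stream_name,
            Data=Serde.to_json(event).encode("utf-8"),
            PartitionKey=event.run.runId,
        )
```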
Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-10 14:22:18

Hello everyone,

I'm exploring OpenLineage + Marquez for lineage in our warehouse solution connecting to Snowflake. We've set up ETL jobs for various operations like Select, Update, Delete, and Merge. Is there a way to capture Change Data Capture (CDC) at the column level using OpenLineage? Our goal is to present this CDC information as facets on Marquez. Any insights or suggestions on this would be greatly appreciated!

👍 alexandre bergere
Harel Shein (harel.shein@gmail.com)
2024-01-11 05:14:03

*Thread Reply:* What are you using for CDC?

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-11 16:12:11

*Thread Reply:* @Harel Shein In our current setup, we typically utilize Slowly Changing Dimension (SCD) Type 2 at the source level for Change Data Capture (CDC). In this specific scenario, the source is Snowflake.

During the ETL process, we receive a CDC trigger or event from Snowflake.

Harel Shein (harel.shein@gmail.com)
2024-01-12 13:01:58

*Thread Reply:* what do you use to run this ETL process? If that framework is supported by openlineage, I assume it would work

Michael Robinson (michael.robinson@astronomer.io)
2024-01-10 15:23:47

@channel This month’s TSC meeting, open to all, is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1704392485753579

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-11 04:32:41

Hi everyone,

there is a search option on facets in the Marquez UI. What functionality does it provide?

Do we have the ability to search the lineage we are getting?

Harel Shein (harel.shein@gmail.com)
2024-01-11 05:43:04

*Thread Reply:* great question @Shahid Shaikh! I’m actually not sure, would you mind posting this question on the Marquez slack? https://join.slack.com/t/marquezproject/shared_invite/zt-2afft44fo-nWcmZmdrv7qSxfNl6iOKMg

Shahid Shaikh (ssshahidwin@gmail.com)
2024-01-11 13:35:47

*Thread Reply:* Thanks @Harel Shein, I will follow up on the Marquez Slack.

👍 Harel Shein
Simran Suri (mailsimransuri@gmail.com)
2024-01-11 04:38:39

Hi everyone, I'm currently running my Spark code on Azure Kubernetes Service (AKS), and I'm interested in knowing if OpenLineage can provide cluster details such as its name, etc. In my current setup and run events, I haven't been able to find these cluster-related details.

Harel Shein (harel.shein@gmail.com)
2024-01-11 05:47:13

*Thread Reply:* I’m not sure if all of those details are readily available to the spark application to extract them. If you find a way, those can be reported in a custom facet. The debug facet for Spark would be the closest option to what you’re looking for, but definitely doesn’t contain k8s cluster level data.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-11 07:01:18

*Thread Reply:* Not out of the box, but if those are in environment variables, you can use spark.openlineage.facets.custom_environment_variables=[YOUR_VARIABLE;] to include it as CustomEnvironmentFacet
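For instance, something like this (CLUSTER_NAME is a hypothetical variable name; the facet only picks up variables already set when the listener starts):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Semicolon-separated list of env var names to copy into the event
    .config("spark.openlineage.facets.custom_environment_variables", "[CLUSTER_NAME;]")
    .getOrCreate()
)
```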

Simran Suri (mailsimransuri@gmail.com)
2024-01-11 07:04:01

*Thread Reply:* Thanks, can you please share a reference link for this so that I can try to implement it @Maciej Obuchowski

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-11 07:32:21
Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-11 10:12:27

*Thread Reply:* Would this work if these environment variables were set at runtime?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-11 10:13:45

*Thread Reply:* during runtime of a job? I don't think so, we collect them when we spawn the OpenLineage integration

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-11 10:15:01

*Thread Reply:* I thought so. thank you 🙂

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-11 10:23:42

*Thread Reply:* @Maciej Obuchowski By the way, coming back to this, is there any way to pass some variable value from the executing code/job to the OL spark listener? The example above is something I was looking forward to, like, setting some (ENV) var at run time and having access to this value in the OL event.

Simran Suri (mailsimransuri@gmail.com)
2024-01-15 03:30:06

*Thread Reply:* @Maciej Obuchowski, Actually cluster details are not in ENV variables. In my Airflow setup those are stored in the Airflow connections and I've a connection ID for that, I'm successfully getting the connection ID details but not the associated Spark cluster name and details.

David Goss (david.goss@matillion.com)
2024-01-11 09:01:29

❓ Is there a standard or well-used facet for recording the type of a dataset? e.g. Table vs View at a very simple level.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-11 09:28:11

*Thread Reply:* I think we think of the type of a dataset at the dataset namespace scheme level: something under s3 will always be a link to a path in a bucket.

Table vs View is a specific category of datasets that makes a lot of sense when talking about database datasets, but would not make sense when talking about object storage ones, so my naive view is that's something that should be unique to a particular data source type.

David Goss (david.goss@matillion.com)
2024-01-11 09:30:10

*Thread Reply:* Very fair! I think we’re probably going to use a custom facet for this for our own purposes, I just wanted to check if there was anything standard or standard-adjacent we should align to.

Julien Le Dem (julien@apache.org)
2024-01-11 18:59:34

*Thread Reply:* This sounds good. Please share that custom facet’s schema if you’re willing. This might be a good candidate for a new optional core facet for views.

👍 David Goss
David Goss (david.goss@matillion.com)
2024-02-07 09:49:43

*Thread Reply:* Sorry for the late reply @Julien Le Dem. So we have a custom dataset facet internally right now which is doing a couple of different jobs:
• datasetType captures TABLE vs VIEW etc. We’re getting into more types with APIs, files etc so that will expand
• nameParts captures the component parts of the dataset name where it’s a fully-qualified table name for example - so like DATABASE.SCHEMA.TABLE would be in there as ["TABLE", "SCHEMA", "DATABASE"] - this might seem weird but it’s useful for us to not have to parse this back out later, especially where the canonical name has quotes, escapes etc
```
{
  "$schema": "<https://json-schema.org/draft/2020-12/schema>",
  "$id": "****************/MatillionDatasetFacet.json",
  "title": "Matillion Dataset Facet Schema",
  "description": "Matillion Dataset Facet Schema",
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "description": "The type of the dataset",
      "enum": ["TABLE", "VIEW", "UNKNOWN"]
    },
    "nameParts": {
      "type": "array",
      "description": "The component parts of the dataset name (e.g. table, schema, database), from lowest to highest level",
      "items": { "type": "string" }
    }
  }
}
```

David Goss (david.goss@matillion.com)
2024-02-07 09:50:46

*Thread Reply:* (This is just how we’re handling some stuff internally really - the idea is we can use our custom facet where needed and will pivot to standard facets if/when they start to exist for those use cases.)

Julien Le Dem (julien@apache.org)
2024-01-11 19:16:03

As discussed in the call today, I have updated the OpenLineage registry proposal. It also includes the contributions of @Sheeri Cabral (Collibra) and feedback from the review. If you are interested (in particular @Eric Veleker, @Ernie Ostic, @Sheeri Cabral (Collibra), @Jens Pfau, but others as well), please comment on the PR. I think we are close enough to implement a first version.

❤️ Jarek Potiuk, Michael Robinson, Paweł Leszczyński
Honey Thakuria (Honey_Thakuria@intuit.com)
2024-01-12 00:08:47

Hi everyone, we're trying to use OpenLineage for the Presto on Spark script below, but aren't getting CREATE or INSERT lifecycle events; we are only able to get DROP events.

Could anyone help us figure out how to get proper events for Presto on Spark? Any tweaks or config changes? cc @Kiran Hiremath @Athitya Kumar
```
drop table if exists schema_name.table1;
drop table if exists schema_name.table2;
drop table if exists schema_name.table3;

CREATE TABLE schema_name.table1 AS
SELECT * FROM (
  VALUES (1, 'a'), (2, 'b'), (3, 'c')
) AS temp1 (id, name);

CREATE TABLE schema_name.table2 AS
SELECT * FROM (
  VALUES (1, 'a'), (2, 'b'), (3, 'c')
) AS temp2 (id, name);

CREATE TABLE schema_name.table3 AS
SELECT * FROM schema_name.table1
UNION ALL
SELECT * FROM schema_name.table2;
```

Athitya Kumar (athityakumar@gmail.com)
2024-01-13 10:54:45

Hey team! 👋

We're seeing some instances where the openlineage spark listener class seems to be running long even after the spark job has completed.

While we debug the reason why it's long-running (a huge spark event due to partitions / JSON reads etc., which could be lagging the listener thread/JVM), we just wanted to see if there's a way we could ensure from the openlineage listener that none of the methods overridden from SparkListener executes for more than 2 mins (say)?

For example, does having a configurable timeout value (like spark.openlineage.timeout.seconds=120 spark conf) and having an internal Executor + Future.get() with timeout / google's SimpleTimeLimiter make sense for this complexity for any spark event handled by OL spark listener?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-15 04:07:58

*Thread Reply:* yes, this would definitely make sense
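For readers following along, the pattern being proposed, sketched language-agnostically in Python (the Spark integration itself is Java; the names and the 120-second budget are illustrative, not an existing OpenLineage option):
```
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=1)

def handle_with_timeout(handler, event, timeout_s=120):
    """Run a listener callback with an upper time bound so one pathological
    event cannot stall the whole listener bus."""
    future = _executor.submit(handler, event)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()  # best effort; an already-running task keeps its thread
        return None
```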

Athitya Kumar (athityakumar@gmail.com)
2024-01-14 22:51:40

Hey team.

We're observing that the openlineage spark listener runs long (sometimes even for a couple of hours) even though the spark job has completed. We've seen the pattern that this mostly happens for jobs with 3 levels of subqueries - is there a known issue for this, wherein a huge spark event object from the listener bus causes a big delay in the openlineage spark listener's event processing due to JVM lag or openlineage code etc?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-15 04:07:11

*Thread Reply:* there is no issue for this. just to make sure: did you disable serializing LogicalPlan and sending it via event?

Athitya Kumar (athityakumar@gmail.com)
2024-01-15 04:14:00

*Thread Reply:* @Paweł Leszczyński - Yup, we've set this conf: "spark.openlineage.facets.disabled": "[spark_unknown;spark.logicalPlan]" We've also seen logs from spark where it says that event took more than 10-15 mins to process: [INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerSQLExecutionEnd(54,1702566454593) by listener OpenLineageSparkListener took 614.914735426s.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-15 04:24:43

*Thread Reply:* yeah, that's interesting. are you able to verify whether this is related to event generation and not a backend issue (like a Marquez deadlock)?

Athitya Kumar (athityakumar@gmail.com)
2024-01-15 04:25:58

*Thread Reply:* Yup yup, we use HTTP transport which has a 30 seconds API GW timeout on our side - but we also tried with console transport type to rule out the possibility of a backend issue

Faced the same issue with console transport type too

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-15 06:32:28

*Thread Reply:* when running for a long time, does it succeed or not? Perhaps there is a cycle in the logical plan. Could you turn on and attach the debugFacet?

Athitya Kumar (athityakumar@gmail.com)
2024-01-15 06:45:02

*Thread Reply:* It runs for a long time and succeeds - for some jobs it's a matter of 1 hour, whereas we've seen jobs delayed by 4 hours as well.

@Paweł Leszczyński - How do we enable debugFacet? Any additional spark conf to add?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-15 06:47:16

*Thread Reply:* yes, spark.openlineage.debugFacet=enabled

Kacper Muda (kacper.muda@getindata.com)
2024-01-15 05:29:21

Hey, can I request a patch release that will include this fix ? If there is already an upcoming release planned let me know, maybe it will be soon enough 🙂

➕ Jakub Dardziński, Harel Shein, Maciej Obuchowski
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 16:01:22

*Thread Reply:* I’m bumping this request. I would love to include this https://github.com/OpenLineage/OpenLineage/pull/2373 as well

Michael Robinson (michael.robinson@astronomer.io)
2024-01-19 09:05:54

*Thread Reply:* Thanks, all. The release is authorized.

🙌 Jakub Dardziński
Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-15 07:24:23

Hello all! I want to try to help the community by testing some of the bugs I've been reporting, but also to try working with custom facets. I'm having trouble building the project. Is there any step-by-step document (and dependencies) on how to build it?

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-15 07:48:40

*Thread Reply:*

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-15 07:51:20

*Thread Reply:*
```
gradle --version

Gradle 8.5

Build time:  2023-11-29 14:08:57 UTC
Revision:    28aca86a7180baa17117e0e5ba01d8ea9feca598

Kotlin:      1.9.20
Groovy:      3.0.17
Ant:         Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM:         21.0.1 (Homebrew 21.0.1)
OS:          Mac OS X 14.2 aarch64
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-15 09:20:10

*Thread Reply:* I think easiest way would be to use Java 8 (it's still supported by Spark)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-15 09:23:30

*Thread Reply:* I recommend Azul Zulu, it works for mac M1 https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk#zulu

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-15 11:10:00

*Thread Reply:* Java downgraded to 8, thanks. But now, let's say I want to build the project (focusing on the Spark integration). What should I do?

I tried gradle build (in integration/spark) and got the same error. Am I missing any step here?

Mattia Bertorello (mattia.bertorello@booking.com)
2024-01-15 16:37:05

*Thread Reply:* If you're having issues with Java, I suggest using https://asdf-vm.com/ to ensure that Gradle doesn't accidentally use a different Java version.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-16 05:01:01

*Thread Reply:* @Rodrigo Maia I would try removing contents of ~/.m2/repository - this usually helps with this kind of error

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-16 05:01:12

*Thread Reply:* The only downside is that you'll have to redownload dependencies

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-17 05:22:09

*Thread Reply:* thank you for the ideas to solve the issue. I'll try that during the weekend 😄

Dinesh Singh (dinesh.r-singh@hpe.com)
2024-01-16 07:00:17

Hello team, I am new to column-level lineage. I would like to know the fastest and easiest means of creating column-level data/JSON. Which integration would be most helpful: Airflow, Spark, or a direct MySQL operator?

Dinesh Singh (dinesh.r-singh@hpe.com)
2024-01-16 07:04:36

*Thread Reply:* Previously I have worked with the DataHub integration with Spark to generate lineage. How do I skip DataHub and directly use OpenLineage adaptors? Below is the previous example:
```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("NetflixMovies") \
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23") \
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
    .config("spark.datahub.rest.server", "http://4.labs.net:8080") \
    .enableHiveSupport() \
    .getOrCreate()

print("Reading CSV File")
movies_df = spark \
    .read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("n_movies.csv")

movies_above8_rating = movies_df.filter(movies_df.rating >= 8.0)

print("Writing CSV File")
movies_above8_rating \
    .coalesce(1) \
    .write \
    .format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("movies_above_rating8.csv")
print("Completed")
spark.stop()
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-16 07:12:28

*Thread Reply:* Have you tried directly doing the same just using OpenLineage Spark listener?
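For comparison, the same session bootstrapped with the OpenLineage listener instead could look roughly like this (version pinned to 1.8.0 as an example; the console transport just prints events to driver logs so you can verify them before wiring up a backend, and the namespace is any label you choose):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("NetflixMovies")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.8.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .config("spark.openlineage.namespace", "my_namespace")
    .enableHiveSupport()
    .getOrCreate()
)
```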

Dinesh Singh (dinesh.r-singh@hpe.com)
2024-01-16 07:16:21

*Thread Reply:* I did try, no success yet!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-16 07:16:43

*Thread Reply:* Can you share the resulting event?

Dinesh Singh (dinesh.r-singh@hpe.com)
2024-01-16 07:17:06

*Thread Reply:* I need sometime to reproduce.

Dinesh Singh (dinesh.r-singh@hpe.com)
2024-01-16 07:17:16

*Thread Reply:* Is there any other example you can suggest ?

Fabio Manganiello (fabio.manganiello@booking.com)
2024-01-19 05:06:31

Hi channel, do we have a standard way to infer the type of the job from the OpenLineage events that we receive? Context: I'm building a product where we're expected to tell the users whether a particular job is an Airflow DAG, a dbt workflow, a Spark job etc. The (relatively) most reliable piece of data to extract this information is the producer string published in the event, but I've noticed that there's really no standard way of formatting it. After seeing some producer strings end in /dbt and /spark we thought "cool, we can just scrape the last slashed token of the producer URL and infer the type from there". Then we met the Airflow integration, which apparently publishes a producer in the format <https://github.com/apache/airflow/tree/providers-openlineage/1.3.0>.

Is the producer the only way to infer the type of a job from an event? If so, are there any efforts in standardizing its possible values, or maybe create a registry of producers readily available in the OL libraries? If not, what's an alternative way of getting this info?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-19 05:53:21

*Thread Reply:* We introduced JobTypeJobFacet https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/JobTypeJobFacet.json some time ago

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-19 05:54:39

*Thread Reply:* The initial idea was to distinguish streaming jobs from batch ones. Then, it got extended to have jobType like QUERY|COMMAND|DAG|TASK|JOB|MODEL, or integration, which can be SPARK|DBT|AIRFLOW|FLINK

Fabio Manganiello (fabio.manganiello@booking.com)
2024-01-19 05:56:30

*Thread Reply:* Thanks, that should definitely address my concern! Is it supposed to be filled explicitly (e.g. through env variables) or is it filled by the OL connector automatically? I've taken a look at several dumps of OL events but I can't recall seeing that facet being populated

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-19 06:13:17

*Thread Reply:* I introduced this for Flink only https://github.com/OpenLineage/OpenLineage/pull/2241/files as that is what I needed at that point. You would need to add it into Spark integration if you like

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:49:34

*Thread Reply:* JobTypeJobFacet(processingType="BATCH", integration="AIRFLOW", jobType="QUERY"),

would you consider the jobType as QUERY if airflow is running a task that filters a delta table and saves the output to a different delta table? it is a task as well in that case of course

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:00:04

Any suggestions on naming for Graph API sources from Outlook? I pull a lot of data from email attachments with Airflow. Generally I am passing a resource (email address), the mailbox, and a subfolder. From there I list messages and find attachments.

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:08:06

*Thread Reply:* What is the difference between mailbox and email address ? Could You provide some examples of all the parts provided?

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:22:42

*Thread Reply:* sure, so I definitely want to leave out the email address if possible (it is my name and I don't want it in OL metadata)

so we use the graph API credentials to open the mailbox of a resource (my work email), then we open the main Inbox folder, and from there we open folders/subfolders, which is important to track

hook = MicrosoftEmailHook(graph_conn_id="ms_email_api", resource="my@email", folder="client_name", subfolder="system_name")

```
def get_conn(self) -> MailBox:
    """Connects to the Office 365 account and opens the mailbox"""
    credentials = (self.graph_conn.login, self.graph_conn.password)
    account = Account(
        credentials, auth_flow_type="credentials", tenant_id=self.graph_conn.host
    )
    if not account.is_authenticated:
        account.authenticate()
    return account.mailbox(resource=self.resource)

@cached_property
def mailbox(self) -> MailBox:
    """Returns the authenticated account mailbox"""
    return self.get_conn()

@cached_property
def current_folder(self) -> Folder:
    """Property to lazily open and return the current folder"""
    if not self._current_folder:
        self.open_folder()
    return self._current_folder

def open_folder(self) -> Folder:
    """Opens the specified folder and sets it as the current folder"""
    inbox = self.mailbox.inbox_folder()
    f = inbox.get_folder(folder_name=self.folder)
    self._current_folder = (
        f.get_folder(folder_name=self.subfolder) if self.subfolder else f
    )

def get_messages(
    self,
    query: Query | None = None,
    download_attachments: bool = False,
) -> list[Message]:
    """Lists all messages in a folder. Optionally filters them based on an OData
    query

    Args:
        query: OData query object
        download_attachments: Whether attachments should be downloaded

    Returns:
        A Message object or list of Message objects. A tuple of the Message and
        Attachment is returned if return_attachments is True
    """
    messages = [
        message
        for message in self.current_folder.get_messages(
            limit=self.limit,
            batch=self.batch,
            query=query,
            download_attachments=download_attachments,
        )
    ]
    return messages
```
ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:30:27

*Thread Reply:* I am not sure if something like this would be the namespace or the resource name, but I would consider a "source" of data to be this outlook://{self.folder}/{self.subfolder}

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:30:44

*Thread Reply:* then we list attachments with filters (subject, received since and until)

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:33:56

*Thread Reply:* Ok, got it. I'm not sure if avoiding to include Your email is the best choice, as it's the best identifier of the attachments' location 🙂 It's just my opinion, but i think i would go with something like:

namespace: email://{email_address} name: {folder}/{subfolder}

OR

namespace: email name: {email_address}/{folder}/{subfolder}

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:34:44

*Thread Reply:* I'm not sure if outlook://{self.folder}/{self.subfolder} is descriptive enough, as You can't really tell who received this attachment. But of course, it all depends on Your use case

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:35:14

*Thread Reply:* here is the output of message.build_url() <https://graph.microsoft.com/v1.0/users/name@company.com/folder>

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:36:08

*Thread Reply:* which is a built in method in the O365 library I am using. and yeah I think i need the resource name no matter what

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:36:28

*Thread Reply:* it makes it clear it is my personal work email address and not the service account email account..

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:37:52

*Thread Reply:* I think I would not base the namespace on the email provider like outlook or gmail, because it's something that can easily change over time, yet it does not influence the actual content of the email. If You suddenly transfer to Google from Microsoft, and move all Your folder/subfolder structure, does it really matter for You, from a lineage perspective?

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:38:57

*Thread Reply:* This one i think is more debatable 😄

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:39:16

*Thread Reply:* agree

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:39:59

*Thread Reply:* the folder name in outlook corresponds to the GCS bucket name the files are loaded to as well, so I would want that to be nice and clear in marquez. it is all based on the client name which is the "folder" for emails

Kacper Muda (kacper.muda@getindata.com)
2024-01-19 10:44:15

*Thread Reply:* If Your specific email folder structure is somehow mapped to gcs, then You can try looking at the naming spec for gcs and somehow mix it in. The namespace in gcs contains the bucket name, so maybe in Your case it's wise to put it there as well, so You have separate namespaces for each client? At the end of the day, You are using it and it should match Your needs

ldacey (lance.dacey2@sutherlandglobal.com)
2024-01-19 10:54:44

*Thread Reply:* yep that is correct, it kind of falls apart with SFTP sources because the namespace includes the host which is not very descriptive (just some domain names).

we try to keep things separate between client namespaces though in general, since they are unrelated and different teams work on them

Abdallah (abdallah@terrab.me)
2024-01-19 10:11:36

Hello,

I hope you are doing all well.

We've observed an issue with symlinks generated by Spark integration, specifically when dealing with a Glue Catalog Table. The dataset namespace ends up being the same as the dataset name. This issue arises from the method used to parse the dataset's URI.

I would like to discuss this issue with the contributors.

https://github.com/OpenLineage/OpenLineage/pull/2379

Athitya Kumar (athityakumar@gmail.com)
2024-01-22 02:48:44

Hey team

Does the openlineage-spark listener support parsing Kafka sinks from the logical plan, to publish lineage events? If yes, since which openlineage-spark version has it been supported?

We have a pattern like df.write.format("KAFKA").option("TOPIC", "topicName") wherein we don't see any openlineage events - we're using openlineage-spark:1.1.0 currently

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-22 03:10:18

*Thread Reply:* However, this feature has not been improved nor developed for a long time (like 2 years), and surely has some gaps like https://github.com/OpenLineage/OpenLineage/issues/372

Michael Robinson (michael.robinson@astronomer.io)
2024-01-22 15:42:55

@channel We released OpenLineage 1.8.0, featuring:
• Flink: support Flink 1.18 #2366 @HuangZhenQiu
• Spark: add Gradle plugins to simplify the build process to support Scala 2.13 #2376 @d-m-h
• Spark: support multiple Scala versions LogicalPlan implementation #2361 @mattiabertorello
• Spark: Use ScalaConversionUtils to convert Scala and Java collections #2357 @mattiabertorello
• Spark: support MERGE INTO queries on Databricks #2348 @pawel-big-lebowski
• Spark: Support Glue catalog in iceberg #2283 @nataliezeller1
Plus many bug fixes and a change in Spark!
Thanks to all the contributors, with a special shoutout to new contributor @Mattia Bertorello, who had no fewer than 5 contributions in this release!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.8.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.7.0...1.8.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🚀 Jakub Dardziński, alexandre bergere, Harel Shein, Mattia Bertorello, Maciej Obuchowski, Julian LaNeve
harsh loomba (hloomba@upgrade.com)
2024-01-22 16:43:32

Hello team, I see the following issue when I install apache-airflow-providers-openlineage==1.4.0

harsh loomba (hloomba@upgrade.com)
2024-01-22 16:43:56

*Thread Reply:* @Jakub Dardziński @Willy Lulciuc

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-22 17:00:17

*Thread Reply:* Have you modified your local files? Please try to uninstall and reinstall the package

harsh loomba (hloomba@upgrade.com)
2024-01-22 17:10:47

*Thread Reply:* I didn't do anything, all other packages are fine

harsh loomba (hloomba@upgrade.com)
2024-01-22 17:12:42

*Thread Reply:* let me reinstall

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-22 17:30:10

*Thread Reply:* it’s odd because it should be line 142 actually, are you sure you’re installing it properly?

harsh loomba (hloomba@upgrade.com)
2024-01-22 17:33:33

*Thread Reply:* actually it could be my local

harsh loomba (hloomba@upgrade.com)
2024-01-22 17:33:39

*Thread Reply:* I'm trying something

jayant joshi (itsjayantjoshi@gmail.com)
2024-01-25 06:42:58

Hi Team! I want to create column-level lineage using pyspark. I followed the https://openlineage.io/blog/openlineage-spark/ blog step by step. When I run the next command, docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1, I can see the marquez UI at <http://localhost:3000>, but it continuously shows "Something went wrong while fetching search" and never gives any result. In the cmd: "Error occurred while trying to proxy request /api/v1/search/?q=p&sort=NAME&limit=100 from localhost:3000 to <http://marquez-api:5000/> (EAI_AGAIN) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)". For the dataset I am reading a CSV file. Attaching a screenshot for reference. Is there any solution?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-25 07:48:45

*Thread Reply:* First, can you try some newer version? Latest is 0.44.0, 0.19.1 is over two years old

Tamizharasi Karuppasamy (tamizhsangami@gmail.com)
2024-01-29 04:13:49

*Thread Reply:* Hi.. I am trying to generate column-level lineage for a csv file on AWS S3 using Spark and OpenLineage. I follow https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md and https://openlineage.io/docs/integrations/spark/quickstart_local for reference, but I get the below error. Kindly help me resolve it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-29 05:00:53

*Thread Reply:* looks like you have a typo in versioning @Tamizharasi Karuppasamy

jayant joshi (itsjayantjoshi@gmail.com)
2024-01-29 06:07:41

*Thread Reply:* Hi @Maciej Obuchowski, as per your suggestion I used 0.44.0 but the error still persists: "[HPM] Error occurred while proxying request localhost:3000/api/v1/events/lineage?limit=20&before=2024-01-29T23:59:59.000Z&after=2024-01-29T00:00:00.000Z&offset=0&sortDirection=desc to <http://marquez-api:5000/> [EAI_AGAIN] (<https://nodejs.org/api/errors.html#errors_common_system_errors>)". Even in docker, "marquez-api" is not running. Sharing the log for your reference:

```
2024-01-29 16:35:29 marquez-db  | 2024-01-29 11:05:29.547 UTC [39] FATAL:  password authentication failed for user "marquez"
2024-01-29 16:35:29 marquez-db  | 2024-01-29 11:05:29.547 UTC [39] DETAIL:  Role "marquez" does not exist.
2024-01-29 16:35:29 marquez-db  | Connection matched pg_hba.conf line 95: "host all all all md5"
2024-01-29 16:35:29 marquez-api | ERROR [2024-01-29 11:05:29,553] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
2024-01-29 16:35:29 marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:443)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.connect(Driver.java:297)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:772)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:700)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:499)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:155)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.<init>(JdbcConnectionFactory.java:75)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190)
2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:78)
2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:33)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:109)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:51)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.Application.run(Application.java:94)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:63)
2024-01-29 16:35:29 marquez-api | INFO [2024-01-29 11:05:29,556] marquez.MarquezApp: Stopping app...
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-29 06:08:29

*Thread Reply:* Please delete all docker volumes related to Marquez and try again

👍 jayant joshi
jayant joshi (itsjayantjoshi@gmail.com)
2024-01-30 02:23:06

*Thread Reply:* Thanks @Maciej Obuchowski

Willy Lulciuc (willy@datakin.com)
2024-01-30 13:47:41

*Thread Reply:* @jayant joshi have you tried running Marquez via gitpod? see https://github.com/MarquezProject/marquez?tab=readme-ov-file#try-it

Michael Robinson (michael.robinson@astronomer.io)
2024-03-18 09:02:34

*Thread Reply:* @jayant joshi did deleting all volumes work for you, or did you discover another solution? We see users encountering this error from time to time, and it would be helpful to know more.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-25 09:43:16

@channel Our first London meetup is happening next Wednesday, Jan. 31, at Confluent's offices in Covent Garden. Click through to the Meetup page to sign up and view an up-to-date agenda, featuring talks by @Abdallah, Kirill Kulikov at Confluent, and @Paweł Leszczyński! https://openlineage.slack.com/archives/C01CK9T7HKR/p1704736501486239

❤️ Maciej Obuchowski, Abdallah, Eric Veleker, Rodrigo Maia, tati
Michael Robinson (michael.robinson@astronomer.io)
2024-01-29 12:51:10

@channel The 2023 Year in Review special issue of OpenLineage News is here! This issue contains an overview of the exciting events, developments and releases in 2023, including the Airflow Provider and Static Lineage. To get the newsletter directly in your inbox each month, sign up here.

❤️ Ross Turk, Harel Shein, tati, Rodrigo Maia, Maciej Obuchowski, Jarek Potiuk, Mattia Bertorello, Sheeri Cabral (Collibra)
🙌 Francis McGregor-Macdonald, Harel Shein, tati, Maciej Obuchowski, Jarek Potiuk
Ross Turk (ross@rossturk.com)
2024-01-29 13:15:16

*Thread Reply:* love it! so much activity!

:gratitude_thank_you: Michael Robinson
➕ Harel Shein, Maciej Obuchowski
Balachandru S (balachandru2468@gmail.com)
2024-01-30 01:16:01

Hi Team, I am trying to run a pyspark code through the spark-submit command. Find the command I am trying below. While running the below command I am facing the issue "java.net.UnknownHostException: marquez-api". Any solution to solve this issue? Note that through jupyter lab I can run the same code and create a lineage in the Marquez UI.

"spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" --packages "io.openlineage:openlineagespark:1.7.0" --conf "spark.openlineage.transport.type=http" --conf "spark.openlineage.transport.url= http://marquez-api:5000" --conf "spark.openlineage.namespace=sparkintegration" pyspark_etl.py".

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 02:06:27

*Thread Reply:* You've probably been running jupyter lab with docker compose along with the Marquez containers?

Try changing marquez-api to host.docker.internal

Balachandru S (balachandru2468@gmail.com)
2024-01-30 02:43:43

*Thread Reply:* Do you mean I should try passing the below one? "spark.openlineage.transport.url=http://host.docker.internal:5000"

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 02:47:25

*Thread Reply:* yes

Balachandru S (balachandru2468@gmail.com)
2024-01-30 02:47:40

*Thread Reply:* The above one is also throwing the same error.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 02:48:00

*Thread Reply:* how do you instantiate marquez?

Balachandru S (balachandru2468@gmail.com)
2024-01-30 02:49:56

*Thread Reply:* Using the below two sources:
git clone https://github.com/OpenLineage/OpenLineage

docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.44.0

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 02:50:56

*Thread Reply:* so you need to pass an additional -p 5000:5000 flag

Balachandru S (balachandru2468@gmail.com)
2024-01-30 03:02:02

*Thread Reply:* Do you mean like this: --conf "spark.openlineage.transport.port=5000" in the spark-submit command?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 03:04:03

*Thread Reply:* no, in docker run command

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 03:04:59

*Thread Reply:* you need to expose port 5000 for marquez-api container

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 03:26:49

*Thread Reply:* and if you expose it, localhost:5000 should be available too

Balachandru S (balachandru2468@gmail.com)
2024-01-30 04:03:53

*Thread Reply:* Find the attached localhost 5000 & 5001 port results. Note that while running the same code in the jupyter notebook, I could see lineage on the Marquez UI. I am facing the issue only when running the code through spark-submit.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-30 04:04:48

*Thread Reply:* yeah but are you running jupyter from console or in docker?

Balachandru S (balachandru2468@gmail.com)
2024-01-30 04:20:34

*Thread Reply:* I am using the jupyter one which comes with Open Lineage docker container

Balachandru S (balachandru2468@gmail.com)
2024-01-31 04:25:13

*Thread Reply:* Hi @Jakub Dardziński, good afternoon. Any solution on this one ?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-31 04:43:13

*Thread Reply:* this is some workshop copy-paste tutorial https://github.com/OpenLineage/workshops/blob/main/spark/column-lineage-oct-2022.ipynb I did in the past.

you can try the following code snippet to see if the docker network is set up properly:
```
import requests

marquez_url = "http://host.docker.internal:5000"  # this may depend on your local setup
if requests.get("{}/api/v1/namespaces".format(marquez_url)).status_code == 200:
    print("Marquez is OK.")
else:
    print("Cannot connect to Marquez")
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-31 04:44:05

*Thread Reply:* the other parts of the workshop refer to an old version of the connector, but please verify first that everything with your docker network is fine before running a real Spark job

Balachandru S (balachandru2468@gmail.com)
2024-01-31 04:53:29

*Thread Reply:* Sure, thanks.

Balachandru S (balachandru2468@gmail.com)
2024-01-31 04:54:55

*Thread Reply:* From your code, I could see marquez-api is running successfully at "http://marquez-api:5000". Find attached screenshot.

Balachandru S (balachandru2468@gmail.com)
2024-01-31 04:57:08

*Thread Reply:* And I could create a lineage for spark in the Jupyter notebook successfully. But when I am submitting a job through spark-submit I am facing an issue "ERROR EventEmitter: Could not emit lineage w/ exception io.openlineage.client.OpenLineageClientException: java.net.UnknownHostException: marquez-api".

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-31 04:57:44

*Thread Reply:* great. do you run spark-submit from within the same docker container?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-31 04:58:08

*Thread Reply:* you can open terminal tab within jupyter to run it

Balachandru S (balachandru2468@gmail.com)
2024-01-31 05:02:05

*Thread Reply:* Sure, thanks. Let me try running it.

Balachandru S (balachandru2468@gmail.com)
2024-01-31 07:52:22

*Thread Reply:* Thanks @Paweł Leszczyński, it ran successfully in the jupyter terminal.

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:50:58

Hi Team, I'm trying out OpenLineage with databricks and I'm not seeing expected results, need help to figure out what's the issue

I'm following this quickstart guide: https://openlineage.io/docs/integrations/spark/quickstart_databricks/

more details in the thread 🧵

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:52:09

*Thread Reply:* I executed this piece of code as per the quickstart guide but the result generated on marquez is not as expected:
```
spark.createDataFrame([
    {'a': 1, 'b': 2},
    {'a': 3, 'b': 4}
]).write.mode("overwrite").saveAsTable("default.temp")
```

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:53:58

*Thread Reply:* the quickstart guide shows this example and it produces a result with an output node, but when I run this in databricks I see no output node generated.

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:56:02

*Thread Reply:* same thing happens when I run this piece of code

```sourcetable = "mayurtable" destinationtable = "onkartable"

Read the data from the source table into a DataFrame

sourcedf = spark.sql(f"SELECT id FROM {sourcetable}")

Show the source DataFrame

source_df.show()

Write the data from the source DataFrame to the destination table

sourcedf.write.mode("overwrite").saveAsTable(destinationtable)`` where I'm transferring data from one table to another and would expect a lineage betweenmayurtableandonkartable` few events gets captured while running this piece of code but none of them has any output node

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:57:51

*Thread Reply:* as a result, onkar_table was never recorded as a dataset, hence the lineage between mayur_table and onkar_table was not recorded either

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:58:27

*Thread Reply:* I'm using OpenLineage jar version 1.7.0 also tried this out with version 1.8.0 but had no luck

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 05:59:17

*Thread Reply:* any help would be appreciated, not sure if this could be a configuration issue

Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-30 06:06:40

*Thread Reply:* are you using UC catalog?

🙏 Mayur Singal
Rodrigo Maia (rodrigo.maia@manta.io)
2024-01-30 06:07:45

*Thread Reply:* I've reported an issue a while ago about saveAsTable(destination_table) when using Unity

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 06:12:52

*Thread Reply:* Umm, yes UC is enabled on my instance but I'm performing this operation on default hive_metastore catalog

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 06:23:59

*Thread Reply:* just to add some more context, running this query via notebook also doesn't generate any lineage

%sql insert into onkar_table select id from mayur_table;

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-30 07:11:52

*Thread Reply:* any relevant logs @Mayur Singal? Might be regular issue with databricks where they replace OSS Spark class with their own implementation silently

Mayur Singal (mayur.s@deuexsolutions.com)
2024-01-30 07:30:42

*Thread Reply:* I had a similar thought. I checked the logs; there is no relevant information there, and I don't see any errors either.

Mayur Singal (mayur.s@deuexsolutions.com)
2024-02-01 05:10:34

*Thread Reply:* Hi Team, please do let me know if I need to open any ticket for this or could this be a configuration issue on my side!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 05:28:05

*Thread Reply:* Can you try to set the log level to debug and run it again?
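One simple way to do that from the notebook, as a starting point (this raises the global Spark log level, so expect noisy driver logs):
```
# Make the OpenLineage listener's debug logs visible in the driver logs
spark.sparkContext.setLogLevel("DEBUG")
```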

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-05 08:24:09

*Thread Reply:* I can try that and let you know

👍 Maciej Obuchowski, Mayur Singal
Balachandru S (balachandru2468@gmail.com)
2024-01-30 08:14:51

Hi Team, I want to view column-level lineage for pyspark code. I have sample pyspark ETL code and I tried to run it with OpenLineage. In Marquez I could see a high-level lineage diagram like the attached screenshot. Is there any dependency I need to add to get column-level lineage? Thanks.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-30 08:55:29

*Thread Reply:* > Is any dependency I need to add to get a column level lineage ?. No, you need to wait for Marquez release that contains this PR 🙂

👍 Balachandru S
jayant joshi (itsjayantjoshi@gmail.com)
2024-01-30 08:59:00

*Thread Reply:* I am also very much interested in seeing column-level lineage for a pyspark script in a visual graph like the one in the PR.

RaghavanAP (raghavan.panneerselvam@wavicledata.com)
2024-01-30 09:03:06

*Thread Reply:* Is there any tentative date for the next release which includes column-level lineage via the Spark integration?

Willy Lulciuc (willy@datakin.com)
2024-01-30 13:42:37

*Thread Reply:* The Marquez team is working to have an RC by the end of next month for users to try! https://marquezproject.slack.com/archives/C01E8MQGJP7/p1706636124360629?thread_ts=1706621358.017329&cid=C01E8MQGJP7

Balachandru S (balachandru2468@gmail.com)
2024-01-30 23:44:51

*Thread Reply:* Can we get a visual representation of column-level lineage using any of the other data consumers, like Amundsen, Egeria, or Apache Atlas?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-30 12:00:06

@channel Our first London meetup, at Confluent's offices in London, is tomorrow! https://openlineage.slack.com/archives/C01CK9T7HKR/p1706193796047049

❤️ Jarek Potiuk, Maciej Obuchowski, Mattia Bertorello, Willy Lulciuc, Abdallah, Harel Shein
🙌 Jarek Potiuk, Maciej Obuchowski, Mattia Bertorello, Abdallah
Athitya Kumar (athityakumar@gmail.com)
2024-01-30 13:40:41

Hey team. Is column/attribute-level lineage supported for input/output Kafka topics in the OpenLineage Flink listener?

Dheeraj (dheeraj.athrey@gmail.com)
2024-01-31 01:28:10

Hello all, I am using Airflow and dbt. I want to know if there is any tool using OpenLineage that actually provides the following feature.

If I pick a job J1, I want to be able to see the real time job start/run/finish status of all the jobs J1 is dependent on in the lineage.

Any other tool/framework that would help me achieve this is also helpful.

Thank you

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-31 02:13:41

*Thread Reply:* what is a job in this case? dbt run/Airflow DagRun?

Dheeraj (dheeraj.athrey@gmail.com)
2024-01-31 02:37:41

*Thread Reply:* By job, I meant an airflow dag.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-31 02:38:42

*Thread Reply:* @tati might say more as project maintainer but you could take a look into https://github.com/astronomer/astronomer-cosmos it transforms dbt project into Airflow DAG and plays really nice with OL too

RaghavanAP (raghavan.panneerselvam@wavicledata.com)
2024-01-31 02:14:31

Hi Team, I am running a pyspark script through the spark-submit command and am facing the error "Could not emit lineage". Can you please help with this error?
spark-submit command: spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" --packages "io.openlineage:openlineage-spark:1.7.0" --conf "spark.openlineage.transport.type=http" --conf "spark.openlineage.transport.url=http://host.docker.internal:5000" --conf "spark.openlineage.namespace=spark_integration" --conf "spark.pyspark.python=python" pyspark_etl.py

Error Screenshot:

Mattia Bertorello (mattia.bertorello@booking.com)
2024-01-31 08:31:22

*Thread Reply:* Hi,

The host of Marquez is probably wrong in "spark.openlineage.transport.url=<http://host.docker.internal:5000>". Check on your machine where Marquez is exposed. Thanks

Balachandru S (balachandru2468@gmail.com)
2024-01-31 08:17:46

Hi Team, I need to configure the AWS CLI in the Jupyter lab (which comes with the OpenLineage docker container) to read and write AWS S3 files through pyspark code. For this I am running the command "sudo ./aws/install". While running this command it prompts me for a password, and I am not sure what the password is here. Can you please help me with this? I just need the sudo password for the default jupyter lab (which comes with the OpenLineage docker container). Thanks.

Mattia Bertorello (mattia.bertorello@booking.com)
2024-01-31 08:26:46

*Thread Reply:* Hi, Did you try to pass this environment variable -e GRANT_SUDO=yes ? https://jupyter-docker-stacks.readthedocs.io/en/latest/using/common.html#permission-specific-configurations

Balachandru S (balachandru2468@gmail.com)
2024-01-31 08:33:26

*Thread Reply:* While composing up the OpenLineage docker-compose.yml, it showed the path to access jupyter lab, and I am accessing it through that path. I didn't run any command externally. Find the attached screenshot.

Mattia Bertorello (mattia.bertorello@booking.com)
2024-01-31 08:36:41

*Thread Reply:* Mine was a suggestion to add this variable GRANT_SUDO=yes in the docker compose if you need sudo access. And in general it's not an OpenLineage problem; you should look into the documentation of the jupyter docker image

Balachandru S (balachandru2468@gmail.com)
2024-01-31 08:38:07

*Thread Reply:* Sure, will do that. Thanks,

👍 Mattia Bertorello
Balachandru S (balachandru2468@gmail.com)
2024-02-02 04:18:20

*Thread Reply:* Hi @Mattia Bertorello, good afternoon. I just inspected the notebook container, and there I could see "GRANT_SUDO=yes". Even after passing this it's still asking for a password. Find the attached screenshot. Thanks.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-31 10:46:34

@channel This is happening this evening! https://openlineage.slack.com/archives/C01CK9T7HKR/p1706193796047049

🎉 Harel Shein, Jakub Dardziński, Maciej Obuchowski, Ernie Ostic, harsh loomba
Abhinav Ajith (abhinavajith0968@gmail.com)
2024-02-01 01:37:31

Hi Team

Abhinav Ajith (abhinavajith0968@gmail.com)
2024-02-01 01:38:27

I am trying to run OpenLineage for a SQL query. Any idea on how to use the openlineage-sql package?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:44:04

*Thread Reply:* Hey, could you please tell me more about your use case?
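In the meantime, a minimal sketch of the openlineage-sql Python bindings, assuming the parse entry point from the package (attribute details can vary by version):
```
from openlineage_sql import parse

# Extract input/output tables from a SQL statement
meta = parse(["INSERT INTO target_table SELECT id, name FROM source_table"])
print(meta.in_tables)   # tables the statement reads from
print(meta.out_tables)  # tables the statement writes to
```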

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:01:35

Hi everyone. I am trying to integrate openlineage with apache airflow version > 2.7. I followed this guide: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html. This guide mentions that no changes are required in the dag file. I am using BashOperator in the dag file but openlineage does not seem to emit the events. What am I missing?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:02:52

*Thread Reply:* BashOperator can only pass lineage if you set inlets or outlets manually
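A minimal sketch of what that looks like (table values below are made up; see the manual-lineage doc for the full example):
```
from airflow.lineage.entities import Table
from airflow.operators.bash import BashOperator

# Lineage for BashOperator must be declared by hand via inlets/outlets
t1 = BashOperator(
    task_id="copy_data",
    bash_command="echo 'copying...'",
    inlets=[Table(cluster="c1", database="db1", name="source_table")],
    outlets=[Table(cluster="c1", database="db1", name="target_table")],
)
```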

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:25:37

*Thread Reply:* Thanks for the reply, Jakub. I created a DAG with the example code from this URL: https://openlineage.io/docs/integrations/airflow/manual. But still no luck generating events.

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:35:22

*Thread Reply:* I am using airflow version 2.7.1. In my airflow.cfg file, there was no [openlineage] section and I had to add it. Are these two related in any way? @Jakub Dardziński

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:43:14

*Thread Reply:* I ran the bash example DAG you linked and I can confirm I’m getting the lineage

Could you please confirm you’re sending events to Marquez at all? Can you see Airflow jobs created in Marquez but without inputs/outputs?

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:45:21

*Thread Reply:* I am not sending the events to Marquez. I have an endpoint defined on localhost where I am trying to receive the events

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:45:38

*Thread Reply:* I see, so are you receiving any events?

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:46:09

*Thread Reply:* No. I am not receiving anything in that endpoint.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:46:54

*Thread Reply:* Would you want to share your [openlineage] section?

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 04:47:50

*Thread Reply:* Sure.

[openlineage] transport = '{"type": "http", "url": "http://localhost:8082/event"}'

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 04:50:02

*Thread Reply:* and what endpoint is available on your local server? if it’s /event then you should set the following:

[openlineage]
transport = '{"type": "http", "url": "http://localhost:8082", "endpoint": "event"}'

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 05:15:12

*Thread Reply:* Yes the endpoint is /event and I made this change. Still no luck. Not only the http transport but console transport is also not working.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 05:24:23

*Thread Reply:* Could you please check what’s in the Airflow UI under Admin > Plugins?

listeners should be there under OpenLineageProviderPlugin

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 05:26:11

*Thread Reply:* @Suhas Shenoy if you're using Astro Runtime image locally then you additionally need to set env variable OPENLINEAGE_DISABLED=false due to bug in some versions of the runtime

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 05:31:24

*Thread Reply:* There is no listeners attribute in OpenLineageProviderPlugin. And I am not using Astro Runtime image. I installed airflow following this documentation: https://airflow.apache.org/docs/apache-airflow/2.7.1/installation/installing-from-pypi.html

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 05:46:54

*Thread Reply:* How do I make that listener attribute set to OpenLineageListener?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 05:51:48

*Thread Reply:* do you see OpenLineageProviderPlugin then?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 05:52:42

*Thread Reply:* can you try setting OPENLINEAGE_DISABLED=false and use OPENLINEAGE_URL=<http://localhost:8082> instead of setting it in config?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 05:53:42

*Thread Reply:* I'd rather bet airflow.cfg is not read or it's in the wrong path

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 05:57:29

*Thread Reply:* This is the snapshot of my Plugins. I will also try with the configs which you mentioned.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 06:14:17

*Thread Reply:* ah, on Airflow 2.7 it does not show the listeners row yet

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 06:15:39

*Thread Reply:* if you do have access to Airflow’s console you could check what’s the output of below: ```from airflow.configuration import conf

print(conf.get('openlineage', 'transport'))```

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 06:26:52

*Thread Reply:* This was the output I got '{"type": "http", "url": "<http://localhost:8082>", "endpoint": "event"}'

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 09:12:34

*Thread Reply:* I am running Apache airflow in a virtual environment. Will it matter in any way?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-01 09:14:13

*Thread Reply:* have you tried OPENLINEAGE_DISABLED=false and OPENLINEAGE_URL=<http://localhost:8082> ? does virtual environment mean docker?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 09:15:45

*Thread Reply:* eventually try checking what’s the output from: ```from airflow.providers.openlineage.plugins.openlineage import _is_disabled

print(_is_disabled())```

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 09:16:42

*Thread Reply:* Yes I tried both the things.

No not docker. I have created virtual environment using python -m venv venv_name

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 09:18:09

*Thread Reply:* The output of the code block is false.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-01 09:35:54

*Thread Reply:* that means you have the listener enabled

what I can think of is to set the airflow logging level to debug and see what the logs say

☝️ Maciej Obuchowski
Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 23:21:36

*Thread Reply:* This is the log from scheduler terminal. I found 3 logs related to openlineage

```DEBUG - Loading EntryPoint(name='provider_info', value='airflow.providers.openlineage.get_provider_info:get_provider_info', group='apache_airflow_provider') from package apache-airflow-providers-openlineage

DEBUG - Importing entry_point plugin openlineage

DEBUG - Creating module airflow.macros.OpenLineageProviderPlugin```

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-01 23:32:23

*Thread Reply:* And this is the log related to openlineage I found in the task1: https://pastebin.com/qbeYFTuJ

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-02 06:13:36

*Thread Reply:* Hi Team. Any reason why the extractor is failing to extract the metadata?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-02 06:16:49

*Thread Reply:* Are you running the same DAG as in https://openlineage.io/docs/integrations/airflow/manual/?

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-02 06:18:18

*Thread Reply:* Yes.

priya narayana (n.priya88@gmail.com)
2024-02-01 06:59:24

Hi Team! When I run a BigQuery job in Dataproc with OpenLineage version 1.4.1, I see events which have anonymous tables, and I cannot see a proper lineage event in the console with inputs and outputs. Can you please tell me which version of OpenLineage can help me get the right table names in the inputs and outputs of the COMPLETE event?

priya narayana (n.priya88@gmail.com)
2024-02-02 10:28:23

*Thread Reply:* Can someone guide me here

priya narayana (n.priya88@gmail.com)
2024-02-02 10:28:38

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński @Michael Robinson

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-01 13:02:45

any thoughts on how sources from a REST API should be organized, or if they even should be set up as a dataset in OpenLineage? I am querying various endpoints and saving the results to jsonl files on GCS. those get read and ultimately loaded as parquet files in various delta tables (certain unnested columns become a completely separate table)

https://something.something.com/api/v2/ticket_audits

https://developer.zendesk.com/api-reference/ticketing/tickets/ticket_audits/

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-01 13:06:14

for the input dataset: namespace would be "https://something.something.com", resource would be "/api/v2/ticket_audits", not sure about the name, just "ticket_audits"?

maybe the uri would include the entire URL with query parameters, or those parameters could be a separate custom facet.

output dataset would be the GCS bucket and file?

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 15:42:01

@channel The latest issue of OpenLineage News is available now, featuring a rundown of the recent releases, updates to the Airflow Provider, events, proposals, and more. To get the newsletter directly in your inbox each month, sign up here.

🚀 Jakub Dardziński, Harel Shein, tati
RaghavanAP (raghavan.panneerselvam@wavicledata.com)
2024-02-02 04:09:43

Hi Team, My objective is to configure AWS with OpenLineage on Windows OS, to read a file from S3. Can you please help me with the required steps or share any documents or links for reference?

Balachandru S (balachandru2468@gmail.com)
2024-02-02 06:05:22

Hi Team, I need to execute a sudo command in the jupyter notebook terminal (which comes with OpenLineage). Find the configuration details in the attached screenshot; I tried changing the password through the command shown in the attached screenshot. After changing the password, I could use the created password for the jupyter notebook login, but not for the jupyter terminal login. I could see the username is jovyan for both. Can anyone help me get a password for the jupyter terminal? Thanks.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-02 15:13:10

@channel This month’s TSC meeting is next Thursday the 8th at 10am PT. On the tentative agenda:
• announcements
• recent releases
• Coming soon: simplified job hierarchy in Spark @Maciej Obuchowski
• Flink integration updates @Maciej Obuchowski
• open discussion
• more (TBA)
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

👍 Maciej Obuchowski, Mattia Bertorello
Peter Huang (huangzhenqiu0825@gmail.com)
2024-02-08 12:48:48

*Thread Reply:* @Maciej Obuchowski I am not able to attend the meeting this time. Please help update the group on our agreement with the Flink community about the Flink lineage listener APIs.

👍 Maciej Obuchowski
priya narayana (n.priya88@gmail.com)
2024-02-05 04:08:31

Hi Team, when I run jobs in Dataproc, a "Target host is not specified" error comes up when I set --properties 'dataproc:dataproc.lineage.enabled=true'. Can you tell me what settings I am missing?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-05 04:42:48

*Thread Reply:* hey, we don't develop Dataproc integration directly, this seems like Google service issue and should be resolved by their support

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-05 10:56:52

HI! Is anybody here working with AWS Glue + OpenLineage (Spark)?

Michael Robinson (michael.robinson@astronomer.io)
2024-02-05 13:29:05

*Thread Reply:* There's been some progress on support for Glue recently: https://github.com/OpenLineage/OpenLineage/pull/2283

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-06 07:32:07

*Thread Reply:* Glue Spark overall is working, but I think Glue Data Catalog is not

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-07 05:37:06

*Thread Reply:* I'll try to configure an instance with OL latest version to validate the results in terms of the jobs and the catalog. Thank you

Suraj Gupta (suraj.gupta@atlan.com)
2024-02-06 02:20:47

Do we have any list/documentation of all the connectors supported by Spark + OpenLineage?

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-07 12:35:55

Hi Team, I tried to set up Marquez on my Windows PC using Docker Desktop, referring to this document: https://openlineage.io/getting-started/. Once set up, it does not start the "marquez-api", giving an error: "org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"". Do you have any idea how to fix this?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-07 12:43:13

*Thread Reply:* could you please check marquez-db logs?

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-07 12:48:09

*Thread Reply:* It has this FATAL: password authentication failed for user "marquez" DETAIL: Role "marquez" does not exist.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-07 12:49:56

*Thread Reply:* no error related to init script?

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-07 13:24:37

*Thread Reply:* Probably you might ask this.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-07 13:26:53

*Thread Reply:* as you're on Windows you may have the wrong CRLF setting in git. you can either fix the script manually or change the git settings

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-08 00:51:24

*Thread Reply:* I set git config --global core.autocrlf false. Then the line endings remain as LF. With this, the above error was gone. But it has an authentication error as below.

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-08 10:59:25

*Thread Reply:* Isn't this some code issue in this repo @Jakub Dardziński https://github.com/MarquezProject/marquez

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-08 11:18:40

*Thread Reply:* did you try docker compose down and up again?

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-08 11:56:03

*Thread Reply:* Yes. didn't work.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-07 13:47:08

@channel This month's TSC meeting, open to all, is tomorrow! Note the recent additions to the agenda 👀. https://openlineage.slack.com/archives/C01CK9T7HKR/p1706904790913259

👍 Ernie Ostic, Jakub Dardziński, Maciej Obuchowski, Mattia Bertorello
❤️ Harel Shein, alexandre bergere
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-08 09:25:41

HIRING! Hello OpenLineage community members! Collibra is hiring 2 Software Engineers (one regular, one senior) in the Czech Republic, for our Lineage ETL team. These jobs are open because one engineer is leaving to hike the Pacific Coast Trail, and another is a backfill (an engineer left a different lineage team and an engineer from the ETL team is moving to that team).

I am the Product Manager for this team, so you’d be working with me daily. Ask me anything (in a thread here or a DM). Creating an OpenLineage consumer is our next project starting later this month :D

https://www.linkedin.com/feed/update/urn:li:activity:7161361969311084544/

🔥 Maciej Obuchowski, Kacper Muda, Michael Robinson, Harel Shein, Mattia Bertorello, Paweł Leszczyński
:flag_cz: Maciej Obuchowski, Harel Shein, Paweł Leszczyński
🥾 Mattia Bertorello
Michael Robinson (michael.robinson@astronomer.io)
2024-02-08 09:47:28

*Thread Reply:* Exciting news all around! Thanks for the update.

Harel Shein (harel.shein@gmail.com)
2024-02-08 09:53:30

*Thread Reply:* Amazing news @Sheeri Cabral (Collibra) 🎉

Harel Shein (harel.shein@gmail.com)
2024-02-08 09:53:44

*Thread Reply:* Cross posting to #jobs

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-08 10:00:31

*Thread Reply:* Oh! I forgot about that channel!

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-08 18:49:02

any general recommendations for naming buckets in GCS? I have a separate bucket per client (unrelated to each other).

then i normally have folders like:

source_name/raw (csv/json data)
source_name/bronze (append only delta table)
source_name/silver (no duplicates)
source_name/gold (additional filters/aggregations)

I kind of like having an OL dataset per raw file so I can see the schema and some metadata about what happened to that file in Marquez, but it makes my Namespace dataset view kind of messy. Perhaps I should separate the raw data into a separate bucket (meaning a separate namespace)? I know that Dataplex zones are based on "raw" and "curated" so maybe that makes sense?

it seems like having a separate bucket per source might be a bit much since we have many hundreds of unique sources.

Polly
2024-02-09 11:30:45

@Harel Shein has a polly for you! Votes in this polly are anonymous 🔒.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-15 08:30:17

*Thread Reply:* FYI the EST times aren’t always correct, e.g. 1300 am. (I skipped over some 12 am EST times because I didn’t notice it really meant noon, 12 pm)

Harel Shein (harel.shein@gmail.com)
2024-02-16 13:37:31

*Thread Reply:* oh, oops. sorry about that 🤦‍♂️

Harel Shein (harel.shein@gmail.com)
2024-02-09 11:33:30

Hi all @here, We are considering changing the times for our monthly TSC meetings to allow more members of the community to participate. Please vote according to your preferred meeting days and times. Thank you!

🔥 Paweł Leszczyński, Mateusz Kozioł, Michael Robinson, Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-16 13:30:35

*Thread Reply:* So, are we switching?

Harel Shein (harel.shein@gmail.com)
2024-02-16 13:40:03

*Thread Reply:* looks like Wednesday 9:30am PST / 12:30pm EST / 6:30pm CET / 11:00pm IST is the winner!

👍 Maciej Obuchowski
Wajdi Fathallah (wajdi@siffletdata.com)
2024-02-13 08:14:30

Hi team, I really like the way you define URI in https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md and would like to know:

  1. any recommendations to extend this list and support other data sources?
  2. do you plan to support assets other than Tables and Jobs? for instance BI dashboards, AI models etc.?
👍 Athitya Kumar, alexandre bergere
Julien Le Dem (julien@apache.org)
2024-02-13 19:51:22

*Thread Reply:* Hello Wajdi, nice to see you!

  1. If you want to add a missing datasource to that document, you should open an issue to propose it. The main requirement is they need to be unique and canonical. Once we agree on the format you can open a PR and add it to the spec.
  2. So far the idea is to model AI models and dashboards as datasets. An AI model is an asset with metadata that is produced by a job. It will have specific facets. Let us know what you think. You are welcome to push that discussion further if you want to add those to the spec.
Wajdi Fathallah (wajdi@siffletdata.com)
2024-02-14 09:29:00

*Thread Reply:* Thanks for your answer @Julien Le Dem

Balachandru S (balachandru2468@gmail.com)
2024-02-14 01:06:05

Hi Team, We are getting column-level lineage for pyspark code with the help of OpenLineage. We are interested to know the limitations here: is there any limitation in creating column-level lineage for pyspark code, like lines of code, number of input data sources, number of output data sources, or pyspark operations (filter, join, etc.) in OpenLineage? We have gone through the OpenLineage documentation, but from the documentation we could only get the supported spark versions and data source types. Thanks.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-14 05:48:43

*Thread Reply:* Spark 2.4 does not support column-level lineage.

> number of output data sources
In the Spark model, each output creates an additional action that goes through the general Spark compilation and execution model - we follow that, so you'll end up with multiple events.

One thing we're not supporting for CLL (...maybe yet?) is mapping to a Scala object using the DeserializeToObject logical plan, as we break the CLL chain there.

> We are interested to know the limitation here
If anything, it depends on the type of data sources. Some LogicalPlans (like streaming ones) are not supported, as well as external ones. We're working on an API for external connectors to implement, so we can have stable support for them.

Balachandru S (balachandru2468@gmail.com)
2024-02-14 08:31:42

*Thread Reply:* Thanks @Maciej Obuchowski.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-15 08:31:10

*Thread Reply:* follow-up - what versions of spark - if any - create column-level lineage?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-15 09:15:02

*Thread Reply:* 3.0+ 🙂 For now, it can depend on the connector though - Paweł is working on a better way to do it. https://github.com/OpenLineage/OpenLineage/pull/2272
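When it is produced, column-level lineage lands on the output dataset as a columnLineage facet, shaped roughly like this (dataset and field names below are made up):
```
# Sketch of the columnLineage facet on an output dataset, as a dict
column_lineage = {
    "fields": {
        "order_total": {  # output column
            "inputFields": [
                {"namespace": "gs://my-bucket", "name": "raw/orders", "field": "amount"}
            ]
        }
    }
}
```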

Michael Robinson (michael.robinson@astronomer.io)
2024-02-14 09:26:47

@channel The Marquez Project, the reference implementation for OpenLineage, has released an RC featuring column lineage support among other significant web UI improvements. For those unfamiliar with our sister project, Marquez is a highly scalable backend, API and UI for metadata management that implements the OpenLineage spec. In addition to being a powerful, platform-agnostic solution for metadata management, it's always been a great way to get started with OpenLineage and explore what the capabilities are. Now it's an even better reflection of those capabilities and, in particular, the strengths of the Spark integration and Airflow Provider. Whether you're already a seasoned Marquez user or you're new to the project, please consider trying out the RC. It would be very helpful to know what your experience is like and about any issues you run into. Thanks!

Resources
Release: https://github.com/MarquezProject/marquez/releases/tag/0.45.0-rc.1
Changelog: https://github.com/MarquezProject/marquez/blob/0.45.0-rc.1/CHANGELOG.md
Commit history: https://github.com/MarquezProject/marquez/compare/0.43.1...0.45.0-rc.1
Maven: https://oss.sonatype.org/#nexus-search;quick~marquez
PyPI: https://pypi.org/project/marquez-python/
Docker: https://hub.docker.com/u/marquezproject

🚀 Mattia Bertorello, Jakub Dardziński, Andy Alseth, Maciej Obuchowski, Willy Lulciuc, Minkyu Park, Sheeri Cabral (Collibra)
Michael Robinson (michael.robinson@astronomer.io)
2024-02-14 11:07:48
Michael Robinson (michael.robinson@astronomer.io)
2024-02-14 11:08:02
Michael Robinson (michael.robinson@astronomer.io)
2024-02-14 11:08:16
Willy Lulciuc (willy@datakin.com)
2024-02-14 13:38:31

*Thread Reply:* For those interested in learning more about Marquez, I hold office hours every Tue 9:30AM PST! You can join here 👉 https://astronomer.zoom.us/j/84548968341?pwd=Asb8rpLuhbSalGP9i4BYHd1UXfPQe1.1

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-15 10:11:11

*Thread Reply:* looks sharp. it seems to have fixed some issues I might have had in 0.44 which did not show all connections

I did an airflow backfill job which redownloaded all files from a SFTP (191 files) and each of those are a separate OL dataset. in this view I clicked on a single file, but because it is connected to the "extract" airflow task, it shows all of the files that task downloaded as well (dynamic mapped tasks in Airflow)

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-15 10:16:07

*Thread Reply:* but day to day there is normally just one file downloaded so that should be a cleaner view in the future I assume? that input dataset (SFTP file) should only refer to the output dataset on GCS

[sftp_file] --> [GCS file in landing folder] --> [GCS file in raw folder (renamed), only added if the checksum doesn't exist] --> [bronze delta table on GCS] --> [silver delta table on GCS] --> [gold delta table on ADLS gen2]

I am just considering whether I should proceed with file level datasets if it will make the UI too messy and complex. we do not control changes in the schema from clients, so on one hand it is nice to track

Zacay Daushin (zacayd@octopai.com)
2024-02-14 09:32:43

does this also parse python code for column-level lineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-14 09:34:22

*Thread Reply:* Marquez is a reference backend for OpenLineage. What Michael highlighted is its ability to visualize column lineage.

Parsing Python code is, I think, undoable in a sustainable way. What we’re aiming at is to use hook-level lineage, which is work in progress. This work happens in the Airflow OpenLineage provider.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-15 08:32:26

*Thread Reply:* (and just to clarify - does the airflow openlineage provider generate column-level lineage info?)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-15 08:47:24

*Thread Reply:* it does, with exceptions. We even have that info now on the main repo page: https://github.com/OpenLineage/OpenLineage?tab=readme-ov-file#integration-matrix 🙂

Minkyu Park (minkyu.park.200@gmail.com)
2024-02-14 18:51:09

Hi all, greetings again from my new account with personal email. It's been a while to see you all 😄

👋 Jakub Dardziński, Paweł Leszczyński, Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-02-14 19:48:48

*Thread Reply:* Good to see you here @Minkyu Park ❤️

❤️ Minkyu Park
Michael Robinson (michael.robinson@astronomer.io)
2024-02-14 19:50:45

*Thread Reply:* welcome back, @Minkyu Park!

❤️ Minkyu Park
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-15 05:01:24

*Thread Reply:* Hello @Minkyu Park 🙂

❤️ Minkyu Park
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-02-15 08:33:21

*Thread Reply:* Welcome back Minkyu!

(Does open source need some presentations on ‘how best to contribute to open source’ including ‘use personal emails/github whenever possible because your job may not last forever’?)

🤣 Minkyu Park
👍 Minkyu Park
Minkyu Park (minkyu.park.200@gmail.com)
2024-02-15 18:30:46

Are extractors from the OL airflow integration still available in the airflow 2.7+ OL provider, or should they be implemented separately in Airflow?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-15 18:32:04

*Thread Reply:* They’re still available, you can register them with the [openlineage] extractors option (or the AIRFLOW__OPENLINEAGE__EXTRACTORS env var) - see: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/configurations-ref.html#extractors

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-15 18:32:45

*Thread Reply:* however, we encourage using the get_openlineage_facets_on_** methods just like we do within all providers now

Minkyu Park (minkyu.park.200@gmail.com)
2024-02-15 18:35:46

*Thread Reply:* In order to use get_openlineage_facets_on_** with the DefaultExtractor, the operator itself should implement the method, right? https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/extractors/base.py#L115

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-15 18:36:52

*Thread Reply:* correct, you can find multiple examples, like https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/operators/athena.py#L208
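A stripped-down sketch of the pattern (operator and dataset names below are made up):
```
from airflow.models import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset

class MyCopyOperator(BaseOperator):
    def execute(self, context):
        ...  # do the actual work

    def get_openlineage_facets_on_complete(self, task_instance) -> OperatorLineage:
        # Picked up by the provider's DefaultExtractor, no extractor registration needed
        return OperatorLineage(
            inputs=[Dataset(namespace="postgres://host:5432", name="public.source_table")],
            outputs=[Dataset(namespace="postgres://host:5432", name="public.target_table")],
        )
```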

Minkyu Park (minkyu.park.200@gmail.com)
2024-02-15 18:37:12

*Thread Reply:* 👍

Jackson Goerner (jgoerner@squareup.com)
2024-02-15 19:24:14

👋 Hello! I'm trying to integrate our airflow service with open lineage, and while the standard transports work, I'm having trouble using a custom transport (I'm sure its a silly mistake). Details are in 🧵

✅ Jackson Goerner
Jackson Goerner (jgoerner@squareup.com)
2024-02-15 19:24:30

*Thread Reply:* The reason I need a custom transport is we have an api which requires some authentication with AWS Cognito, which when executing on the box seems to work fine (I can instantiate the transport with config, and emit events, and that all works fine.)

From what I understand the type transport configuration should be the fully qualified class name, which I've done. Using a standard file transport worked fine, and I saw the files being created as tasks were executed. However, replacing my openlineage.yml file with:

```
transport:
  type: lineage_utils.api_transport.Transport
  env: dev
  endpoint: lineage/data
```
has stopped working. The file containing the class is /usr/share/airflow/plugins/lineage_utils/api_transport.py, and the first thing that file does is just touch a log file, which it isn't even doing, so I think the issue is something to do with the fully qualified class name.

The airflow worker service has the following standard configuration:

```
[Service]
Environment="PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/share/airflow/venv/bin:/usr/local/bin"
Environment="PYTHONPATH=${PYTHONPATH}:/usr/share/airflow/plugins"
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/share/airflow/venv/bin/airflow celery worker -D -l /var/log/airflow/airflow-worker.log --stdout /var/log/airflow/airflow-worker-stdout.log --stderr /var/log/airflow/airflow-worker-stderr.log -q default
Restart=on-failure
RestartSec=10s
```
which should add the plugins directory to the PYTHONPATH (and is working for the dags themselves). Changing the qualified path to plugins.lineage_utils.api_transport.Transport also didn't work.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-16 03:46:01

*Thread Reply:* Any logs/exceptions you can share?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-16 03:52:39

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/python/openlineage/client/utils.py#L23 it should log warning if there are issues with importing the class

Jackson Goerner (jgoerner@squareup.com)
2024-02-19 19:11:28

*Thread Reply:* Thanks for pointing towards the logs - the issue was with my (faulty) interpretation of the docs here. I just copied the base transport/config definitions and implemented the methods, rather than subclassing the base classes, which caused issues. Everything is all working now! Thanks a bunch.
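
For anyone hitting the same wall, a minimal sketch of a custom transport that subclasses the base classes as the docs intend; the class and field names here are made up, and the exact base-class attribute names may differ between client versions:

```
# lineage_utils/api_transport.py (hypothetical module)
from openlineage.client.transport.transport import Config, Transport

class ApiConfig(Config):
    def __init__(self, env: str, endpoint: str):
        self.env = env
        self.endpoint = endpoint

    @classmethod
    def from_dict(cls, params: dict) -> "ApiConfig":
        return cls(env=params["env"], endpoint=params["endpoint"])

class ApiTransport(Transport):
    kind = "lineage_utils.api_transport.ApiTransport"
    config = ApiConfig

    def __init__(self, config: ApiConfig) -> None:
        self.config = config

    def emit(self, event) -> None:
        # Authenticate (e.g. against AWS Cognito) and POST the
        # serialized event to self.config.endpoint here.
        ...
```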

🚀 Jakub Dardziński
Simran Suri (mailsimransuri@gmail.com)
2024-02-19 01:26:05

Hi, I am currently setting my Spark app name explicitly in the Spark code and building lineage from it. During this process, I've noticed certain transformations of the job name in the event: hyphens are changed into underscores, and uppercase letters are converted into an underscore followed by the lowercase letter. I would appreciate more detailed information on these naming conventions. Are there specific scenarios or considerations we should be mindful of when setting the Spark app name, to ensure the same job name appears in events?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-19 08:37:48

Hey all - with @Damien Hawes, as we get closer to releasing the Spark integration with support for Scala 2.13, we are thinking of requiring the Scala version in the artifact name. This is standard procedure for libraries partly written in Scala - for example the Iceberg integration for Spark, or Spark itself.

The reason we ask is because the Scala 2.12 compatible version currently does not have the suffix: for OpenLineage 1.8 the Maven artifact coordinates are

```
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark</artifactId>
    <version>1.8.0</version>
</dependency>
```

and for 1.9 would be

```
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark_2.12</artifactId>
    <version>1.9.0</version>
</dependency>
```

Damien Hawes (damien.hawes@booking.com)
2024-02-19 08:40:37

*Thread Reply:* It should be noted that the 2.13 variant would have these coordinates:

```
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark_2.13</artifactId>
    <version>1.9.0</version>
</dependency>
```
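
In PySpark terms, jobs pulling the integration via spark.jars.packages would switch to the suffixed coordinates; a sketch, assuming the 1.9.0 release ships with the suffix:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # Scala-suffixed artifact from 1.9.0 onwards
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.0")
    .getOrCreate())
```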

Balachandru S (balachandru2468@gmail.com)
2024-02-21 03:53:45

Hi Team, I ran PySpark code in the Jupyter notebook (which comes with OpenLineage) and successfully created lineage diagrams in Marquez. But I have a doubt: instead of the Jupyter notebook that comes with OpenLineage, can I use my locally installed Spark to create lineage? Thanks.

Kacper Muda (kacper.muda@getindata.com)
2024-02-21 04:14:11

*Thread Reply:* Hey, yes it's possible to use OpenLineage Spark integration outside of Jupyter notebook. Check out the docs on that integration.

Balachandru S (balachandru2468@gmail.com)
2024-02-21 04:26:51

*Thread Reply:* Thanks @Kacper Muda

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-21 04:43:19

*Thread Reply:* @Balachandru S it's as easy as moving the OL Spark listener jar to Spark's jars folder and configuring Spark as usual,

or something like:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.8.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .getOrCreate())
```

Balachandru S (balachandru2468@gmail.com)
2024-02-21 09:12:19

*Thread Reply:* Thanks @Rodrigo Maia

Abdallah (abdallah@terrab.me)
2024-02-21 06:17:11

Hi team, I hope you are all doing well. After the recent contributions, some of which addressed significant issues, may I kindly request a new release, please? Thank you very much for your time.

➕ Abdallah, Sophie LY, Yannick Libert, Tristan GUEZENNEC -CROIX-, Michael Robinson, Jakub Dardziński, Maciej Obuchowski, Damien Hawes
Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-21 06:53:36

*Thread Reply:* That would be awesome

Abdallah (abdallah@terrab.me)
2024-02-21 11:37:46

*Thread Reply:* @Maciej Obuchowski Are you considering a release soon?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 12:20:15

*Thread Reply:* @Michael Robinson is working on it 🙂

Abdallah (abdallah@terrab.me)
2024-02-21 12:20:45

*Thread Reply:* Owh ! Thank you !

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-21 06:51:43

Bug on Unity Catalog Support on OpenLineage for Databricks:

I'm struggling to generate any output facets when running PySpark reading from and writing to Unity Catalog on Databricks, while input facets are created properly with correct symlinks:

Example code:

```
df = spark.read.table("rodrigo.default.brazil_universities")
df.write.mode("overwrite").saveAsTable("rodrigo.default.brazil_universities_temp1")
```

Lineage JSON payload for all events (start and complete; it never emits a running event):

```
..."outputs":[]}
```

What I noticed from the error logs:

If I'm writing to the default schema:

```
24/02/21 11:50:49 INFO PlanUtils: apply method failed with org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver
at com.databricks.unity.UCSDriver$Manager.$anonfun$currentScopeId$3(UCSDriver.scala:131)
at scala.Option.getOrElse(Option.scala:189)
at com.databricks.unity.UCSDriver$Manager.currentScopeId(UCSDriver.scala:131)
at com.databricks.unity.UCSDriver$Manager.currentScope(UCSDriver.scala:134)
at com.databricks.unity.UnityCredentialScope$.currentScope(UnityCredentialScope.scala:100)
at com.databricks.unity.UnityCredentialScope$.getSAMRegistry(UnityCredentialScope.scala:120)
at com.databricks.unity.SAMRegistry$.registerSAM(SAMRegistry.scala:322)
at com.databricks.unity.SAMRegistry$.registerDefaultSAM(SAMRegistry.scala:338)
at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.defaultTablePath(SessionCatalog.scala:1200)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.defaultTablePath(ManagedCatalogSessionCatalog.scala:991)
at io.openlineage.spark3.agent.lifecycle.plan.catalog.AbstractDatabricksHandler.getDatasetIdentifier(AbstractDatabricksHandler.java:92)
at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61)
........
```

If I'm writing to another schema (not the default):

```
24/02/21 11:47:16 INFO PlanUtils: apply method failed with org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver
(identical stack trace to the one above)
..........
```

Has anyone successfully used Unity Catalog and produced the output facet?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:07:03

*Thread Reply:* I would call this feature request rather than bug 🙂

💡 Abdallah
Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-21 09:03:10

*Thread Reply:* I really wish I could help with that

Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-02-21 13:18:10

*Thread Reply:* I have a fix for this!

Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-02-21 13:18:45

*Thread Reply:* Let me create an issue on github, I'm happy to make the change in a PR as well

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 13:20:28

*Thread Reply:* Great to hear that!

Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-02-21 13:21:30

*Thread Reply:* Should I do issue + PR or just straight to PR?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 13:43:34

*Thread Reply:* Straight to PR is good for me, assuming you don't plan to break other things as part of the solution 🙂

Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-02-21 13:51:37

*Thread Reply:* It is a fairly small change and we've been running this patched version on databricks for a few weeks now, so cautiously optimistic that it doesn't break anything else 😄

🙌 Jakub Dardziński, Maciej Obuchowski, Rodrigo Maia
Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-02-21 21:58:40

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2453

:gratitude_thank_you: Michael Robinson
Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-13 12:43:16

*Thread Reply:* @Matthew Paras Hi! I'm still struggling with empty outputs on Databricks with the latest OL version.

24/03/13 16:35:56 INFO PlanUtils: apply method failed with org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver

Any idea on how to solve this?

Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-13 12:53:44

*Thread Reply:* Any Databricks runtime version I should test with?

Matthew Paras (matthewparas2020@u.northwestern.edu)
2024-03-13 15:35:41

*Thread Reply:* Interesting - I think we're running on 13.3 LTS. We also haven't upgraded to the official OL version, still using the patched one that I built

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-21 09:03:49

💡 Idea for the next group meeting: a workshop would come in handy to help newbies start contributing to the project 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 09:13:41

*Thread Reply:* I think this is what you're looking for https://www.youtube.com/watch?v=SXebBTVcY4Q

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-21 09:15:47

*Thread Reply:* What kind of guidance would you like to see?

There are several places that offer some help:
https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md
https://openlineage.io/docs/development/developing/
Each integration also has its own README that covers basic development, e.g. https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md

and above link from Maciej 🙂

Max Zheng (mzheng@plaid.com)
2024-02-21 13:52:07

Hi, I'm running into an odd "deadlock" with small datasets (e.g. < 300 rows input/output) with Spark/OpenLineage with a Kafka sink. The problem doesn't seem deterministic, but it's frequent enough that we've had to disable the listener on all of our jobs. When it happens, from the driver logs and Spark UI it seems like the job hangs forever (there's a running query but no jobs are generated in the Spark UI, and no driver logs are generated at all). It doesn't seem like disk/memory/CPU utilization is high when this happens (< 20% memory utilization from our Prometheus metrics)

Are there any debug logs I can enable to get a better sense of what's happening? Any advice on how to debug this? Thanks

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 14:28:55

*Thread Reply:* What do you exactly mean by deadlock? Does it happen only with Kafka sink? How are you running the job, is it Databricks, EMR, some other provider, or some custom setup? Is it PySpark or Scala job? What Spark API are you using - RDDs, Data frame, Spark SQL?

Generally any logging we do is to driver logs, so it does not help that they are missing 😞 One thing that would be helpful is serialized logical plan of the job.

Quickly looking at the code, I don't see obvious candidate where something went wrong

Max Zheng (mzheng@plaid.com)
2024-02-21 14:36:59

*Thread Reply:* We are running on EMR 6.10.1, I've only been able to reproduce with Kafka sink but doing some more thorough testing with console sink now

• PySpark job
• DataFrame API

By deadlock I mean there's a running query in the Spark UI, but there's no active job/stages - the job just seems to be idle

Max Zheng (mzheng@plaid.com)
2024-02-21 14:48:18

*Thread Reply:*

```
== Parsed Logical Plan ==
SaveIntoDataSourceCommand org.apache.hudi.Spark32PlusDefaultSource@6e5f1c4f, Map(hoodie.copyonwrite.record.size.estimate -> 57, hoodie.insert.shuffle.parallelism -> 1500, path -> {path}, hoodie.datasource.write.precombine.field -> _autogenerated_primary_key, hoodie.bootstrap.index.enable -> false, hoodie.metadata.enable -> true, hoodie.metrics.graphite.metric.prefix -> lake_production, hoodie.index.type -> SIMPLE, hoodie.datasource.write.operation -> upsert, hoodie.metrics.reporter.type -> GRAPHITE, hoodie.datasource.write.recordkey.field -> _autogenerated_primary_key, hoodie.table.name -> {table_name}, hoodie.datasource.write.table.type -> COPY_ON_WRITE, hoodie.datasource.write.hive_style_partitioning -> true, hoodie.metrics.graphite.host -> {host}, hoodie.datasource.write.table.name -> {table_name}, hoodie.populate.meta.fields -> false, hoodie.metrics.graphite.port -> 9109, hoodie.metrics.on -> true, hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.NonpartitionedKeyGenerator, hoodie.upsert.shuffle.parallelism -> 1500, hoodie.datasource.write.partitionpath.field -> ), Overwrite
+- Project [c1#3, c2#4, a#5, c3#6, c4#7, c5#8, c6#9, uuid(Some(-217463645859800419)) AS _autogenerated_primary_key#17]
   +- Relation [c1#3,c2#4,a#5,c3#6,c4#7,c5#8,c6#9] parquet
```

Here's the parsed logical plan for the stuck running query (it's been 8 hours and it typically finishes in < 1 minute, so I'm assuming this will never finish). The job is just rewriting a parquet input from S3 as an Apache Hudi table in S3.

👍 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 15:08:34

*Thread Reply:* Ah, by Kafka sink you mean OpenLineage writing events to Kafka, not Spark writing to Kafka.

Max Zheng (mzheng@plaid.com)
2024-02-21 15:09:31

*Thread Reply:* Correct, sorry for the ambiguity 😅

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 15:20:53

*Thread Reply:* Would be great if you could check that with console transport. If it does not fix the issue, my guess is that something weird happens with Hudi, as we don't support it explicitly

👍 Max Zheng
Max Zheng (mzheng@plaid.com)
2024-02-21 15:54:10

*Thread Reply:* Sounds good, will do

I'm confused about how a listener could affect the running job ... I wonder if it's something in onJobStart: https://github.com/OpenLineage/OpenLineage/blob/1.8.0/integration/spark/app/src/ma[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 15:56:32

*Thread Reply:* I'm also not sure, but a deadlock sure sounds like something that theoretically could happen - SaveIntoDataSourceCommandVisitor for example uses createRelation if it can't recognize a particular relation type

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 15:56:43

*Thread Reply:* Which it does not with Hudi

Max Zheng (mzheng@plaid.com)
2024-02-21 16:00:48

*Thread Reply:* Got it hmm

Max Zheng (mzheng@plaid.com)
2024-02-21 17:19:26

*Thread Reply:* Same issue with spark.openlineage.transport.type console

Max Zheng (mzheng@plaid.com)
2024-02-21 17:54:28

*Thread Reply:* Looking at the thread dump of the driver it seems like spark-listener-group-shared is running

```
org.apache.hudi.DataSourceOptionsHelper$.<init>(DataSourceOptions.scala:731)
org.apache.hudi.DataSourceOptionsHelper$.<clinit>(DataSourceOptions.scala)
org.apache.hudi.DataSourceReadOptions$.<init>(DataSourceOptions.scala:143)
org.apache.hudi.DataSourceReadOptions$.<clinit>(DataSourceOptions.scala)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:75)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:68)
io.openlineage.spark.agent.lifecycle.plan.SaveIntoDataSourceCommandVisitor.apply(SaveIntoDataSourceCommandVisitor.java:139)
io.openlineage.spark.agent.lifecycle.plan.SaveIntoDataSourceCommandVisitor.apply(SaveIntoDataSourceCommandVisitor.java:46)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:94)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:85)
io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:75)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$$Lambda$4382/1107261117.apply(Unknown Source)
java.util.Optional.map(Optional.java:215)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:67)
io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:39)
io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$23(OpenLineageRunEventBuilder.java:453)
io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder$$Lambda$4380/2129506005.apply(Unknown Source)
...
```

If an event listener is stuck like this, does it prevent the next job from starting?

Max Zheng (mzheng@plaid.com)
2024-02-21 19:08:51

*Thread Reply:* It seems like another thread and the Spark listener thread are both trying to access a singleton from Hudi in the stuck state. This seems like the problem to me though I am not familiar with Scala

```
org.apache.hudi.DataSourceWriteOptions$.<init>(DataSourceOptions.scala:400)
org.apache.hudi.DataSourceWriteOptions$.<clinit>(DataSourceOptions.scala)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:141)
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) => holding Monitor(org.apache.spark.sql.execution.command.ExecutedCommandExec@213524349})
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
...
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
```

Seems to be stuck on https://github.com/apache/hudi/blob/release-0.12.2/hudi-spark-datasource/hudi-spar[…]k-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

(I've asked in the Hudi Slack about this)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 06:00:53

*Thread Reply:* So yeah - the approach from our side would be to implement a HudiRelationVisitor that refrains from calling createRelation, because it can utilize some unique data present in Hudi's DefaultSource.

Looking at the Hudi code, it seems we can get the destination path from options: https://github.com/apache/hudi/blob/3a97b01c0263c4790ffa958b865c682f40b4ada4/hudi-[…]spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala However, the various path operations might require calling the remote filesystem, so we would need to test whether we can safely call, for example, TablePathUtils.getTablePath.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 06:02:58

*Thread Reply:* Also not exactly sure why it deadlocks on reading config options? I can't exactly see where it breaks, since the code line numbers don't align with main anymore 🙂

Max Zheng (mzheng@plaid.com)
2024-02-22 12:31:37

*Thread Reply:* I suspect it's because the two singletons being loaded reference each other, i.e. DataSourceWriteOptions and DataSourceOptionsHelper.

I did manage to work around this by doing this dumb fake read at Spark session initialization:

```
try:
    spark.read.format("hudi").load("dummy")
except Exception:
    pass
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:33:01

*Thread Reply:* maybe, but why does that happen reliably?

Max Zheng (mzheng@plaid.com)
2024-02-22 12:34:06

*Thread Reply:* I have no idea, but over 3 runs it was always the same lines in the thread dump

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-02-22 06:25:40

Is it possible to integrate a SQL Server Integration Services (SSIS) data pipeline with OpenLineage/Marquez? If so, please give me some references or guidelines on how to implement it.

Damien Hawes (damien.hawes@booking.com)
2024-02-22 06:28:23

*Thread Reply:* Oh. It's been a while (> 10 years) since I last used SSIS (SQL Server 2008R2 and SQL Server 2012), however, if you're able to obtain the SQL queries it executes (assuming regular SQL transforms), you'd be able to run it through the openlineage-sql parser.
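
To illustrate the suggestion, a minimal sketch using the openlineage-sql Python bindings; the query is a made-up example, and the exact SqlMeta attribute names may differ between versions:

```
from openlineage_sql import parse

meta = parse(["INSERT INTO dw.dim_customer SELECT id, name FROM staging.customer"])
print(meta.in_tables)   # tables the statement reads from
print(meta.out_tables)  # tables the statement writes to
```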

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-22 06:31:09

*Thread Reply:* It is possible; however, in Airflow, for instance, such support has not been added yet.

Damien Hawes (damien.hawes@booking.com)
2024-02-22 06:43:05

*Thread Reply:* Additionally, you could parse the DTSX file (DTSX is an XML-like format) to extract the data flow out of it, and adapt it to the OpenLineage format.

Damien Hawes (damien.hawes@booking.com)
2024-02-22 06:43:23

*Thread Reply:* Though, importantly, this would probably be a Job Event and not a Run Event.

Athitya Kumar (athityakumar@gmail.com)
2024-02-22 08:10:57

Hey team. We have a use-case where we would like to leverage the transports of openlineage (console, http, kafka etc) & enhance/customise the spark metrics/events published by the OL spark listener.

For example, the OL Spark listener currently publishes lineage-related info like inputs, outputs, column-level lineage etc. - we want to add more job/task-level metrics that are available from the raw Spark bus events. What'd be the best way to go about this - to add our custom logic and publish custom fields in the OL event payload to the OL transports?

Kacper Muda (kacper.muda@getindata.com)
2024-02-22 08:13:08

*Thread Reply:* Hey, have you checked the docs about Custom Facets? They can help you attach any custom logic you might have, including "more job/task-level metrics etc", to an OpenLineage event.

Athitya Kumar (athityakumar@gmail.com)
2024-02-22 08:17:02

*Thread Reply:* @Kacper Muda - Yup, I checked the above doc. But I thought it explains more of the "how" from a schema/POJO perspective.

I was more interested in where the best place would be to make these CustomFacet and logic changes. Should it be via extending the OL Spark listener class and overriding its methods, or is there some other way to inject our custom business logic to be used by OL when it prepares the events?

👍 Kacper Muda
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 08:32:58

*Thread Reply:* @Athitya Kumar https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2[…]/src/main/java/io/openlineage/spark/api/CustomFacetBuilder.java

Athitya Kumar (athityakumar@gmail.com)
2024-02-22 08:51:37

*Thread Reply:* @Maciej Obuchowski - Is the recommendation to use this CustomFacetBuilder by extending/overriding OL spark listener's methods like onTaskEnd / onJobEnd etc? I remember seeing something regarding ServiceLoader and being able to "inject" our own logic earlier - but didn't get to explore that completely

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 08:59:28

*Thread Reply:* @Athitya Kumar the idea is to implement OpenLineageEventHandlerFactory in your own code, implementing the methods where you add your own CustomFacetBuilders, then provide it in a separate JAR for the ServiceLoader, with a META-INF file.

You can see an example in the tests: an implementation that adds TestRunFacet, and its META-INF file

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:01:02

*Thread Reply:* your CustomFacetBuilder can specify on which Spark events it listens to

Michael Robinson (michael.robinson@astronomer.io)
2024-03-04 13:28:12

*Thread Reply:* @Athitya Kumar can you tell us if this resolved your issue?

Athitya Kumar (athityakumar@gmail.com)
2024-03-06 01:30:32

*Thread Reply:* @Michael Robinson - Yup, it's resolved for event types that are already being emitted by OpenLineage - but we have some events like StageCompleted / TaskEnd etc. where we don't send events currently, and where we'd like to plug in our CustomFacets

https://openlineage.slack.com/archives/C01CK9T7HKR/p1709298185120219?thread_ts=1709297395.323109&cid=C01CK9T7HKR

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-06 12:57:53

*Thread Reply:* @Athitya Kumar can you store the facets somewhere (like OpenLineageContext) and send them with the COMPLETE event later?

Monisha SL (monisha.sl@philips.com)
2024-02-22 08:30:17

Is it possible to integrate OpenLineage with Glue jobs written in PySpark? If yes, please help me with reference documents.

✅ Rodrigo Maia, Monisha SL
Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-22 08:48:43

*Thread Reply:* It is, but there is no support for the Glue catalog. Host the OL jar in S3, put the Spark configs in your PySpark code, and point the Glue job's "Extra JARs" configuration to your jar file in S3.
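
A hedged sketch of those Spark configs inside a Glue PySpark script; the Marquez URL and namespace are placeholders, and the listener jar itself is supplied through the Glue job's "Extra JARs" setting:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # "console" would dump events to the logs (CloudWatch for Glue);
    # an HTTP transport sends them to a Marquez instance instead
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://my-marquez-host:5000")
    .config("spark.openlineage.namespace", "glue-jobs")
    .getOrCreate())
```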

Monisha SL (monisha.sl@philips.com)
2024-02-26 09:03:35

*Thread Reply:* @Rodrigo Maia I am unable to start the OL server on an EC2 instance. I'm trying to connect from Glue to OL hosted on EC2 so I can provide the OL URL, but I'm unable to bring up the OL server.

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-26 10:23:28

*Thread Reply:* here is an example:

Monisha SL (monisha.sl@philips.com)
2024-02-26 11:17:42

*Thread Reply:* @Rodrigo Maia Thanks, this helped the job run successfully. But I'm quite confused about where to find the generated lineage. Does this need to be integrated with the Marquez UI? Apologies for so many questions, I'm new to this piece of work.

Rodrigo Maia (rodrigo.maia@manta.io)
2024-02-26 11:22:45

*Thread Reply:* In this case, the Spark config is set to console, so you can check the logs in CloudWatch. I haven't tried with Marquez

Monisha SL (monisha.sl@philips.com)
2024-02-26 11:24:41

*Thread Reply:* I can find it in the logs; I'm looking for a UI to showcase the lineage. Any help will be appreciated. Thank you.

soumilshah1995 (shahsoumil519@gmail.com)
2024-02-22 18:52:31

Hello there! I am new here. I am trying to run the Docker example given on GitHub and I'm facing the following issue, running on a Mac M2:

```git clone git@github.com:MarquezProject/marquez.git && cd marquez

./docker/up.sh```

```
(venv) soumilshah@Soumils-MBP marquez % ./docker/up.sh
...creating volumes: marquez_data, marquez_db-conf, marquez_db-init, marquez_db-backup
Successfully copied 7.17kB to volumes-provisioner:/data/wait-for-it.sh
Added files to volume marquez_data: wait-for-it.sh
Successfully copied 2.05kB to volumes-provisioner:/db-conf/postgresql.conf
Added files to volume marquez_db-conf: postgresql.conf
Successfully copied 2.05kB to volumes-provisioner:/db-init/init-db.sh
Added files to volume marquez_db-init: init-db.sh
DONE!
[+] Running 39/3
 ✔ web 12 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 31.9s
 ✔ db 14 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 15.9s
 ✔ api 10 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 8.8s
[+] Building 0.0s (0/0) docker:desktop-linux
[+] Running 6/2
 ✔ Network marquez_default Created 0.7s
 ✔ Container marquez-db Created 0.7s
 ✔ Container marquez-api Created 0.0s
 ! api The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s
 ✔ Container marquez-web Created 0.0s
 ! web The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s
Attaching to marquez-api, marquez-db, marquez-web
Error response from daemon: Ports are not available: exposing port TCP 0.0.0.0:5000 -> 0.0.0.0:0: listen tcp 0.0.0.0:5000: bind: address already in use
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 08:28:14

*Thread Reply:* AFAIK on newer Macs the AirPlay Receiver runs on that port - please change the port to something else that's unused, or disable AirPlay Receiver

✅ soumilshah1995
soumilshah1995 (shahsoumil519@gmail.com)
2024-02-23 18:37:07

*Thread Reply:* Thanks

Max Zheng (mzheng@plaid.com)
2024-02-23 02:05:58

Hi, I'm taking a look at lineage data from Spark and there's a weird event type called {spark_application_name}.map_partitions_parallel_collection, which has 1 input (an S3 path) and 1 output which is, strangely, the same S3 path as the input - anyone have an idea what this event is?

spark.logicalPlan=None, spark_properties=None are both None for this event which seems kind of weird

Max Zheng (mzheng@plaid.com)
2024-02-23 02:09:39

*Thread Reply:* Could be some oddity with using Apache Hudi but the Spark execution plan just shows a SaveIntoDataSourceCommand (which has a correct looking OpenLineage event)

👍 Maciej Obuchowski
Max Zheng (mzheng@plaid.com)
2024-02-23 02:09:53

*Thread Reply:* Just found this extra event that was generated very odd

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 06:45:26

*Thread Reply:* this is RDD event I think

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 06:46:16
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 06:53:52

*Thread Reply:* RDD events work kinda differently, they go through RddExecutionContext and not SparkSQLExecutionContext - LogicalPlan is a concept of Spark SQL https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-QueryExecution.html

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 06:54:27

*Thread Reply:* on the other hand, it would probably be good to attach the Spark Properties facet

Max Zheng (mzheng@plaid.com)
2024-02-23 12:51:49

*Thread Reply:* Ah I see

Max Zheng (mzheng@plaid.com)
2024-02-23 12:53:08

*Thread Reply:* It's a little weird that the output seems to be created erroneously - it definitely isn't writing anything to the input path

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 15:35:53

*Thread Reply:* Would be nice to look at it more as part of more proper Hudi support

Max Zheng (mzheng@plaid.com)
2024-02-23 16:40:02

*Thread Reply:* Yep, and thanks for the help in explaining things/code pointers 🙂

Balachandru S (balachandru2468@gmail.com)
2024-02-23 04:43:49

Hi Team, I want to generate table-level and column-level lineage for a PySpark script. With the help of OpenLineage I can see the lineage in Marquez. Now I am interested in the limitations here: can I create lineage for a PySpark script with 50+ sources and targets? Are there any limitations on the source and target file count in a Spark job? Similarly, is there any limitation on Spark job depth? Thanks.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 06:44:50

*Thread Reply:* I don't think there's a hard limitation; we've seen it working for one job where the serialized size of the logical plan was >100MB

jayant joshi (itsjayantjoshi@gmail.com)
2024-02-23 08:09:09

Hi Team, if any transformation changes happen in a Spark job, can we track this in the data lineage version history?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 08:26:11

*Thread Reply:* Do you mean column-level lineage? If yes, then we generally have that for Spark.

Balachandru S (balachandru2468@gmail.com)
2024-02-23 08:42:38

*Thread Reply:* @Maciej Obuchowski if I run the same Spark job multiple times, each time with some transformation changes, can we trace the transformation changes between runs? Thanks.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 05:36:30

*Thread Reply:* Do you want to track exactly how the transformation happens? If so, I would look at the spark.logicalPlan facet; however, it's raw information, and I'm not sure how hard it is to get anything meaningful out of it. Do you want to just track the end result - how the resulting columns are formed? Then I would look at the column-level lineage facet.

If that answer does not satisfy you - we process transformation information as part of generating column-level lineage; you can take a look at that code.

Balachandru S (balachandru2468@gmail.com)
2024-02-26 05:56:06

*Thread Reply:* Thanks @Maciej Obuchowski

Hubert Boguski (hubert.boguski@spglobal.com)
2024-02-23 15:52:57

Hi everyone! I want to say this is a really awesome and much needed effort for the data community. Just wanted to double check my understanding here -

TLDR: is there a way to achieve this with Python operators in a DAG with multiple Python operators: To propagate the job hierarchy, tasks must send their own run ids so that the downstream process can report the ParentRunFacet with the proper run id.

Is there a way to have multiple PythonOperators (multiple Python tasks) in a DAG trace job hierarchy in Marquez? Right now I am invoking a GET request and receiving a response in one task and passing the first task's output to another task, but the jobs in Marquez show up as separate jobs (job hierarchy not shown). I understand there are some extractor classes that could be implemented, but I'm just wondering if there is anything simpler out there to show hierarchy with PythonOperators 😅

I read somewhere that we can use the parent run id to link these jobs, but I had no luck (picture of docs)

I also saw a YouTube video: https://www.youtube.com/watch?v=yrSDngUhj8U But as I understand it, this is still a work in progress.

I know there are some supported extractors already, but they don't support my use case. Thanks in advance!!!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 05:34:03

*Thread Reply:* Hey, I’m not exactly sure what you’d like to achieve.

First, let’s explain the hierarchy. When an Airflow TaskInstance is running, it is considered a run, for which the DagRun is its parent run. Similarly, if a task spawns other jobs, they should set the TaskInstance as their parent run. More about the idea here: https://openlineage.io/docs/spec/job-hierarchy

If you’d like to emit additional events within a PythonOperator’s callback and set this task as the parent run for those additional events, you should take another approach:
• here’s how the task uuid is generated: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/plugins/listener.py#L79
• based on the above, you could use the build_task_instance_run_id method with additional information taken from the context (that is: dag_id, task_id, execution_date, try_number)

The code snippet you provided points to the DAG as the parent run.
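
Putting those pointers together, a hedged sketch of building a ParentRunFacet that points at the current task rather than the DAG; the import paths and the build_task_instance_run_id signature follow the provider code linked above and may differ between versions, and the namespace is a placeholder:

```
from airflow.providers.openlineage.plugins.adapter import OpenLineageAdapter
from openlineage.client.facet import ParentRunFacet

def parent_run_facet(context) -> ParentRunFacet:
    ti = context["ti"]
    run_id = OpenLineageAdapter.build_task_instance_run_id(
        dag_id=ti.dag_id,
        task_id=ti.task_id,
        execution_date=ti.execution_date,
        try_number=ti.try_number,
    )
    # Attach this facet to the run of any event emitted from within the task
    return ParentRunFacet.create(
        runId=run_id,
        namespace="default",
        name=f"{ti.dag_id}.{ti.task_id}",
    )
```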

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 05:35:02

*Thread Reply:* Extracting lineage from PythonOperator is still in progress, proposal should be public soon.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 05:36:11

*Thread Reply:* Maybe I’m mistaken what you mean by job hierarchy. What would you expect to be shown as a graph?

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-02-25 23:51:38

Hi team, Just wanted to know if we can we configure openlineage to send events to two endpoints?

👍 Derya Meral
Kacper Muda (kacper.muda@getindata.com)
2024-02-26 02:57:35

*Thread Reply:* Hi, I don't think it's possible with the built-in HTTP transport, but you can always implement a custom transport that suits your needs 🙂

👍 Suhas Shenoy, Derya Meral
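
Alongside the fluentd proxy suggested below, one custom-transport route is a simple fan-out that forwards each event to several configured transports; a minimal sketch, where FanOutTransport and both URLs are made up:

```
from openlineage.client.transport.http import HttpConfig, HttpTransport
from openlineage.client.transport.transport import Transport

class FanOutTransport(Transport):
    def __init__(self, transports):
        self.transports = transports

    def emit(self, event) -> None:
        # Forward every event to each downstream transport
        for transport in self.transports:
            transport.emit(event)

transport = FanOutTransport([
    HttpTransport(HttpConfig(url="http://marquez:5000")),
    HttpTransport(HttpConfig(url="http://datahub-bridge:8080")),
])
```
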
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 03:27:59

*Thread Reply:* Hi @Suhas Shenoy, this was one of the motivations behind fluentD proxy -> https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd

❤️ Kacper Muda, Derya Meral
👍 Suhas Shenoy, Jakub Dardziński, Mattia Bertorello, Derya Meral
Derya Meral (drderyameral@gmail.com)
2024-02-26 11:24:36

*Thread Reply:* I was wondering the same thing and joined the slack to ask this. In our case, we're wondering whether it would be possible to use both Marquez and Datahub with our Airflow. Thanks for the pointer, @Kacper Muda!

Max Zheng (mzheng@plaid.com)
2024-02-26 12:37:43

Does anyone know how to disable the logger for 4/02/23 09:08:01 INFO AsyncEventQueue: Process of event SparkListenerJobStart? It seems like spark.redaction.regex isn't respected by this logger, which causes secrets to get logged when Spark OpenLineage is enabled

Max Zheng (mzheng@plaid.com)
2024-02-26 12:51:28

I also ran into an error Py4JNetworkError("Answer from Java side is empty") very frequently on some jobs. It seems like this is caused by our Spark sessions being explicitly stopped via spark.stop() and then OpenLineageSparkListener trying to access the stopped Spark session, which causes the application to fail (I suspect in https://github.com/OpenLineage/OpenLineage/blob/5298e8a30a8168dd8096f334e5d484812a[…]park/agent/lifecycle/plan/SaveIntoDataSourceCommandVisitor.java)

Should session-already-closed exceptions be handled gracefully there? Part of this is likely due to Hudi not being supported well (it seems very slow for some jobs and doesn't generate any events before crashing), but this seems like a race condition that can cause crashes more generally

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-27 02:26:01

*Thread Reply:* Hi Max,

Do you have any logs of OpenLineageSparkListener crashing? All the exceptions from OpenLineageSparkListener trying to build an event should be caught, and we do have a test for this. Please provide more details on that. How do you know it's failing? The desired behaviour is to log an exception and proceed.

Max Zheng (mzheng@plaid.com)
2024-02-27 13:26:46

*Thread Reply:* Seems like it's in OpenLineageSparkListener.onJobEnd:

```
24/02/25 16:12:49 INFO PlanUtils: apply method failed with java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:

org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
py4j.ClientServerConnection.run(ClientServerConnection.java:106)
java.lang.Thread.run(Thread.java:750)

The currently active SparkContext was created at:

(same creation stack trace as above)

at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:121) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:113) ~[spark-sql_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:962) ~[spark-sql_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1023) ~[spark-sql_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.sql.SQLContext.getOrCreate(SQLContext.scala) ~[spark-sql_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.hudi.client.common.HoodieSparkEngineContext.<init>(HoodieSparkEngineContext.java:65) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.SparkHoodieTableFileIndex.<init>(SparkHoodieTableFileIndex.scala:65) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:81) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.HoodieBaseRelation.fileIndex$lzycompute(HoodieBaseRelation.scala:236) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.HoodieBaseRelation.fileIndex(HoodieBaseRelation.scala:234) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.BaseFileOnlyRelation.toHadoopFsRelation(BaseFileOnlyRelation.scala:153) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.DefaultSource$.resolveBaseFileOnlyRelation(DefaultSource.scala:268) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:232) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:111) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:68) ~[hudi-spark-bundle.jar:0.12.2-amzn-0]
at io.openlineage.spark.agent.lifecycle.plan.SaveIntoDataSourceCommandVisitor.apply(SaveIntoDataSourceCommandVisitor.java:140) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.plan.SaveIntoDataSourceCommandVisitor.apply(SaveIntoDataSourceCommandVisitor.java:47) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:94) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:85) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:75) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_392]
at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:67) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:39) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$23(OpenLineageRunEventBuilder.java:451) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_392]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_392]
at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_392]
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_392]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:1.8.0_392]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_392]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) ~[?:1.8.0_392]
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272) ~[?:1.8.0_392]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_392]
at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313) ~[?:1.8.0_392]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_392]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_392]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_392]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) ~[?:1.8.0_392]
at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:410) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:298) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:281) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:259) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:257) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:167) ~[io.openlineage_openlineage-spark-1.6.2.jar:1.6.2]
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:39) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.15.jar:?]
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1447) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.3.1-amzn-0.1.jar:3.3.1-amzn-0.1]

24/02/25 16:13:04 INFO AsyncEventQueue: Process of event SparkListenerJobEnd(23,1708877534168,JobSucceeded) by listener OpenLineageSparkListener took 15.64437991s.
24/02/25 16:13:04 ERROR JniBasedUnixGroupsMapping: error looking up the name of group 1001: No such file or directory
```

Max Zheng (mzheng@plaid.com)
2024-02-27 19:20:10

*Thread Reply:* Hmm yeah, I'm confused: https://github.com/OpenLineage/OpenLineage/blob/1.6.2/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PlanUtils.java#L277 seems to indicate what you said (safeApply swallows the exception), but the job exits with an error code afterwards (EMR marks the job as failed)

The crash stops if I remove spark.stop() or disable the OpenLineage listener so this is odd 🤔

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-28 04:21:31

*Thread Reply:* 24/02/25 16:12:49 INFO PlanUtils: apply method failed with -> yeah, the log level is INFO. It would look as if you were trying to run some action after stopping Spark, but you said that disabling the OpenLineage listener makes it succeed. This is odd.

Max Zheng (mzheng@plaid.com)
2024-02-28 13:11:11

*Thread Reply:* Maybe it's some race condition in shutdown logic with event listeners? It seems like the listener being enabled is causing executors to be spun up (which fails) after the Spark session is already stopped.

• After the stack trace above, I see ConsoleTransport log some OpenLineage event data
• Then, oddly, it looks like a bunch of executors are launched after the Spark session has already been stopped
• These executors crash on startup, which is likely what's causing the Spark job to exit with an error code

```
24/02/24 07:18:03 INFO ConsoleTransport: {"eventTime":"2024_02_24T07:17:05.344Z","producer":"https://github.com/OpenLineage/OpenLineage/tree/1.6.2/integration/spark", ...
24/02/24 07:18:06 INFO YarnAllocator: Will request 1 executor container(s) for ResourceProfile Id: 0, each with 4 core(s) and 27136 MB memory. with custom resources: <memory:27136, max memory:2147483647, vCores:4, max vCores:2147483647>
24/02/24 07:18:06 INFO YarnAllocator: Submitted 1 unlocalized container requests.
24/02/24 07:18:09 INFO YarnAllocator: Launching container container_1708758297553_0001_01_000004 on host {ip} for executor with ID 3 for ResourceProfile Id 0 with resources <memory:27136, vCores:4>
24/02/24 07:18:09 INFO YarnAllocator: Launching executor with 21708m of heap (plus 5428m overhead/off heap) and 4 cores
24/02/24 07:18:09 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
24/02/24 07:18:09 INFO YarnAllocator: Completed container container_1708758297553_0001_01_000003 on host: {ip} (state: COMPLETE, exit status: 1)
24/02/24 07:18:09 WARN YarnAllocator: Container from a bad node: container_1708758297553_0001_01_000003 on host: {ip}. Exit status: 1. Diagnostics: [2024-02-24 07:18:06.508]Exception from container-launch.
Container id: container_1708758297553_0001_01_000003
Exit code: 1
Exception message: Launch container failed
Shell error output: Nonzero exit code=1, error message='Invalid argument number'
```

The new executors all fail with:

```
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@{ip}:{port}
```

Max Zheng (mzheng@plaid.com)
2024-02-28 13:44:20

*Thread Reply:* The debug logs from AsyncEventQueue show OpenLineageSparkListener took 21.301411402s, fwiw - I'm assuming that's abnormally long

Max Zheng (mzheng@plaid.com)
2024-02-28 16:07:37

*Thread Reply:* The YARN logs also seem to indicate the listener is somehow causing the app to start up again:

```
2024-02-24 07:18:00,152 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (SchedulerEventDispatcher:Event Processor): container_1708758297553_0001_01_000002 Container Transitioned from RUNNING to COMPLETED
2024-02-24 07:18:00,155 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator (SchedulerEventDispatcher:Event Processor): assignedContainer application attempt=appattempt_1708758297553_0001_000001 container=null queue=default clusterResource=<memory:54272, vCores:8> type=OFF_SWITCH requestedPartition=
2024-02-24 07:18:00,155 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1708758297553_0001_000001
2024-02-24 07:18:00,155 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (SchedulerEventDispatcher:Event Processor): container_1708758297553_0001_01_000003 Container Transitioned from NEW to ALLOCATED
```

Is there some logic in the listener that can create a Spark session if there is no active session?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-29 03:29:40

*Thread Reply:* Not sure about this; I couldn't find any place doing that in the code

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-29 05:36:43

*Thread Reply:* Probably another instance when doing something generic does not work with Hudi well 😶

Max Zheng (mzheng@plaid.com)
2024-02-29 12:44:24

*Thread Reply:* Dumb question: what info needs to be fetched from Hudi? Is this in the createRelation call? I'm surprised the logs seem to indicate Hudi table metadata is being read from S3 in the listener.

What would need to be implemented for proper Hudi support?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-29 15:06:42

*Thread Reply:* @Max Zheng well, basically we need at least proper name and namespace for the dataset. How we do that is completely dependent on the underlying code, so probably somewhere here: https://github.com/apache/hudi/blob/3a97b01c0263c4790ffa958b865c682f40b4ada4/hudi-[…]-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala

Most likely we don't need to do any external calls or read anything from S3. It's just done because without something that understands Hudi classes we just do the generic thing (createRelation) that has the biggest chance to work.

For example, for Iceberg we can get the data required just by getting config from their catalog config - and I think with Hudi it has to work the same way, because logically - if you're reading some table, you have to know where it is or how it's named.

Max Zheng (mzheng@plaid.com)
2024-02-29 16:05:07

*Thread Reply:* That makes sense, and that info is in the hoodie.properties file that seems to be loaded based on the logs. But the events I see OL generate seem to have the S3 path and S3 bucket as the name and namespace respectively - i.e. it doesn't seem to be using any of the metadata read from Hudi?

```
"outputs": [
  {
    "namespace": "s3://{bucket}",
    "name": "{S3 prefix path}",
```

(we'd be perfectly happy with just the S3 path/bucket - is there a way to disable createRelation or have OL treat these Hudi tables as raw parquet?)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-05 05:58:14

*Thread Reply:* > But the events I see OL generate seem to have S3 path and S3 bucket as a the name and namespace respectively - ie. it doesn't seem to be using any of the metadata being read from Hudi? Probably yes - as I've said, the OL handling of it is just inefficient and not specific to Hudi. It's good enought that they generate something that seems to be valid dataset naming 🙂 And, the fact it reads S3 metadata is not intended - it's just that Hudi implements createRelation this way.

(we'd be perfectly happy with just the S3 path/bucket - is there a way to disable createRelation or have OL treat these Hudi as raw parquet?) The way OpenLineage Spark integration works is by looking at Optimized Logical Plan of particular Spark job. So the solution would be to implement Hudi specific path in SaveIntoDataSourceCommandVisitor or any particular other visitor that touches on the Hudi path - or, if Hudi has their own LogicalPlan nodes, implement support for it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-05 08:50:14

*Thread Reply:* (sorry for the late answer @Max Zheng, I thought I had sent the response but it was sitting in my drafts for a few days 😞)

Max Zheng (mzheng@plaid.com)
2024-03-06 19:37:32

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski

I've been digging into the source code to see if I can help contribute Hudi support for OL. At least in SaveIntoDataSourceCommandVisitor it seems all I need to do is:
```diff
--- a/integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/SaveIntoDataSourceCommandVisitor.java
+++ b/integration/spark/shared/src/main/java/io/openlineage/spark/agent/lifecycle/plan/SaveIntoDataSourceCommandVisitor.java
@@ -114,8 +114,9 @@ public class SaveIntoDataSourceCommandVisitor
     LifecycleStateChange lifecycleStateChange =
         (SaveMode.Overwrite == command.mode()) ? OVERWRITE : CREATE;

-    if (command.dataSource().getClass().getName().contains("DeltaDataSource")) {
+    if (command.dataSource().getClass().getName().contains("DeltaDataSource")
+        || command.dataSource().getClass().getName().contains("org.apache.hudi.Spark32PlusDefaultSource")) {
       if (command.options().contains("path")) {
+        log.info("Delta/Hudi data source detected, path: {}", command.options().get("path").get());
         URI uri = URI.create(command.options().get("path").get());
         return Collections.singletonList(
             outputDataset()
@@ -123,6 +124,7 @@ public class SaveIntoDataSourceCommandVisitor
       }
     }
```
This seems to work and avoids the `createRelation` call, but I still run into the same crash 🤔 so now I'm not sure if this is a Hudi issue. Do you know of any other dependencies on the output data source? I wonder if https://openlineage.slack.com/archives/C01CK9T7HKR/p1708671958295659 rdd events could be the culprit?

I'm going to try and reproduce the crash without Hudi and just with parquet

Max Zheng (mzheng@plaid.com)
2024-03-06 20:24:14

*Thread Reply:* Hmm, reading over RDDExecutionContext, it seems highly unlikely that anything in it would cause this crash

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 04:53:44

*Thread Reply:* There might be another part related to reading from Hudi?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 04:54:22

*Thread Reply:* SaveIntoDataSourceCommandVisitor only takes care of the root node of the whole LogicalPlan

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 04:57:51

*Thread Reply:* I would serialize the logical plan and take a look at the leaf nodes of the job that causes the hang
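A quick way to do that from a spark-shell, as a minimal sketch (the Hudi path is illustrative):
```scala
// The leaves of the optimized logical plan are the relations being read;
// OpenLineage's visitors pattern-match on these plan nodes.
val df = spark.read.format("hudi").load("s3://bucket/path/to/table")
println(df.queryExecution.optimizedPlan.treeString)
df.queryExecution.optimizedPlan.collectLeaves().foreach(leaf => println(leaf.getClass.getName))
```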

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 04:58:05

*Thread Reply:* for a simple check, you can just make the dataset handler that handles them return early

Max Zheng (mzheng@plaid.com)
2024-03-07 11:54:39

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1708544898883449?thread_ts=1708541527.152859&cid=C01CK9T7HKR the parsed logical plan for my test job is just the SaveIntoDataSourceCommandVisitor (though I might be misunderstanding what you mean by leaf nodes)

Max Zheng (mzheng@plaid.com)
2024-03-07 12:12:28

*Thread Reply:* I was able to reproduce the issue with InsertIntoHadoopFsRelationCommand with a parquet write in the same job - I'm starting to suspect this is a Spark with Docker/yarn bug

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 13:17:19

*Thread Reply:* Without hudi read?

Max Zheng (mzheng@plaid.com)
2024-03-07 13:17:46

*Thread Reply:* Yep, it reads json and writes out as parquet

Max Zheng (mzheng@plaid.com)
2024-03-07 13:18:27

*Thread Reply:* We're with EMR so I created an AWS support ticket to ask whether this is a known issue with YARN/Spark on Docker

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 13:19:53

*Thread Reply:* Very interesting, would be great to see if we see more data in the metrics in the next release

Max Zheng (mzheng@plaid.com)
2024-03-07 13:21:17

*Thread Reply:* For sure, if it's on master or you have a patch, I can build the jar and run my job with it if that'd be helpful

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 13:22:04

*Thread Reply:* Not yet 😶

🙏 Max Zheng
Max Zheng (mzheng@plaid.com)
2024-03-11 20:20:14

*Thread Reply:* After even more investigation I think I found the cause. In https://github.com/OpenLineage/OpenLineage/blob/987e5b806dc8bd6c5aab5f85c97af76a87[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java a SparkListenerSQLExecutionEnd event is processed after the SparkSession is stopped - I believe createSparkSQLExecutionContext is doing something weird in https://github.com/OpenLineage/OpenLineage/blob/987e5b806dc8bd6c5aab5f85c97af76a87[…]n/java/io/openlineage/spark/agent/lifecycle/ContextFactory.java at SparkSession sparkSession = queryExecution.sparkSession(); - I'm not sure whether accessing the session after it's stopped is defined behavior? After I skipped the event in onOtherEvent when the session is stopped, it no longer crashes trying to spin up new executors

(I can make a Github issue + try to land a patch if you agree this seems like a bug)

Max Zheng (mzheng@plaid.com)
2024-03-11 21:27:14

*Thread Reply:* (it might affect all events and this is just the first hit)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 05:55:27

*Thread Reply:* @Max Zheng is the job particularly short-lived? We've sometimes seen the SparkSession stopped for very short jobs (especially if people close it manually), but it never led to any problems like this deadlock.

Max Zheng (mzheng@plaid.com)
2024-03-12 12:20:12

*Thread Reply:* I don't think job duration is related (also, it's not a deadlock, it's causing the app to crash https://openlineage.slack.com/archives/C01CK9T7HKR/p1709143871823659?thread_ts=1708969888.804979&cid=C01CK9T7HKR) - it failed for an ~1 hour long job and, when testing, it still failed when I sampled the job input with df.limit(10000). It seems to happen on jobs where events take a long time to process (like > 20s in the other thread).

I added this block to verify that the event is being processed after the Spark context is stopped, and to skip it:

```diff
+ private boolean isSparkContextStopped() {
+   return asJavaOptional(
+           SparkSession.getDefaultSession()
+               .map(sparkContextFromSession)
+               .orElse(activeSparkContext))
+       .map(
+           ctx -> {
+             return ctx.isStopped();
+           })
+       .orElse(true); // If for some reason we can't get the Spark context, we assume it's stopped
+ }

  @Override
  public void onOtherEvent(SparkListenerEvent event) {
    if (isDisabled) {
      return;
    }
+   if (isSparkContextStopped()) {
+     log.warn("SparkContext is stopped, skipping event: {}", event.getClass());
+     return;
+   }
```
This logs, and the same app no longer crashes:
```
24/03/12 04:57:14 WARN OpenLineageSparkListener: SparkSession is stopped, skipping event: class org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates
```
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 12:29:34

*Thread Reply:* might the crash be related to a memory issue?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 12:29:48

*Thread Reply:* ah, I see

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 12:31:30

*Thread Reply:* another question: are you explicitly stopping the SparkSession/SparkContext from within your job?

Max Zheng (mzheng@plaid.com)
2024-03-12 12:31:47

*Thread Reply:* Yep, it only happens when we explicitly stop with spark.stop()

Max Zheng (mzheng@plaid.com)
2024-03-13 16:18:23

*Thread Reply:* Created: https://github.com/OpenLineage/OpenLineage/issues/2513

Max Zheng (mzheng@plaid.com)
2024-02-26 12:52:47

Lastly, would disabling facets improve performance? E.g., disabling spark.logicalPlan

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-27 02:26:44

*Thread Reply:* Disabling spark.logicalPlan may improve the performance of populating the OL event. It's disabled by default in the most recent version (the one released yesterday). You can also use the circuit breaker feature if you are worried about the OL integration affecting your Spark jobs

🤩 Yannick Libert
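A minimal sketch of both options as Spark conf; the facets.disabled value mirrors usage shown elsewhere in this channel, while the circuitBreaker keys and thresholds are taken from the configuration docs for the new release and are worth double-checking there:
```
spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan]
spark.openlineage.circuitBreaker.type=javaRuntime
spark.openlineage.circuitBreaker.memoryThreshold=20
spark.openlineage.circuitBreaker.gcCpuThreshold=10
```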
Yannick Libert (yannick.libert.partner@decathlon.com)
2024-02-27 05:20:13

*Thread Reply:* This feature is going to be so useful for us! Love it!

Michael Robinson (michael.robinson@astronomer.io)
2024-02-26 14:23:37

@channel We released OpenLineage 1.9.1, featuring:
• Airflow: add support for JobTypeJobFacet properties #2412 @mattiabertorello
• dbt: add support for JobTypeJobFacet properties #2411 @mattiabertorello
• Flink: support Flink Kafka dynamic source and sink #2417 @HuangZhenQiu
• Flink: support multi-topic Kafka Sink #2372 @pawel-big-lebowski
• Flink: support lineage for JDBC connector #2436 @HuangZhenQiu
• Flink: add common config gradle plugin #2461 @HuangZhenQiu
• Java: extend circuit breaker loaded with ServiceLoader #2435 @pawel-big-lebowski
• Spark: integration now emits intermediate, application level events wrapping entire job execution #2371 @mobuchowski
• Spark: support built-in lineage within DataSourceV2Relation #2394 @pawel-big-lebowski
• Spark: add support for JobTypeJobFacet properties #2410 @mattiabertorello
• Spark: stop sending spark.LogicalPlan facet by default #2433 @pawel-big-lebowski
• Spark/Flink/Java: circuit breaker #2407 @pawel-big-lebowski
• Spark: add the capability to publish Scala 2.12 and 2.13 variants of openlineage-spark #2446 @d-m-h
A large number of changes and bug fixes were also included. Thanks to all our contributors with a special shout-out to @Damien Hawes, who contributed >10 PRs to this release!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.9.1
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.8.0...1.9.1
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🚀 Jakub Dardziński, Jackson Goerner, Abdallah, Yannick Libert, Mattia Bertorello, Tristan GUEZENNEC -CROIX-, Fabio Manganiello, Maciej Obuchowski
🎉 Abdallah, Mattia Bertorello
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 14:33:27

*Thread Reply:* Outstanding work @Damien Hawes 👏

➕ Michael Robinson, Mattia Bertorello, Fabio Manganiello, Maciej Obuchowski
Abdallah (abdallah@terrab.me)
2024-02-27 00:39:29

*Thread Reply:* Thank you 👏👏

ldacey (lance.dacey2@sutherlandglobal.com)
2024-02-27 11:02:19

*Thread Reply:* any idea how OL releases tie into the Airflow provider?

I assume a separate apache-airflow-providers-openlineage release would be made in the future to incorporate the new features/fixes?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-27 11:05:55

*Thread Reply:* yes, Airflow providers are released on behalf of the Airflow community, separately from Airflow core releases

Max Zheng (mzheng@plaid.com)
2024-02-27 15:24:57

*Thread Reply:* It seems like OpenLineage Spark is still on 1.8.0? Any idea when this will be updated? Thanks!

Damien Hawes (damien.hawes@booking.com)
2024-02-27 15:29:28

*Thread Reply:* @Max Zheng https://openlineage.io/docs/integrations/spark/#how-to-use-the-integration

Max Zheng (mzheng@plaid.com)
2024-02-27 15:30:14

*Thread Reply:* Oh got it, didn't see the note: The above necessitates a change in the artifact identifier for io.openlineage:openlineage-spark. After version 1.8.0, the artifact identifier has been updated. For subsequent versions, utilize: io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:${OPENLINEAGE_SPARK_VERSION}.

Max Zheng (mzheng@plaid.com)
2024-02-27 15:30:18

*Thread Reply:* Thanks!

Damien Hawes (damien.hawes@booking.com)
2024-02-27 15:30:36

*Thread Reply:* You're welcome.

Derya Meral (drderyameral@gmail.com)
2024-02-26 15:04:33

Hi all, I'm working on a local Airflow-OpenLineage-Marquez integration using Airflow 2.7.3 and python 3.10. Everything seems to be installed correctly with the appropriate settings. I'm seeing events, jobs, tasks trickle into the UI. I'm using the PostgresOperator. When it's time for the SQL code to be parsed, I'm seeing the following in my Airflow logs:
```
[2024-02-26, 19:43:17 UTC] {sql.py:457} INFO - Running statement: SELECT CURRENT_SCHEMA;, parameters: None
[2024-02-26, 19:43:17 UTC] {base.py:152} WARNING - OpenLineage provider method failed to extract data from provider.
[2024-02-26, 19:43:17 UTC] {manager.py:198} WARNING - Extractor returns non-valid metadata: None
```
Can anyone give me pointers on why exactly this might be happening? I've tried also with the SQLExecuteQueryOperator, same result. I previously got a Marquez setup to work with the external OpenLineage package for Airflow with Airflow 2.6.1. But I'm struggling with this newer integrated OpenLineage version

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 15:10:21

*Thread Reply:* Does this happen for some particular SQL but work for other statements? Also, my understanding is that it worked with openlineage-airflow on Airflow 2.6.1 (the same code)? What version of the OL provider are you using?

Derya Meral (drderyameral@gmail.com)
2024-02-26 15:20:22

*Thread Reply:* I've been using one toy DAG and have only tried with the two operators mentioned. Currently, my team's code doesn't use provider operators so it would not really work well with OL.

Yes, it worked with Airflow 2.6.1. Same code.

Right now, I'm using apache-airflow-providers-openlineage==1.5.0 and the other OL dependencies are at 1.9.1.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 15:21:00

*Thread Reply:* Would you want to share the SQL statement?

Derya Meral (drderyameral@gmail.com)
2024-02-26 15:31:42

*Thread Reply:* It has some PII in it, but it's basically in the form of:
```sql
DROP TABLE IF EXISTS users_meral.key_relations;

CREATE TABLE users_meral.key_relations AS
WITH staff AS (SELECT ...),
enabled AS (SELECT ...)
SELECT ...
FROM public.borrowers
LEFT JOIN ...;
```
We're splitting the query with sqlparse.split() and feeding it to a PostgresOperator.

Derya Meral (drderyameral@gmail.com)
2024-02-27 09:26:41

*Thread Reply:* I thought I should share our configs in case I'm missing something:
```ini
[openlineage]
disabled = False
disabled_for_operators =
namespace =
extractors =
config_path = /opt/airflow/openlineage.yml
transport =
disable_source_code =
```

Derya Meral (drderyameral@gmail.com)
2024-02-27 09:27:20

*Thread Reply:* The YAML file:
```yaml
transport:
  type: http
  url: http://marquez:5000
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-04 13:01:19

*Thread Reply:* Are you running on apple silicon?

Derya Meral (drderyameral@gmail.com)
2024-03-04 15:39:05

*Thread Reply:* Yep, is that the issue?

Michael Robinson (michael.robinson@astronomer.io)
2024-02-28 13:00:00

@channel Since lineage will be the focus of a panel at Data Council Austin next month, it seems like a great opportunity to organize a meetup. Please get in touch if you might be interested in attending, presenting or hosting!

✅ Sheeri Cabral (Collibra), Jarek Potiuk, Howard Yoo
❤️ Harel Shein, Julian LaNeve, Paweł Leszczyński, Maciej Obuchowski
Declan Grant (declan.grant@sdktek.com)
2024-02-28 14:37:16

Hi all, I'm running into an unusual issue with OpenLineage on Databricks when using OL 1.4.1 on a cluster that runs over 100 jobs every 30 minutes. After a couple of hours, a DRIVER_NOT_RESPONDING error starts showing up in the event log with the message Driver is up but is not responsive, likely due to GC. After a DRIVER_HEALTHY, the error occurs again several minutes later. Is this a known issue that has been solved in a later release, or is there something I can do in Databricks to stop this?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-29 05:27:20

*Thread Reply:* My guess would be that with that many jobs scheduled so close together, the SparkListener queue grows and some internal healthcheck times out?

Maybe you could try disabling spark.logicalPlan and spark_unknown facets to see if this speeds things up.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-29 09:42:27

*Thread Reply:* BTW, are you receiving OL events in the meantime?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-04 12:55:50

*Thread Reply:* Hi @Declan Grant, can you tell us if disabling the facets worked?

Declan Grant (declan.grant@sdktek.com)
2024-03-04 14:30:14

*Thread Reply:* We had already tried disabling the facets, and that did not solve the issue.

Here is the relevant spark config:
```
spark.openlineage.transport.type console
spark.openlineage.facets.disabled [spark_unknown;spark.logicalPlan;schema;columnLineage;dataSource]
```
We are not interested in column lineage at this time.

Declan Grant (declan.grant@sdktek.com)
2024-03-04 14:31:28

*Thread Reply:* OL has been uninstalled from the cluster, so I can't immediately say whether events are received while the driver is not responding.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-28 15:19:51

@channel This month's issue of OpenLineage News is in inboxes now! Sign up to ensure you always get the latest issue. In this edition: a rundown of open issues, new docs and new videos, plus updates on the Airflow Provider, Spark integration and Flink integration (+ more).

👍 Mattia Bertorello
Simran Suri (mailsimransuri@gmail.com)
2024-03-01 01:19:04

Hi all, I've been trying to gather clues on how OpenLineage fetches our inputs' namespace and name from our Spark codebase. Pointing me to the exact logic would be very helpful for one of my use cases.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-01 02:25:10

*Thread Reply:* There is no single place where the namespace is assigned to a dataset, as this strictly depends on which datasets are read. Spark, like other OpenLineage integrations, follows the naming convention -> https://openlineage.io/docs/spec/naming
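Applying that convention, dataset identifiers look like this (illustrative values):
```
s3 object:       namespace = s3://my-bucket                  name = warehouse/orders
postgres table:  namespace = postgres://db.example.com:5432  name = analytics.public.orders
```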

Fabio Manganiello (fabio.manganiello@booking.com)
2024-03-01 04:42:12

Hi all, I'm working on propagating the parent facet from an Airflow DAG to the dbt workflows it launches, and I'm a bit puzzled by the current logic in lineage_parent_id. It generates an ID in the form namespace/name/run_id (which is the format that dbt-ol expects as well), but here name is actually a UUID generated from the job's metadata, and run_id is the internal Airflow task instance name (usually a concatenation of execution date + try number) instead of a UUID, like OpenLineage advises.

Instead of using this function I've made my own where name=<dag_id>.<task_id> (as this is the job name propagated in other OpenLineage events as well), and run_id = lineage_run_id(operator, task_instance) - basically using, for the run_id, the UUID hashing logic that is currently used for the name. This seems more OpenLineage-compliant and it allows us to link things properly.

Is there some reason behind the current logic that I'm missing? Things are even more confusing IMHO because there's also a new_lineage_run_id utility that calculates the run_id simply as a random UUID, without the UUID serialization logic of lineage_run_id, so it's not clear which one I'm supposed to use.

👀 Kacper Muda
Fabio Manganiello (fabio.manganiello@booking.com)
2024-03-01 05:52:28

*Thread Reply:* FYI the function I've come up with to link things properly looks like this:

```python
from airflow.models import BaseOperator, TaskInstance
from openlineage.airflow.macros import _JOB_NAMESPACE
from openlineage.airflow.plugin import lineage_run_id

def lineage_parent_id(self: BaseOperator, task_instance: TaskInstance) -> str:
    return "/".join(
        [
            _JOB_NAMESPACE,
            f"{task_instance.dag_id}.{task_instance.task_id}",
            lineage_run_id(self, task_instance),
        ]
    )
```

Damien Hawes (damien.hawes@booking.com)
2024-03-04 04:19:39

*Thread Reply:* @Paweł Leszczyński @Jakub Dardziński - any thoughts here?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-04 05:12:15

*Thread Reply:* new_lineage_run_id is some very old util method that should be deleted IMHO

I agree that what you propose is more OL-compliant. Indeed, what we have in the Airflow provider for the dbt Cloud integration is pretty much the same as what you have: https://github.com/apache/airflow/blob/main/airflow/providers/dbt/cloud/utils/openlineage.py#L132

the reason for that is, I think, that the logic was subject to change over time and the dbt-ol script just was not updated properly

👍 Fabio Manganiello
Michael Robinson (michael.robinson@astronomer.io)
2024-03-04 12:53:44

*Thread Reply:* @Fabio Manganiello would you mind opening an issue about this on GitHub?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-04 12:54:14

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2488 there is one already 🙂 @Fabio Manganiello thank you for that!

:gratitude_thank_you: Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-04 13:05:13

*Thread Reply:* Oops, should have checked first! Yes, thanks Fabio

Kacper Muda (kacper.muda@getindata.com)
2024-03-04 13:19:50

*Thread Reply:* There is also a PR already, sent as a separate message by @Fabio Manganiello, and the same fix for the provider here. Some discussion is needed about which changes can be made to the macros and whether they will be "breaking", so feel free to comment.

:gratitude_thank_you: Michael Robinson
Honey Thakuria (Honey_Thakuria@intuit.com)
2024-03-01 07:49:55

Hey team, we're trying to extract certain Spark metrics with OL using custom Facets.

But we're not getting SparkListenerTaskStart , SparkListenerTaskEnd event as part of custom facet.

We're only able to get SparkListenerJobStart, SparkListenerJobEnd, SparkListenerSQLExecutionStart, SparkListenerSQLExecutionEnd.

This is how our custom facet builder code looks:
```java
@Override
protected void build(SparkListenerEvent event, BiConsumer<String, ? super TestRunFacet> consumer) {
  if (event instanceof SparkListenerSQLExecutionStart) { ... }
  if (event instanceof SparkListenerTaskStart) { ... }
}
```
But when we're executing the same Spark SQL using a custom listener without OL facets, we're able to get task-level metrics too:
```java
public class IntuitSparkMetricsListener extends SparkListener {
  @Override
  public void onJobStart(SparkListenerJobStart jobStart) {
    log.info("job start logging starts");
    log.info(jobStart.toString());
  }

  @Override
  public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
  }
  ....
}
```
Could anyone give us some input on how to get task-level metrics in an OL facet itself? Also, is there any issue due to SparkListenerEvent vs SparkListener?

cc @Athitya Kumar @Kiran Hiremath

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-01 08:00:09

*Thread Reply:* OpenLineageSparkListener is not listening to SparkListenerTaskStart at all. It listens to SparkListenerTaskEnd, but only to fill metrics for the OutputStatisticsOutputDatasetFacet
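For context, that facet carries only row count and byte size on the output dataset; a sketch of how it appears in an event (values illustrative):
```json
"outputFacets": {
  "outputStatistics": {
    "_producer": "https://github.com/OpenLineage/OpenLineage",
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json",
    "rowCount": 1000,
    "size": 5242880
  }
}
```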

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-01 08:03:05

*Thread Reply:* I think this would be a not-that-small change: you'd need to add handling for those methods to the ExecutionContexts https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2[…]java/io/openlineage/spark/agent/lifecycle/ExecutionContext.java and to OpenLineageSparkListener itself to pass them forward.

When it comes to implementing them in particular contexts, I would make sure they don't emit unless you have something concrete set up for them, like those metrics you've set up.

Fabio Manganiello (fabio.manganiello@booking.com)
2024-03-04 06:57:09

Hi folks, I have created a PR to address the required changes in the Airflow lineage_parent_id macro, as discussed in my previous comment (cc @Jakub Dardziński @Damien Hawes @Mattia Bertorello)

👀 Kacper Muda
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-13 14:10:46

*Thread Reply:* Hey Fabio, thanks for the PR. Please let us know if you need any help with fixing tests.

🙌 Fabio Manganiello
Michael Robinson (michael.robinson@astronomer.io)
2024-03-06 15:22:46

@channel This month’s TSC meeting is next week on a new day/time: Wednesday the 13th at 9:30am PT. Please note that this will be the new day/time going forward! On the tentative agenda:
• announcements
  ◦ new integrations: DataHub and OpenMetadata
  ◦ upcoming events
• recent release 1.9.1 highlights
• Scala 2.13 support in Spark overview by @Damien Hawes
• Circuit breaker in Spark & Flink @Paweł Leszczyński
• discussion items
• open discussion
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? Reply here or DM me to be added to the agenda.

🙏 Willy Lulciuc
✅ Sheeri Cabral (Collibra)
Max Zheng (mzheng@plaid.com)
2024-03-06 19:45:11

Hi, would it be reasonable to add a flag to skip RUNNING events in the Spark integration? https://openlineage.io/docs/integrations/spark/job-hierarchy For some jobs we're seeing AsyncEventQueue report ~20s to process each event, and a lot of RUNNING events being generated

IMO this might work as an alternative to https://github.com/OpenLineage/OpenLineage/issues/2375? It seems more valuable to get the START/COMPLETE events than the intermediate RUNNING events

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:13:16

*Thread Reply:* Well, I think the real problem is the 20s event generation. What we should do is include the time spent in each visitor or dataset builder within the debug facet. Once this is done, we could reach out to you again and let you guide us to the code path that leads to such a scenario.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:13:44

*Thread Reply:* @Maciej Obuchowski do we have an issue for this? I think we discussed it recently.

Max Zheng (mzheng@plaid.com)
2024-03-07 11:58:05

*Thread Reply:* > What we should do is to include timer spent on each visitor or dataset builder within debug facet. I could help provide this data if that'd be helpful, how/what instrumentation should I add? If you've got a patch handy I could apply it locally, build, and collect this data from my test job

Max Zheng (mzheng@plaid.com)
2024-03-07 12:15:42

*Thread Reply:* It's also taking > 20s per event with parquet writes instead of Hudi writes in my job, so I don't think that's the culprit

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 14:45:59

*Thread Reply:* I'm working on instrumentation/metrics right now, will be ready for next release 🙂

🙌 Max Zheng, Sheeri Cabral (Collibra)
Max Zheng (mzheng@plaid.com)
2024-03-11 20:04:22

*Thread Reply:* I did some manual timing and 90% of the latency is from buildInputDatasets https://github.com/OpenLineage/OpenLineage/blob/987e5b806dc8bd6c5aab5f85c97af76a87[…]enlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java

Manual as in I modified:
```java
long startTime = System.nanoTime();
List<InputDataset> datasets =
    Stream.concat(
            buildDatasets(nodes, inputDatasetBuilders),
            openLineageContext
                .getQueryExecution()
                .map(
                    qe ->
                        ScalaConversionUtils.fromSeq(qe.optimizedPlan().map(inputVisitor))
                            .stream()
                            .flatMap(Collection::stream)
                            .map(((Class<InputDataset>) InputDataset.class)::cast))
                .orElse(Stream.empty()))
        .collect(Collectors.toList());
long endTime = System.nanoTime();
double durationInSec = (endTime - startTime) / 1_000_000_000.0;
log.info("buildInputDatasets 1: {}s", durationInSec);
```
```
24/03/11 23:44:58 INFO OpenLineageRunEventBuilder: buildInputDatasets 1: 95.710143007s
```
Is there anything I can instrument/log to narrow down further why this is so slow? buildOutputDatasets is also kind of slow at ~10s

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 05:57:58

*Thread Reply:* @Max Zheng it's not extremely easy because sometimes QueryPlanVisitors/DatasetBuilders delegate work to other ones, but I think I'll have a relatively good solution soon: https://github.com/OpenLineage/OpenLineage/pull/2496

👍 Paweł Leszczyński, Max Zheng
Max Zheng (mzheng@plaid.com)
2024-03-12 12:20:24

*Thread Reply:* Got it, should I open a Github issue to track this?

For context, the code is:
```python
def load_df_with_schema(spark: SparkSession, s3_base: str):
    schema = load_schema(spark, s3_base)
    file_paths = get_file_paths(spark, "/".join([s3_base, "manifest.json"]))
    return spark.read.format("json").load(
        file_paths,
        schema=schema,
        mode="FAILFAST",
    )
```
And the input schema has ~250 columns

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 12:24:00

*Thread Reply:* the instrumentation issues are already there, but please do open an issue for the slowness 👍

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 12:24:34

*Thread Reply:* and yes, it can be some degenerate case where we do something way more often than once

Max Zheng (mzheng@plaid.com)
2024-03-12 12:25:19

*Thread Reply:* Got it, I'll try to create a working reproduction and ticket it 🙂

Max Zheng (mzheng@plaid.com)
2024-03-13 16:18:31

*Thread Reply:* Created https://github.com/OpenLineage/OpenLineage/issues/2511

👍 Maciej Obuchowski
Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-07 02:02:43

Hi team... I am trying to emit openlineage events from a spark job. When I submit the job using spark-submit, this is what I see in console.

```
ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
io.openlineage.client.OpenLineageClientException: io.openlineage.spark.shaded.com.fasterxml.jackson.databind.JsonMappingException: Failed to find TransportBuilder (through reference chain: io.openlineage.client.OpenLineageYaml["transport"])
	at io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml(OpenLineageClientUtils.java:149)
	at io.openlineage.spark.agent.ArgumentParser.extractOpenlineageConfFromSparkConf(ArgumentParser.java:114)
	at io.openlineage.spark.agent.ArgumentParser.parse(ArgumentParser.java:78)
	at io.openlineage.spark.agent.OpenLineageSparkListener.initializeContextFactoryIfNotInitialized(OpenLineageSparkListener.java:277)
	at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:110)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
Caused by: io.openlineage.spark.shaded.com.fasterxml.jackson.databind.JsonMappingException: Failed to find TransportBuilder (through reference chain: io.openlineage.client.OpenLineageYaml["transport"])
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:402)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:361)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.BeanDeserializerBase.wrapAndThrow(BeanDeserializerBase.java:1853)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:316)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4825)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3809)
	at io.openlineage.client.OpenLineageClientUtils.loadOpenLineageYaml(OpenLineageClientUtils.java:147)
	... 18 more
Caused by: java.lang.IllegalArgumentException: Failed to find TransportBuilder
	at io.openlineage.client.transports.TransportResolver.lambda$getTransportBuilder$3(TransportResolver.java:38)
	at java.base/java.util.Optional.orElseThrow(Optional.java:403)
	at io.openlineage.client.transports.TransportResolver.getTransportBuilder(TransportResolver.java:37)
	at io.openlineage.client.transports.TransportResolver.resolveTransportConfigByType(TransportResolver.java:16)
	at io.openlineage.client.transports.TransportConfigTypeIdResolver.typeFromId(TransportConfigTypeIdResolver.java:35)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:159)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:151)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:136)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:263)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.impl.FieldProperty.deserializeAndSet(FieldProperty.java:147)
	at io.openlineage.spark.shaded.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:314)
	... 23 more
```
Can I get any help on this?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 02:20:32

*Thread Reply:* Looks like a misconfigured transport. Please refer to this -> https://openlineage.io/docs/integrations/spark/configuration/transport and https://openlineage.io/docs/integrations/spark/configuration/spark_conf for more details. I think you're missing the spark.openlineage.transport.type property.

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-07 02:28:10

*Thread Reply:* This is my configuration of the transport:
```scala
conf.set("sparkscalaversion", "2.12")
conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
conf.set("spark.openlineage.transport.type", "http")
conf.set("spark.openlineage.transport.url", "http://localhost:8082")
conf.set("spark.openlineage.transport.endpoint", "/event")
conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
```
During spark-submit, if I include --packages "io.openlineage:openlineage_spark:1.8.0" I am able to receive events.

I have already included this line in build.sbt:
```scala
libraryDependencies += "io.openlineage" % "openlineage-spark" % "1.8.0"
```
So I don't understand why I have to pass the packages again

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:07:04

*Thread Reply:* OK, the configuration is OK. I think that when you pull us in via libraryDependencies and build your own jar, you lose the service-registration files (META-INF/services) from within our JAR, which are used by the ServiceLoader

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:07:40

*Thread Reply:* this is happening here -> https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/transports/TransportResolver.java#L32

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:08:37

*Thread Reply:* And this is the known issue related to this -> https://github.com/OpenLineage/OpenLineage/issues/1860

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 03:09:47

*Thread Reply:* This comment -> https://github.com/OpenLineage/OpenLineage/issues/1860#issuecomment-1750536744 explains this and shows how to fix it. I am happy to help new contributors with this.
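For the sbt case specifically, a common user-side workaround when building a fat jar is to concatenate ServiceLoader registration files instead of discarding them; a sketch, assuming sbt-assembly is used:
```scala
// build.sbt: keep META-INF/services entries from all dependencies,
// so the TransportBuilder registrations inside openlineage-spark survive the merge
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
```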

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-07 03:10:57

*Thread Reply:* Thanks for the detailed reply and pointers. Will look into it.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 15:56:52

@channel The big redesign of Marquez Web is out now following a productive testing period and some modifications along with added features. In addition to a wholesale redesign including column lineage support, it includes a new dataset tagging feature. It's worth checking out as a consumption layer in your lineage solution. A blog post with more details is coming soon, but here are some screenshots to whet your appetite. (See the thread for a screencap of the column lineage display.) Marquez quickstart: https://marquezproject.ai/docs/quickstart/ The release itself: https://github.com/MarquezProject/marquez/releases/tag/0.45.0

🤯 Ross Turk, Julien Le Dem, Harel Shein, Juan Luis Cano Rodríguez, Paweł Leszczyński, Mattia Bertorello, Rodrigo Maia
❤️ Harel Shein, Peter Huang, Kengo Seki, Paul Wilson Villena, Paweł Leszczyński, Mattia Bertorello, alexandre bergere, Rodrigo Maia, Maciej Obuchowski, Ernie Ostic, Dongjin Seo
✅ Sheeri Cabral (Collibra)
Cory Visi (cvisi@amazon.com)
2024-03-07 17:34:18

*Thread Reply:* Are those field descriptions coming from emitted events? or from a defined schema that's being added by marquez?

Ted McFadden (tmcfadden@consoleconnect.com)
2024-03-07 17:51:42

*Thread Reply:* Nice work! Are there any examples of the mode being switched from Table level to Column level, or do I misunderstand what mode is?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 17:52:11

*Thread Reply:* @Cory Visi Those are coming from the events. The screenshots are of the UI seeded with metadata. You can find the JSON used for this here: https://github.com/MarquezProject/marquez/blob/main/docker/metadata.json

Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 17:53:38

*Thread Reply:* The three screencaps in my first message actually don't include the column lineage display feature (but there are lots of other upgrades in the release)

Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 17:55:56

*Thread Reply:* column lineage view:

❤️ Paweł Leszczyński, Rodrigo Maia, Cory Visi
Ted McFadden (tmcfadden@consoleconnect.com)
2024-03-07 18:01:21

*Thread Reply:* Thanks, that's what I wanted to get a look at. Cheers

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 18:01:25

*Thread Reply:* @Ted McFadden what the initial 3 screencaps show is switching between the graph view and detailed views of the datasets and jobs

David Sharp (davidsharp7@gmail.com)
2024-03-07 23:59:42

*Thread Reply:* Hey, with the tagging we’ve identified a slight bug - a PR has been put in to fix it.

Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-08 05:31:15

*Thread Reply:* The "query" section looks awesome, Congrats!!! But from the openlineage side, when is the query attribute available?

Cory Visi (cvisi@amazon.com)
2024-03-08 07:36:29

*Thread Reply:* Fantastic work!

Michael Robinson (michael.robinson@astronomer.io)
2024-03-08 07:55:30

*Thread Reply:* @Rodrigo Maia the OpenLineage spec supports this via the SQLJobFacet. See: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/SQLJobFacet.json
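A minimal sketch of how the facet appears on a job in an emitted event (names and query are illustrative):
```json
"job": {
  "namespace": "my-namespace",
  "name": "my-job",
  "facets": {
    "sql": {
      "_producer": "https://github.com/OpenLineage/OpenLineage",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json",
      "query": "SELECT id, total FROM orders"
    }
  }
}
```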

Ernie Ostic (ernie.ostic@getmanta.com)
2024-03-08 08:42:40

*Thread Reply:* Thanks Michael....do we have a list of which providers are known to populate the SQLJobFacet (assuming that the solution emitting the events uses SQL and has access to it)?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-08 08:59:24

*Thread Reply:* @Maciej Obuchowski or @Jakub Dardziński can add more detail, but this doc has a list of operators supported by the SQL parser.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 09:01:13

*Thread Reply:* yeah, so basically any of the operators that are sql-compatible - SQLExecuteQueryOperator + Athena and BQ, I think

Ernie Ostic (ernie.ostic@getmanta.com)
2024-03-08 09:05:45

*Thread Reply:* Thanks! That helps for Airflow. Do we know if any other providers fully support this powerful facet?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 09:07:45

*Thread Reply:* whoa, powerful 😅 I just checked the sources; the only one missing from the above is CopyFromExternalStageToSnowflakeOperator

are you interested in some specific ones?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 09:08:49

*Thread Reply:* and ofc you can have the SQLJobFacet coming from dbt or Spark as well, or any other system triggered via Airflow

Ernie Ostic (ernie.ostic@getmanta.com)
2024-03-08 11:03:36

*Thread Reply:* Thanks Jakub. It will be interesting to know which providers we are certain provide SQL, entirely independently of Airflow.

✅ Sheeri Cabral (Collibra)
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 11:07:50

*Thread Reply:* I don’t think we have any facet-oriented docs (e.g. what produces the SQLJobFacet), and I'm not sure whether that would make sense

Ernie Ostic (ernie.ostic@getmanta.com)
2024-03-08 11:14:40

*Thread Reply:* Thanks. Ultimately, it's a bigger question that we've talked about before, about best ways to document and validate what things/facets you can support/consume (as a consumer) or which you support/populate as a provider.

✅ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 11:16:05

*Thread Reply:* The doc that @Michael Robinson shared is automatically generated from the Airflow code, so it's the best option for built-in operators. If we're talking about providers/operators outside the Airflow repo, then I think @Julien Le Dem’s registry proposal would best support that need

☝️ Jakub Dardziński, Ernie Ostic
Athitya Kumar (athityakumar@gmail.com)
2024-03-07 23:44:08

Hey team. Is column/attribute-level lineage supported for input/output Kafka topics in the OpenLineage Flink listener?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-08 02:07:58

*Thread Reply:* Column level lineage is currently not supported for Flink

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-03-08 04:57:20

Is it possible to explain the "OTHER" run state to me, and whether we can use it to send lineage events to check the health of a service that runs in the background and is triggered at intervals? It would be really helpful if someone could send an example JSON for the "OTHER" run state

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-08 05:17:19

*Thread Reply:* The example idea behind other was: imagine a system that requests compute resources and would like to emit an OpenLineage event about the request being made. That's why other can occur before start. The other idea was to allow other elsewhere to provide agility for new scenarios. However, we want to restrict which event types are terminating ones, and we don't want other among them. This is important for lineage consumers: when they receive a terminating event for a given run, they know all the events related to the run have been emitted.
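A minimal OTHER event could look like this (a sketch; the namespace, job name, runId and producer are illustrative):
```json
{
  "eventType": "OTHER",
  "eventTime": "2024-03-08T10:00:00.000Z",
  "run": { "runId": "d290f1ee-6c54-4b01-90e6-d701748f0851" },
  "job": { "namespace": "my-namespace", "name": "my-background-service" },
  "producer": "https://github.com/my-org/my-producer",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent"
}
```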

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-03-08 05:38:21

*Thread Reply:* @Paweł Leszczyński Is it possible to track the health of a service by using OpenLineage events? If so, how? As an example, I have a Windows service, and I want to make sure the service is up and running.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-08 05:53:58

*Thread Reply:* depends on what you mean by service. If you consider a data processing job a service, then you can track whether it successfully completes.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 07:08:46

*Thread Reply:* I think other systems would be better suited for healthchecks, like OpenTelemetry or Datadog

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 07:22:03

hey there, trying to configure databricks spark with the openlineage spark listener 🧵

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 07:22:52

*Thread Reply:* databricks runtime for clusters: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
we are shipping a global init script that looks like the following:
```bash
#!/bin/bash
VERSION="1.9.1"
SCALA_VERSION="2.12"
wget -O /mnt/driver-daemon/jars/openlineage-spark_$${SCALA_VERSION}-$${VERSION}.jar https://repo1.maven.org/maven2/io/openlineage/openlineage-spark_$${SCALA_VERSION}/$${VERSION}/openlineage-spark_$${SCALA_VERSION}-$${VERSION}.jar

SPARK_DEFAULTS_FILE="/databricks/driver/conf/00-openlineage-defaults.conf"

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  cat > $SPARK_DEFAULTS_FILE <<- EOF
[driver] {
  "spark.extraListeners" = "com.databricks.backend.daemon.driver.DBCEventLoggingListener,io.openlineage.spark.agent.OpenLineageSparkListener"
  "spark.openlineage.version" = "v1"
  "spark.openlineage.transport.type" = "http"
  "spark.openlineage.transport.url" = "https://some.url"
  "spark.openlineage.dataset.removePath.pattern" = "(\/[a-z]+[-a-zA-Z0-9]+)+(?<remove>.*)"
  "spark.openlineage.namespace" = "some_namespace"
}
EOF
fi
```
with openlineage-spark 1.9.1

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 07:23:38

*Thread Reply:* getting fatal exceptions:
```
24/03/07 14:14:05 ERROR DatabricksMain$DBUncaughtExceptionHandler: Uncaught exception in thread spark-listener-group-shared!
java.lang.NoClassDefFoundError: com/databricks/sdk/scala/dbutils/DbfsUtils
	at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDbfsUtils(DatabricksEnvironmentFacetBuilder.java:124)
	at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksEnvironmentalAttributes(DatabricksEnvironmentFacetBuilder.java:92)
	at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:58)
```
and the spark driver crashing when spark runs

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 07:28:43

*Thread Reply:* browsing the code for 1.9.1 shows that the exception comes from trying to access the class for databricks dbfsutils here

should I file a bug on github, or am I doing something very wrong here?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 07:53:00

*Thread Reply:* Looks like something has changed in Databricks 14 🤔

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 07:53:17

*Thread Reply:* Issue on GitHub is the right way

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 07:53:49

*Thread Reply:* thanks, opening one now with this information.

Efthymios Hadjimichael (ehadjimichael@id5.io)
2024-03-08 09:21:24

*Thread Reply:* link to issue for anyone interested, thanks again!

👍 Maciej Obuchowski
Abdallah (abdallah@terrab.me)
2024-03-15 10:09:00

*Thread Reply:* Hi @Maciej Obuchowski I am having the same issue with older versions of Databricks.

Abdallah (abdallah@terrab.me)
2024-03-18 02:47:30

*Thread Reply:* I don't think the Spark integration is working anymore for any of the Databricks environments, not only version 14.

➕ Tristan GUEZENNEC -CROIX-
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-18 07:14:09

*Thread Reply:* @Abdallah are you willing to provide PR?

Abdallah (abdallah@terrab.me)
2024-03-18 11:51:20

*Thread Reply:* I am having a look

Abdallah (abdallah@terrab.me)
2024-03-20 04:45:02

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2530

slackbot
2024-03-08 12:04:26

This message was deleted.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:13:32

*Thread Reply:* is what you sent an event for a DAG or a task?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:22:32

*Thread Reply:* so far Marquez cannot show the job hierarchy (a DAG is the parent of its tasks), so you need to click on one of the tasks in the UI to see the proper view

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:33:25

*Thread Reply:* is this the only job listed?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:33:37

*Thread Reply:* no, I can see 191 total

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:34:22

*Thread Reply:* what if you choose any other job that has ACustomingestionDag. prefix?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:39:24

*Thread Reply:* you also have namespaces in the upper right corner. datasets are probably in a different namespace than the Airflow jobs

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:47:52

*Thread Reply:* https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/supported_classes.html

this is the list of supported operators currently

not all of them send dataset information, e.g. PythonOperator

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-08 14:06:35

hi everyone!

I configured openlineage + marquez for my Amazon managed Apache Airflow (MWAA) to get better insights into the DAGs. For implementation I followed the https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/ guide, using the helm/k8s option. Marquez is up and running and I can see my DAGs and dependent DAGs in the jobs section. However, when clicking on any of the DAGs in the jobs list, I see only one job without any dependencies. I would like to see the whole chain of task execution. How can I achieve this goal? Please advise.

additional information: we don't have Datasets in our MWAA. MWAA Airflow - v. 2.7.2
Openlineage plugin.py:
```python
from airflow.plugins_manager import AirflowPlugin
from airflow.models import Variable
import os

os.environ["OPENLINEAGE_URL"] = Variable.get('OPENLINEAGE_URL', default_var='')

class EnvVarPlugin(AirflowPlugin):
    name = "env_var_plugin"
```
requirements.txt:
```
httplib2
urllib3
oauth2client
bingads
pymssql
certifi
facebook_business
mysql-connector-python
google-api-core
google-auth
google-api-python-client
apiclient
google-auth-httplib2
google-auth-oauthlib
pymongo
pandas
numpy
pyarrow
apache-airflow-providers-openlineage
```

Also, where can I find the meaning of the Depth, complete mode, and compact nodes options? I believe they are view options?

Thank you in advance for your help!

Willy Lulciuc (willy@datakin.com)
2024-03-08 14:17:50

*Thread Reply:* Jobs may not have any dependencies depending on the Airflow operator used (ex: PythonOperator). Can you provide the OL events for the job you expect to have inputs/outputs? In the Marquez Web UI, you can use the events tab:

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-08 14:42:14

*Thread Reply:* I expect to see dependencies for all my jobs. I was hoping Marquez would show a similar view to Airflow's, making it easier to troubleshoot failed DAGs. Please refer to the image below.

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-08 17:02:09

*Thread Reply:* is this what you requested?

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-11 10:19:28

*Thread Reply:* hello! @Willy Lulciuc could you please guide me further? what can be done to see the whole chain of DAG execution in openlineage/marquez?

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-11 14:42:01

*Thread Reply:*
```python
from textwrap import dedent
import mysql.connector
import pymongo
import logging
import sys
import ast
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.models import Variable
from bson.objectid import ObjectId
```
we do use PythonOperator, however we are specifying task dependencies in the DAG code, for example:

```python
error_task = PythonOperator(
    task_id='error',
    python_callable=error,
    dag=dag,
    trigger_rule="one_failed"
)

transformed_task >> generate_dict >> api_trigger_dependent_dag >> error_task
```
for this case, is there a way to get a detailed view in the Marquez Web UI?

Nargiza Fernandez (nargizafernandez@gmail.com)
2024-03-11 14:50:17

*Thread Reply:* @Jakub Berezowski hello! Could you please take a look at my case and advise what can be done, whenever you have time? Thank you!

Suresh Kumar (ssureshkumar6@gmail.com)
2024-03-10 04:35:02

Hi All, I'm based out of Sydney and we are using OpenLineage on our Azure data platform. I'm looking for some direction and support on where we're currently stuck: lineage creation from Spark (Azure Synapse Analytics) PySpark fails to emit lineage when some complex transformations are happening. The OpenLineage version we are currently using is v0.18 and the Spark version is 3.2.

Kacper Muda (kacper.muda@getindata.com)
2024-03-11 03:54:43

*Thread Reply:* Hi, could you provide some more details on the issue you are facing? Some debug logs, the specific error message, the pyspark code that causes the issue? Also, the current OpenLineage version is 1.9.1; is there any reason you are using the outdated 0.18?

Suresh Kumar (ssureshkumar6@gmail.com)
2024-03-11 19:15:18

*Thread Reply:* Thanks for the heads-up. We are in the process of upgrading the library and will get back to you.

Kylychbek Zhumabai uulu (kylychbekeraliev2000@gmail.com)
2024-03-11 12:51:09

Hello everyone, has anyone integrated AWS MWAA with OpenLineage? I'm trying it but it is not working. Can you give me some ideas and steps if you have experience with this?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 12:37:47

@channel This month's TSC meeting, open to all, is tomorrow at 9:30 PT. The updated agenda includes exciting news of new integrations and presentations by @Damien Hawes and @Paweł Leszczyński. Hope to see you there! https://openlineage.slack.com/archives/C01CK9T7HKR/p1709756566788589

🚀 Mattia Bertorello, Maciej Obuchowski, Sheeri Cabral (Collibra), Paweł Leszczyński
Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-13 10:28:41

Hi team.. If we are trying to send OpenLineage events from a Spark job to a Kafka endpoint that requires keystore and truststore properties to be configured, how can we configure them?

Kacper Muda (kacper.muda@getindata.com)
2024-03-13 10:33:48

*Thread Reply:* Hey, check out these docs and the spark.openlineage.transport.properties.[xxx] configuration. Is this what you are looking for?
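As an example, keystore/truststore settings are plain Kafka producer properties passed through that prefix (a sketch; the broker, topic, paths and passwords are illustrative):
```
spark.openlineage.transport.type=kafka
spark.openlineage.transport.topicName=openlineage.events
spark.openlineage.transport.properties.bootstrap.servers=broker1:9093
spark.openlineage.transport.properties.security.protocol=SSL
spark.openlineage.transport.properties.ssl.truststore.location=/etc/security/kafka.truststore.jks
spark.openlineage.transport.properties.ssl.truststore.password=changeit
spark.openlineage.transport.properties.ssl.keystore.location=/etc/security/kafka.keystore.jks
spark.openlineage.transport.properties.ssl.keystore.password=changeit
```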

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-13 11:08:49

*Thread Reply:* Yes... Thanks

Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-13 11:46:09

Hello all 👋! Has anyone tried to use Spark UDFs with OpenLineage? Does it make sense that column-level lineage stops working in this context?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-15 08:47:54

*Thread Reply:* did you investigate whether it still works at the table level?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-15 08:49:50

*Thread Reply:* (I haven’t tried it, but looking at Spark UDFs it seems there are many differences - https://medium.com/@suffyan.asad1/a-deeper-look-into-spark-user-defined-functions-537c6efc5fb3 - nothing is jumping out at me as “this is why it doesn’t work” though.)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-14 03:49:21

This week brought us many fixes to the Flink integration:
• #2507, which resolves critical issues introduced in the recent release
• #2508, which makes JDBC dataset naming consistent with the dataset naming convention and adds common code for Spark & Flink to extract the dataset identifier from the JDBC connection URL
• #2512, which includes the database schema in the dataset identifier for the JDBC integration in Flink
These are significant improvements and I think they should not wait for the next release cycle. I would like to start a vote for an immediate release.

➕ Kacper Muda, Paweł Leszczyński, Mattia Bertorello, Maciej Obuchowski, Harel Shein, Damien Hawes, Peter Huang
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 10:46:42

*Thread Reply:* Thanks, all. The release is approved.

🙌 Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 15:26:58

*Thread Reply:* Changelog PR is here: https://github.com/OpenLineage/OpenLineage/pull/2516

Labels
documentation
Michael Robinson (michael.robinson@astronomer.io)
2024-03-15 11:05:02

@channel We released OpenLineage 1.10.2, featuring:

Additions
• Dagster: add new provider for version 1.6.10 #2518 @JDarDagran
• Flink: support lineage for a hybrid source #2491 @HuangZhenQiu
• Flink: bump Flink JDBC connector version #2472 @HuangZhenQiu
• Java: add a OpenLineageClientUtils#loadOpenLineageJson(InputStream) and change OpenLineageClientUtils#loadOpenLineageYaml(InputStream) methods #2490 @d-m-h
• Java: add info from the HTTP response to the client exception #2486 @davidjgoss
• Python: add support for MSK IAM authentication with a new transport #2478 @mattiabertorello

Removal
• Airflow: remove redundant information from facets #2524 @kacpermuda

Fixes
• Airflow: proceed without rendering templates if task_instance copy fails #2492 @kacpermuda
• Flink: fix class not found issue for Cassandra #2507 @pawel-big-lebowski
• Flink: refine the JDBC table name #2512 @HuangZhenQiu
• Flink: fix JDBC dataset naming #2508 @pawel-big-lebowski
• Flink: fix failure due to missing Cassandra classes #2507 @pawel-big-lebowski
• Flink: fix release runtime dependencies #2504 @HuangZhenQiu
• Spark: fix the HttpTransport timeout #2475 @pawel-big-lebowski
• Spark: prevent NPE if the context is null #2515 @pawel-big-lebowski
• Spec: improve Cassandra lineage metadata #2479 @HuangZhenQiu

Thanks to all the contributors with a shout out to @Maciej Obuchowski for the after-hours CI fix!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.10.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.9.1...1.10.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🚀 Maciej Obuchowski, Kacper Muda, Mattia Bertorello, Paweł Leszczyński
🔥 Maciej Obuchowski, Mattia Bertorello, Paweł Leszczyński, Peter Huang
GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 08:12:43

Hi, I am new to OpenLineage. Can someone help me understand how exactly it is set up, and how I can set it up on my personal laptop and play with it to gain hands-on experience?

Kacper Muda (kacper.muda@getindata.com)
2024-03-18 08:15:17

*Thread Reply:* Hey, check out our Getting Started guide and the whole documentation on Python, Java, Spark etc., where you will find all the information about the setup and configuration. For Airflow>=2.7, there is separate documentation

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 08:52:41

*Thread Reply:* I am getting this error when I follow the commands on my Windows laptop:

git clone git@github.com:MarquezProject/marquez.git && cd marquez/docker
running up.sh --seed
marquez-api | WARNING 'MARQUEZ_CONFIG' not set, using development configuration.
seed-marquez-with-metadata | wait-for-it.sh: waiting 15 seconds for api:5000
marquez-web | [HPM] Proxy created: /api/v1 -> http://api:5000/
marquez-web | App listening on port 3000!
marquez-api | INFO [2024-03-18 12:45:01,702] org.eclipse.jetty.util.log: Logging initialized @1991ms to org.eclipse.jetty.util.log.Slf4jLog
marquez-api | INFO [2024-03-18 12:45:01,795] io.dropwizard.server.DefaultServerFactory: Registering jersey handler with root path prefix: /
marquez-api | INFO [2024-03-18 12:45:01,796] io.dropwizard.server.DefaultServerFactory: Registering admin handler with root path prefix: /
marquez-api | INFO [2024-03-18 12:45:01,797] io.dropwizard.assets.AssetsBundle: Registering AssetBundle with name: graphql-playground for path /graphql-playground/**
marquez-api | INFO [2024-03-18 12:45:01,807] marquez.MarquezApp: Running startup actions...
marquez-api | INFO [2024-03-18 12:45:01,842] org.flywaydb.core.internal.license.VersionPrinter: Flyway Community Edition 8.5.13 by Redgate
marquez-api | INFO [2024-03-18 12:45:01,842] org.flywaydb.core.internal.license.VersionPrinter: See what's new here: https://flywaydb.org/documentation/learnmore/releaseNotes#8.5.13
marquez-api | INFO [2024-03-18 12:45:01,842] org.flywaydb.core.internal.license.VersionPrinter:
marquez-db | 2024-03-18 12:45:02.039 GMT [34] FATAL: password authentication failed for user "marquez"
marquez-db | 2024-03-18 12:45:02.039 GMT [34] DETAIL: Role "marquez" does not exist.
marquez-db | Connection matched pg_hba.conf line 100: "host all all all scram-sha-256"
marquez-api | ERROR [2024-03-18 12:45:02,046] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"

Do I have to do any additional setup to run Marquez locally?

Kacper Muda (kacper.muda@getindata.com)
2024-03-18 09:02:47

*Thread Reply:* I don't think OpenLineage and Marquez support Windows in any way

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-18 09:04:57

*Thread Reply:* But another way to explore OL and Marquez is with GitPod: https://github.com/MarquezProject/marquez?tab=readme-ov-file#try-it

Michael Robinson (michael.robinson@astronomer.io)
2024-03-18 09:05:17

*Thread Reply:* Also, @GUNJAN YADU have you tried deleting all volumes and starting over?

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 09:10:49

*Thread Reply:* Volumes as in?

Kacper Muda (kacper.muda@getindata.com)
2024-03-18 09:12:21

*Thread Reply:* Probably docker volumes, you can find them in docker dashboard app:

👍 Michael Robinson
GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 09:13:44

*Thread Reply:* Okay, it's a password authentication failure. So do I have to do any kind of Postgres setup or environment variable setup?

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 09:24:29

*Thread Reply:* marquez-db | 2024-03-18 13:19:37.211 GMT [36] FATAL: password authentication failed for user "marquez"
marquez-db | 2024-03-18 13:19:37.211 GMT [36] DETAIL: Role "marquez" does not exist.

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-18 10:11:43

*Thread Reply:* Setup is successful

Michael Robinson (michael.robinson@astronomer.io)
2024-03-18 11:20:43

*Thread Reply:* @GUNJAN YADU can you share what steps you took to make it work?

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-19 00:14:17

*Thread Reply:* First I cleared the volumes. Then I did the steps mentioned in the link you shared, in Git Bash. It worked then.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-19 09:00:19

*Thread Reply:* Ah, so you used GitPod?

GUNJAN YADU (gunjanyadu6@gmail.com)
2024-03-21 00:35:58

*Thread Reply:* No I haven’t. I ran all the commands in git bash

👍 Michael Robinson
Rohan Doijode (doijoderohan882@gmail.com)
2024-03-19 08:06:07

Hi everyone!

I'm a beginner with this tool.

My name is Rohan and I'm facing challenges with Marquez. I have followed the steps as mentioned on the website and am facing this error. Please check the attached picture.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-19 09:35:16

*Thread Reply:* Hi Rohan, welcome! There are a number of guides across the OpenLineage and Marquez sites. Would you please share a link to the guide you are using? Also, terminal output as well as version and system information would be helpful. The issue could be a simple config problem or more complicated, but it's impossible to say from the screenshot.

Rohan Doijode (doijoderohan882@gmail.com)
2024-03-20 01:47:22

*Thread Reply:* Hi Michael Robinson,

Thank you for getting back to me on this.

The link I used for installation: https://openlineage.io/getting-started/

I have attached the terminal output.

Docker version: 25.0.3, build 4debf41

Rohan Doijode (doijoderohan882@gmail.com)
2024-03-20 01:48:55

*Thread Reply:* Continuing above thread with a screenshot :

Michael Robinson (michael.robinson@astronomer.io)
2024-03-20 11:25:36

*Thread Reply:* Thanks for the details, @Rohan Doijode. Unfortunately, Windows isn't currently supported. To explore OpenLineage+Marquez on Windows we recommend using this pre-configured Marquez Gitpod environment.

Rohan Doijode (doijoderohan882@gmail.com)
2024-03-21 00:49:41

*Thread Reply:* Hi @Michael Robinson,

Thank you for your input.

My issue has been resolved.

🎉 Michael Robinson
Kacper Muda (kacper.muda@getindata.com)
2024-03-19 11:37:02

Hey team! Quick check - has anyone submitted or is planning to submit a CFP for this year's Airflow Summit with an OL talk? Let me know! 🚀

➕ Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-19 11:40:11

*Thread Reply:* https://sessionize.com/airflow-summit-2024/

Michael Robinson (michael.robinson@astronomer.io)
2024-03-19 11:40:22

*Thread Reply:* the CFP is scheduled to close on April 17

Kacper Muda (kacper.muda@getindata.com)
2024-03-19 11:40:59

*Thread Reply:* Yup. I was thinking about submitting one, but don't want to overlap with someone that already did 🙂

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-19 14:54:06

Hey Team, we are using MWAA (AWS Managed Airflow), which is on version 2.7.2, so we are making use of the Airflow-provided OpenLineage packages. We have a simple test DAG which uses BashOperator, and we would like to use manually annotated lineage, so we have provided the inlets and outlets. But when I run the job, I see the error: Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.bash.BashExtractor object at 0x7f9446276190> - section/key [openlineage/disabled_for_operators]. Do I need to make any configuration changes?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-19 15:26:53

*Thread Reply:* hey, there’s a fix for that: https://github.com/apache/airflow/pull/37994 not released yet.

Unfortunately, before the release you need to manually set missing entries in configuration

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-19 16:15:18

*Thread Reply:* Thanks @Jakub Dardziński. So the temporary fix is to set disabled_for_operators for the unsupported operators? If I do that, do I get my lineage emitted for BashOperator with the manually annotated information?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-19 16:15:59

*Thread Reply:* I think you should set it for disabled_for_operators, config_path and transport entries (maybe you’ve set some of them already)

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-19 16:23:25

*Thread Reply:* Ok . Thanks. Yes I did them already.

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-19 22:03:04

*Thread Reply:* These are my configurations. It's emitting run events only. I have my manually annotated lineage defined for the BashOperator. When I provide disabled_for_operators, I don't see any errors, but the log clearly says "Skipping extraction for operator BashOperator", so I don't see the inlets & outlets info in Marquez. If I don't provide disabled_for_operators, it fails with the error "Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.bash.BashExtractor object at 0x7f9446276190> - section/key [openlineage/disabled_for_operators]". So I cannot go either way. Any workaround? Or am I making some mistake?

Kacper Muda (kacper.muda@getindata.com)
2024-03-20 02:28:53

*Thread Reply:* Hey @Anand Thamothara Dass, make sure to simply set config_path, disabled_for_operators and transport to empty strings, unless you actually want to use them (e.g. leave transport as it is if it contains the configuration for the backend). The current issue is that when no variables are found, the error is raised no matter what the actual value is - they simply need to be present in the configuration, even as empty strings.

In your setup I see that you included BashOperator in disabled, so that's why it's ignored.
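(For illustration, a sketch of the equivalent environment-variable form; Airflow maps AIRFLOW__<SECTION>__<KEY> variables onto airflow.cfg sections, and the transport URL here is a placeholder:)
```
import os

os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""
os.environ["AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS"] = ""
# keep a real transport config if you have one; this URL is a placeholder
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = (
    '{"type": "http", "url": "http://marquez:5000", "endpoint": "api/v1/lineage"}'
)
```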

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-20 12:03:55

*Thread Reply:* Hmm, strange. Setting them to empty strings worked. When I display it in the console, I am able to see all the outlets information. But when I transport it to the Marquez endpoint, I see only run events. No dataset information is captured in Marquez. But when I build the payload myself outside Airflow and push it using Postman, I am able to see the dataset information in Marquez as well. So I don't know where the issue is - Airflow, OpenLineage or Marquez 😕

Kacper Muda (kacper.muda@getindata.com)
2024-03-20 12:07:07

*Thread Reply:* Could you share your DAG code and task logs for that operator? I think if you use BashOperator and attach inlets and outlets to it, it should work just fine. Also, please share the name and version of the OL package you are using

Anand Thamothara Dass (anand_thamotharadass@cable.comcast.com)
2024-03-20 14:57:40

*Thread Reply:* @Kacper Muda - Got that fixed. I had {"type": "http", "url": "http://<ip>:3000", "endpoint": "api/v1/lineage"} and removed the endpoint, keeping only {"type": "http", "url": "http://<ip>:3000"}. It worked. I didn't think that v1/lineage would force capturing run events only. Thanks for all the support!!!

👍 Jakub Dardziński, Kacper Muda
Rohan Doijode (doijoderohan882@gmail.com)
2024-03-21 07:44:29

Hi all,

We are planning to use OL as our data lineage tool.

We have data in S3 and use AWS Kinesis. We are looking for guidelines on generating a graphical representation in Marquez or another compatible tool.

This includes column-level lineage and metadata captured during ETL.

Thank you in advance

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 09:06:06

Hello all, we are struggling with a spark integration with AWS Glue. We have gotten to a configuration that is not causing errors in spark, but it’s not producing any output in the S3 bucket. Can anyone help figure out what’s wrong? (code in thread)

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 09:06:35

*Thread Reply:* ```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkConf
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
print(f'the job name received is : {args["JOB_NAME"]}')

spark1 = SparkSession.builder.appName("OpenLineageExample") \
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "file") \
    .config("spark.openlineage.transport.location", "") \
    .config("spark.openlineage.namespace", "AWSGlue") \
    .getOrCreate()

glueContext = GlueContext(sc)

# Initialize the glue context
sc = SparkContext(spark1)

glueContext = GlueContext(spark1)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args["JOB_NAME"], args)

df = spark.read.format("csv").option("header", "true").load("s3://<bucket>/input/Master_Extract/")
df.write.format('csv').option('header', 'true').save('
```

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 09:07:05

*Thread Reply:* cc @Rodrigo Maia since I know you’ve done some AWS glue

Damien Hawes (damien.hawes@booking.com)
2024-03-21 11:41:39

*Thread Reply:* Several things:

  1. s3 isn't a file system. It is an object storage system. Concretely, this means when an object is written, it's immutable. If you want to update the object, you need to read it in its entirety, modify it, and then write it back.
  2. Java probably doesn't know how to handle the s3 protocol.
Damien Hawes (damien.hawes@booking.com)
2024-03-21 11:41:54

*Thread Reply:* (As opposed to the file protocol)

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 12:05:15

*Thread Reply:* OK, so the problem is we’ve set it to config(“spark.openlineage.transport.type”, “file”) and then given it s3:// instead of a file path…

But it’s AWS Glue so we don’t have a local filesystem to save it to.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 12:05:55

*Thread Reply:* (I also hear you that S3 isn’t an ideal place for concatenating to a logfile because you can’t concatenate)

Damien Hawes (damien.hawes@booking.com)
2024-03-21 12:20:46

*Thread Reply:* Unfortunately, I have zero experience with Glue.

Several approaches:

  1. Emit to Kafka (you can use MSK)
  2. Emit to Kinesis
  3. Emit to Console (perhaps a centralised logging tool, like Cloudwatch will pick it up)
  4. Emit to a local file, but I have no idea how you retrieve that file.
  5. Emit to an HTTP endpoint
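(For illustration, a sketch of options 3 and 5 applied to the session config from the snippet earlier in the thread; the hostname is a placeholder, and the property names follow the OpenLineage Spark configuration docs:)
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("OpenLineageExample")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.namespace", "AWSGlue")
    # option 3: console transport - events go to the driver log (on Glue, typically CloudWatch)
    .config("spark.openlineage.transport.type", "console")
    # option 5: HTTP transport - swap in these two lines to emit to a Marquez-style endpoint
    # .config("spark.openlineage.transport.type", "http")
    # .config("spark.openlineage.transport.url", "http://<marquez-host>:5000")
    .getOrCreate()
)
```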
☝️ Maciej Obuchowski
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 12:22:25

*Thread Reply:* I appreciate some ideas for next steps

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-21 12:22:30

*Thread Reply:* Thank you

Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-21 12:25:30

*Thread Reply:* did you try transport console to check if the OL setup is working? regardless of i/o, it should put something in the logs with an event.

👀 Sheeri Cabral (Collibra)
Damien Hawes (damien.hawes@booking.com)
2024-03-21 12:36:41

*Thread Reply:* Assuming the log4j[2].properties file is configured to allow the io.openlineage package to log at the appropriate level.

👀 Sheeri Cabral (Collibra)
tati (tatiana.alchueyr@astronomer.io)
2024-03-22 07:01:47

*Thread Reply:* @Sheeri Cabral (Collibra), did you try to use a different transport type, as suggested by @Damien Hawes in https://openlineage.slack.com/archives/C01CK9T7HKR/p1711038046057459?thread_ts=1711026366.869199&cid=C01CK9T7HKR? And described in the docs: https://openlineage.io/docs/integrations/spark/configuration/transport#file

Or would you like for the OL spark driver to support an additional transport type (e.g. s3) to emit OpenLineage events?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-03-22 09:40:39

*Thread Reply:* I will try different transport types, haven’t gotten a chance to yet.

🙌 tati
tati (tatiana.alchueyr@astronomer.io)
2024-03-25 07:05:17

*Thread Reply:* Thanks, @Sheeri Cabral (Collibra); please let us know how it goes!

Pooja K M (pooja.km@philips.com)
2024-04-02 05:06:26

*Thread Reply:* @Sheeri Cabral (Collibra) did you tried on the other transport types by any chance?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 08:32:20

*Thread Reply:* Sorry, with the holiday long weekend in Europe things are a bit slow. We did, and I just put a message in the #general chat https://openlineage.slack.com/archives/C01CK9T7HKR/p1712147347085319 as we are getting some errors with the spark integration.

Rodrigo Maia (rodrigo.maia@manta.io)
2024-03-22 14:45:12

I've been testing around with different Spark versions. Does anyone know if OpenLineage works with Spark 2.4.4 (Scala 2.12.10)? I've been getting a lot of errors, but I've only tried versions 1.8+.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 16:36:32

*Thread Reply:* Hi @Rodrigo Maia, OpenLineage does not officially support Spark 2.4.4. The earliest version supported is 2.4.6. See this doc for more information about the supported versions of Spark, Airflow, Dagster, dbt, and Flink.

👍 Rodrigo Maia
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-23 04:15:27

*Thread Reply:* OpenLineage CI runs against 2.4.6 and it is passing. I wouldn't expect any breaking differences between 2.4.4 and 2.4.6, but please let us know if this is the case.

👍 Rodrigo Maia
Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 15:18:52

@channel Thanks to everyone who attended our first Boston meetup, co-sponsored by Astronomer and Collibra and featuring presentations by partners at Collibra, Astronomer and DataDog, this past Tuesday at Microsoft New England. Shout out to @Sheeri Cabral (Collibra), @Jonathan Morin, and @Paweł Leszczyński for presenting and to Sheeri for co-hosting! Topics included:
• "2023 in OpenLineage," a big year that saw:
◦ 5 new integrations,
◦ the Airflow Provider launch,
◦ the addition of static/"design-time" lineage in 1.0.0,
◦ the addition of column lineage from SQL statements via the SQL parser,
◦ and 22 releases.
• A demo of Marquez, which now supports column-level lineage in a revamped UI
• Discussion of "Why Do People Use Lineage?" by Sheeri at Collibra, covering:
◦ differences between design and operational lineage,
◦ use cases served such as compliance, traceability/provenance, impact analysis, migration validation, and quicker onboarding,
◦ features of Collibra's lineage
• A demo of streaming support in the Apache Flink integration by Paweł at Astronomer, illustrating lineage from:
◦ a Flink job reading from a Kafka topic to Postgres,
◦ a few SQL jobs running queries in Postgres,
◦ a Flink job taking a Postgres table and publishing it back to Kafka
• A demo of an OpenLineage integration POC at DataDog by Jonathan, covering:
◦ use cases served by DataDog's Data Streams Monitoring service
◦ OpenLineage's potential role providing and standardizing cross-platform lineage for DataDog's monitoring platform.
Thanks to Microsoft for providing the space. If you're interested in attending, presenting at, or hosting a future meetup, please reach out.

🙌 Jonathan Morin, Harel Shein, Rodrigo Maia, Maciej Obuchowski
:datadog: Harel Shein, Paweł Leszczyński, Rodrigo Maia, Maciej Obuchowski, Jean-Mathieu Saponaro
👏 Peter Huang, Rodrigo Maia, tati
🎉 tati
❤️ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-25 07:08:21

*Thread Reply:* Hey @Michael Robinson, was the meetup recorded?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-25 09:26:04

*Thread Reply:* @Maciej Obuchowski yes, and a clip is on YouTube. Hoping to have @Jonathan Morin’s clip posted soon, as well

❤️ Sheeri Cabral (Collibra)
Stefan Krawczyk (stefan@dagworks.io)
2024-03-22 19:57:48

Airflow 2.8.3 Python 3.11 Trying to do a hello world lineage example using this simple bash operator DAG — but I don’t have anything emitting to my marquez backend. I’m running airflow locally following docker-compose setup here. More details in thread:

Stefan Krawczyk (stefan@dagworks.io)
2024-03-22 19:59:45

*Thread Reply:* Here is my airflow.cfg:
```
[webserver]
expose_config = 'True'

[openlineage]
config_path = ''
transport = '{"type": "http", "url": "http://localhost:5002", "endpoint": "api/v1/lineage"}'
disabled_for_operators = ''
```

Stefan Krawczyk (stefan@dagworks.io)
2024-03-22 20:01:15

*Thread Reply:* I can curl my marquez backend just fine — but yeah not seeing anything emitted by airflow

Stefan Krawczyk (stefan@dagworks.io)
2024-03-22 20:19:44

*Thread Reply:* Have I missed something in the set-up? Is there a way I can validate the config was ingested correctly?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-23 03:42:40

*Thread Reply:* Can you see any logs related to OL in Airflow? Is Marquez in the same docker compose? Maybe try changing to host.docker.internal from localhost

Stefan Krawczyk (stefan@dagworks.io)
2024-03-24 00:51:31

*Thread Reply:* So I figured it out. For reference, the issue was that ./config wasn’t for airflow.cfg as I had blindly interpreted it to be. Instead, setting the OpenLineage values as environment variables worked.

Stefan Krawczyk (stefan@dagworks.io)
2024-03-24 01:01:47

*Thread Reply:* Otherwise for the simple DAG with just BashOperators, I was expecting to see a similar “lineage” DAG in marquez, but I only see individual jobs. Is that expected?

Formulating my question differently, does the open lineage data model assume a bipartite type graph, of Job -> Dataset -> Job -> Dataset etc always? Seems like there would be cases where you could have Job -> Job where there is no explicit “data artifact produced”?

Stefan Krawczyk (stefan@dagworks.io)
2024-03-24 02:13:30

*Thread Reply:* Another question — is there going to be integration with the “datasets” & inlets/outlets concept airflow now has? E.g. I would expect the OL integration to capture this:

```
# [START dataset_def]
dag1_dataset = Dataset("", extra={"hi": "bye"})
# [END dataset_def]

with DAG(
    dag_id="dataset_produces_1",
    catchup=False,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule="@daily",
    tags=["produces", "dataset-scheduled"],
) as dag1:
    # [START task_outlet]
    BashOperator(outlets=[dag1_dataset], task_id="producing_task_1", bash_command="sleep 5")
    # [END task_outlet]
```
i.e. the outlets part. Currently it doesn’t seem to.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-25 03:47:29

*Thread Reply:* OL only converts File and Table entities so far from manual inlets and outlets
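(A sketch of what convertible manual annotations can look like, assuming the airflow.lineage.entities classes; the names and paths are illustrative:)
```
from airflow.lineage.entities import File, Table
from airflow.operators.bash import BashOperator

annotated = BashOperator(
    task_id="annotated_task",
    bash_command="sleep 5",
    inlets=[File(url="s3://my-bucket/input/file.csv")],
    outlets=[Table(cluster="prod", database="analytics", name="daily_rollup")],
)
```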

👍 Stefan Krawczyk
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-25 05:00:22

*Thread Reply:* on the Job -> Dataset -> Job -> Dataset: OL and Marquez do not aim to reflect Airflow DAGs. They rather focus on exposing metadata that is collected around data processing

Stefan Krawczyk (stefan@dagworks.io)
2024-03-25 14:27:42

*Thread Reply:* > on the Job -> Dataset -> Job -> Dataset: OL and Marquez do not aim to reflect Airflow DAGs. They rather focus on exposing metadata that is collected around data processing
That makes sense. I was just thinking through the implications and boundaries of what “lineage” is modeled. Thanks

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-25 06:18:05

Hi Team... We have a use case where we want to know when a column of a table gets updated in BigQuery, and we have some questions related to it.

  1. In some of the openlineage events that are generated, outputs.facets.columnLineage is null. Can we assume all the columns get updated when this is the case?
  2. Also outputs.facets.schema seems to be null in some of the events generated. How do we get the schema of the table in this case?
  3. output.namespace is also null in some cases. How do we determine output datasource in this case?
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-25 07:07:02

*Thread Reply:* For BigQuery, we use the BigQuery API to get the lineage, which unfortunately does not present us with column-level lineage. Adding that would be a new feature.

For 2. and 3., it might happen that the result you're reading comes from the query cache, because the query was executed earlier and the data hasn't changed - in that case we won't have full information yet. https://cloud.google.com/bigquery/docs/cached-results
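(If complete lineage matters more than cache savings, one possible workaround is to force query execution; a sketch assuming BigQueryInsertJobOperator, with placeholder SQL:)
```
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

force_run = BigQueryInsertJobOperator(
    task_id="force_run",  # hypothetical task
    configuration={
        "query": {
            "query": "SELECT col FROM `project.dataset.table`",  # placeholder SQL
            "useLegacySql": False,
            "useQueryCache": False,  # skip cached results, which omit lineage details
        }
    },
)
```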

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-25 07:45:04

*Thread Reply:* So, can we assume that if the query is not a duplicate one, fields outputs.facets.schema and output.namespace will not be empty? And ignore the COMPLETE events when those fields are empty as they are not providing any new updates?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-25 07:59:55

*Thread Reply:* > So, can we assume that if the query is not a duplicate one, fields outputs.facets.schema and output.namespace will not be empty? Yes, I would assume so. > And ignore the COMPLETE events when those fields are empty as they are not providing any new updates? That probably depends on your use case, different jobs can access same tables/do same queries in that case.

Suhas Shenoy (ksuhasshenoy@gmail.com)
2024-03-25 23:49:46

*Thread Reply:* Okay. We wanted to know how we can determine the output datasource from the events.

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-03-26 01:51:15

Hi Team, currently OpenLineage/Marquez uses a Postgres DB to store the metadata. Instead of Postgres, we want to store it in Snowflake. Do we have any kind of inbuilt configuration in the Marquez application to change the Marquez database to Snowflake? If not, what would the approach be?

Damien Hawes (damien.hawes@booking.com)
2024-03-26 04:50:25

*Thread Reply:* The last time I looked at Marquez (July last year), Marquez was highly coupled to PostgreSQL specific functionality. It had code, particularly for the graph traversal, written in PostgreSQL's PL/pgSQL. Furthermore, it uses PostgreSQL as an OLTP database. My limited knowledge of Snowflake says that it is an OLAP database, this means that it would be a very poor fit for the application. For any migration to another database engine, it would be a large undertaking.

☝️ Maciej Obuchowski
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-26 05:13:25

*Thread Reply:* Hi @Ruchira Prasad, this is not possible at the moment. Marquez splits OL events into a neat relational model to allow efficient lineage queries. I don't think this would be achievable in Snowflake.

As an alternative approach, you can try the fluentd proxy -> https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd Fluentd provides a bunch of useful output plugins that let you send logs into several warehouses (https://www.fluentd.org/plugins); however, I cannot find Snowflake on the list.

On the snowflake side, there is quickstart on how to ingest fluentd logs into it -> https://quickstarts.snowflake.com/guide/integrating_fluentd_with_snowflake/index.html#0

To wrap up: if you need lineage events in Snowflake, you can consider sending events to a Fluentd endpoint and then loading them into Snowflake. In contrast to Marquez, you will query raw events, which may be cumbersome in some cases, like getting several OL events that describe a single run.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-26 05:56:39

*Thread Reply:* Note that supporting (not even migrating) a backend application that can use multiple database engines comes at a huge opportunity cost, and it's not like Marquez has more contributors than it needs 🙂

Ruchira Prasad (ruchiraprasad@gmail.com)
2024-03-26 06:28:47

*Thread Reply:* Since both Postgres and Snowflake support JDBC, can't we point to Snowflake by changing the following?

Damien Hawes (damien.hawes@booking.com)
2024-03-26 06:29:16

*Thread Reply:* It doesn't have anything to do with the driver. JDBC is the driver; it defines the protocol that the communication link must abide by.

Just like how ODBC is a driver, and in the .NET world, how OLE DB is a driver.

It tells us nothing about the capabilities of the database. In this case, using PostgreSQL was chosen because of its capabilities, and because of those capabilities, the application code leverages more of those capabilities than just a generic read / write database. Moving all that logic from PostgreSQL PL/pgSQL to the application would (1) take a significant investment in time; (2) present bugs; (3) slow down the application response time, because you have to make many more round-trips to the database, instead of keeping the code close to the data.

☝️ Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-03-26 06:39:57

*Thread Reply:* If you're still curious, and want to test things out for yourself:

  1. Create a graph structure on a SQL database (edge table, vertex table, relationship table)
  2. Write SQL to perform that traversal
  3. Write Java application code that reads from the database, then tries to perform traversals by again reading data from the database. Measure the performance impact, and you will see that (2) is far quicker than (3). This is one of the reasons why Marquez uses PostgreSQL and leverages its PL/pgSQL capabilities: otherwise the application would be significantly slower for any traversal that is more than a few levels deep, as sketched below.
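(To make the round-trip point concrete, a hypothetical Python sketch; the lineage_edge table and connection string are illustrative, not Marquez's actual schema. The app-side walk issues one query per level, while the in-database version pushes the whole traversal into a single recursive CTE:)
```
import psycopg2  # assumption: an edge table lineage_edge(parent_id, child_id) in PostgreSQL

conn = psycopg2.connect("dbname=lineage_demo")

def downstream_app_side(node_id, max_depth):
    """Approach (3): one query per level, so round-trips grow with graph depth."""
    frontier, seen = {node_id}, set()
    for _ in range(max_depth):
        if not frontier:
            break
        with conn.cursor() as cur:
            cur.execute(
                "SELECT child_id FROM lineage_edge WHERE parent_id = ANY(%s)",
                (list(frontier),),
            )
            frontier = {row[0] for row in cur.fetchall()} - seen
        seen |= frontier
    return seen

def downstream_in_db(node_id):
    """Approach (2): a single round-trip; the traversal runs inside PostgreSQL."""
    with conn.cursor() as cur:
        cur.execute(
            """
            WITH RECURSIVE walk AS (
                SELECT child_id FROM lineage_edge WHERE parent_id = %s
                UNION
                SELECT e.child_id FROM lineage_edge e JOIN walk w ON e.parent_id = w.child_id
            )
            SELECT child_id FROM walk
            """,
            (node_id,),
        )
        return {row[0] for row in cur.fetchall()}
```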
Bipan Sihra (bsihra@amazon.com)
2024-03-26 15:57:50

Hi Team,

Looking for feedback on the below Problem and Proposal.

We are using OpenLineage with our AWS EMR clusters to extract lineage and send it to a backend Marquez deployment (also in AWS). This is working fine and we are getting table and column level lineage.

Problem: We are seeing:
• 15+ OpenLineage events, with multiple jobs being shown in Marquez for a single Spark job in EMR. This causes confusion because team members using Marquez are unsure which "job" in Marquez to look at.
• The S3 locations are being populated in the namespace. We wanted to use the namespace for teams. However, having S3 locations in the namespace in a way "pollutes" the list.
I understand the above are not issues/bugs. However, our users want us to "clean" up the Marquez UI.

Proposal: One idea was to have a Lambda intercept the 10-20 raw OpenLineage events from EMR and then process -> condense them down to 1 event with the job, run, inputs, outputs. And secondly, to swap out the namespace from S3 to actual team names via a lookup we would host ourselves.

While the above proposal technically could work we wanted to check with the team here if it makes sense, any caveats, alternatives others have used. Ideally, we don't want to own parsing OpenLineage events if there is an existing solution.

Bipan Sihra (bsihra@amazon.com)
2024-03-26 15:58:15

*Thread Reply:* Screenshot: 1 spark job = multiple "jobs" in Marquez

Bipan Sihra (bsihra@amazon.com)
2024-03-26 15:58:35

*Thread Reply:* Screenshot: S3 locations in namespace.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-26 16:59:48

*Thread Reply:* Hi @Bipan Sihra, thanks for posting this -- it's exciting to hear about your use case at Amazon! I wonder if you wouldn't mind opening a GitHub issue so we can track progress on this and make sure you get answers to your questions.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-26 17:23:19

*Thread Reply:* Also, would you please share the version of openlineage-spark you are on?

Bipan Sihra (bsihra@amazon.com)
2024-03-27 09:05:09

*Thread Reply:* Hi @Michael Robinson. Sure, I can open a GitHub issue. Also, we are currently using io.openlineage:openlineage-spark_2.12:1.9.1.

👍 Michael Robinson
Tristan GUEZENNEC -CROIX- (tristan.guezennec@decathlon.com)
2024-03-28 09:51:12

*Thread Reply:* @Yannick Libert

Bipan Sihra (bsihra@amazon.com)
2024-03-28 09:52:43

*Thread Reply:* I was able to find info I needed here: https://github.com/OpenLineage/OpenLineage/discussions/597

Ranvir Singh (ranvir.tune@gmail.com)
2024-03-27 07:55:37

Hi Team, we are trying to collect lineage for a Spark job using OpenLineage(v1.8.0) and Marquez (v0.46). We can see the "Schema" details for all "Datasets" created but we can't see "Column-level" lineage and getting "Column lineage not available for the specified dataset" on Marquez UI under "COLUMN LINEAGE" tab.

About Spark Job: The job reads data from few oracle tables using JDBC connections as Temp views in Spark, performs some transformations (joining & aggregations) over different steps, creating intermediate temp views and finally writing the data to HDFS location. So, it looks something like this:

Read oracle tables as temp views -> transformations set1 -> creation of a few more temp views from previously created temp views -> transformations set2, set3 ... -> finally writing to HDFS (when all the temp views get materialised in-memory to create the final output dataset). We are getting the schema details for the finally-written dataset but no column-level lineage for it. Also, while checking the JSON lineage data, I can see "" (blank) for the "inputs" key (just before the "outputs" key, which contains the dataset name & other details in nested key-value form). As per my understanding, this explains the null value for the "columnLineage" key and hence no column-level lineage, but I am unable to understand why!

Appreciate it if you could share some thoughts/ideas on what is going wrong here, as we are stuck on this point. Also, I am not sure whether we can get column-level lineage only for datasets created from permanent Hive tables and not for temp/un-materialised views using OpenLineage & Marquez.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-27 08:54:38

*Thread Reply:* My first guess would be that either some of the interaction between JDBC/views/materialization make the CLL not show, or possibly transformations - if you're doing stuff like UDFs we lose the column-level info, but it's hard to confirm without seeing events and/or some minimal reproduction

Ranvir Singh (ranvir.tune@gmail.com)
2024-03-29 08:48:03

*Thread Reply:* Hi @Maciej Obuchowski, thanks for responding on this. We are using SparkSQL, where we read the data from Oracle tables as temp tables and then run SQL-like queries (for transformation) on the previously created temp tables. Say we want to run a set of transformations: we write the transformation logic as SQL-like queries, so when the first query (query1) gets executed it creates temptable1, then query2 gets executed on temptable1 creating temptable2, and so on. For this use case we have developed a custom function that takes these queries (query1, query2, ...) as input, runs them iteratively, and creates temptable1, temptable2, ... and so on. This custom function uses RDD APIs and built-in functions like collect() along with a few other Scala functions. So I'm not sure whether the usage of RDDs breaks the lineage or what else is going wrong. Lastly, we do have jobs that use UDFs directly in Spark, but we aren't getting CLL even for the jobs that don't use UDFs. Hope this gives some context on how we are running the job.

Ranvir Singh (ranvir.tune@gmail.com)
2024-04-04 13:08:32

*Thread Reply:* Hey @Maciej Obuchowski, appreciate your help/comments on this.

George Tong (george@terradot.earth)
2024-03-27 14:53:44

Hey everyone 👋

I’m working at a carbon capture 🌍 company and we’re designing how we want to store data in our PostgreSQL database at the moment. One of the key things we’re focusing on is traceability and transparency of data, as well as ability to edit and maintain historical data. This is key as if we make an error and need to update a previous data point, we want to know everything downstream of that data point that needs to be rerun and recalculated. You might be able to guess where this is going… • Any advice on how we should be designing our table schemas to support editing and traceability? We’re currently looking using temporal tables • Is Open Lineage the right tool for downstream tracking and traceability? Are there any other tools we should be looking at instead? I’m new here so hopefully I asked in the right channel. Let me know if I should be asking elsewhere!

Kacper Muda (kacper.muda@getindata.com)
2024-03-28 05:55:48

*Thread Reply:* Hey, In my opinion, OpenLineage is the right tool for what you are describing. Together with some backend like Marquez it will allow you to visualize data flow, dependencies (upstreams, downstreams) and more 🙂

🙌 George Tong
Michael Robinson (michael.robinson@astronomer.io)
2024-03-28 15:54:58

*Thread Reply:* Hi George, welcome! To add to what Kacper said, I think it also depends on what you are looking for in terms of "transparency." I guess I'm wondering exactly what you mean by this. A consumer using the OpenLineage standard (like Marquez, which we recommend in general but especially for getting started) will collect metadata about your pipelines' datasets and jobs but won't collect the data itself or support editing of your data. You're probably fully aware of this, but it's a point of confusion sometimes, and since you mentioned transparency and updating data I wanted to emphasize this. I hope this helps!

🙌 George Tong
George Tong (george@terradot.earth)
2024-03-28 19:28:36

*Thread Reply:* Thanks for the thoughts folks! Yes I think my thoughts are starting to become more concrete - retaining a history of data and ensuring that you can always go back to a certain time of your data is different from understanding the downstream impact of a data change, (which is what OpenLineage seems to tackle)

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2024-03-28 03:18:42

Hi team, we're using OL v1.3.1 on Databricks, on a non-terminating cluster. We're seeing that the heap memory is increasing very significantly, and we notice that the majority of the memory comes from OL. Any idea if we're hitting a memory leak in OL? Have any similar issues been reported before? Thanks!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-28 10:43:36

*Thread Reply:* First idea would be to bump version 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-28 10:56:55

*Thread Reply:* Does it affect all the jobs or just some of them? Does it somehow correlate with amount of spark tasks a job is processing? Would you be able to test the behaviour on the jar prepared from the branch? Any other details helping to reproduce this would be nice.

So many questions for the start... Happy to see you again @Anirudh Shrinivason. Can't wait looking into this next week.

Damien Hawes (damien.hawes@booking.com)
2024-03-28 11:12:23

*Thread Reply:* FYI - this is my experience as discussed on Tuesday @Paweł Leszczyński @Maciej Obuchowski

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2024-04-01 05:31:09

*Thread Reply:* Hey @Maciej Obuchowski @Paweł Leszczyński Thanks for the questions! Here are some details and clarifications I have:

  1. First idea would be to bump version Has such an issue been fixed in the later versions? So this is an already known issue with 1.3.1 version? Just curious why bumping it might resolve the issue...
  2. Does it affect all the jobs or just some of them So far, we're monitoring the heap at a cluster level... It's a shared non-termination cluster. I'll try to take a look at a job level to get some more insights.
  3. Does it somehow correlate with amount of spark tasks a job is processing This was my initial thought too, but from looking at a few of the pipelines, they seem relatively straightforward logic wise. And I don't think it's because a lot of tasks are running in parallel causing the amount of allocated objects to be very high... (Let me check back on this)
  4. Any other details helping to reproduce this would be nice. Yes! Let me try to dig a little more, and try to get back with more details...
  5. FYI - this is my experience as discussed on Tuesday Hi @Damien Hawes may I check if there is anywhere I could get some more information on your observations? Since it seems related, maybe they're the same issues? But all in all, I ran a high level memory analyzer, and it seemed to look like a memory leak from the OL jar... We noticed the heap size from OL almost monotonically increasing to >600mb... I'll try to check and do a bit more analysis before getting back with more details. :gratitudethankyou:
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2024-04-02 00:52:32

*Thread Reply:* This is what the heap dump looks like after 45 mins btw... ~11gb from openlineage out of 14gb heap

❤️ Paweł Leszczyński, Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-04-02 03:34:50

*Thread Reply:* Nice. That's slightly different to my experience. We're running a streaming pipeline on a conventional Spark cluster (not databricks).

Damien Hawes (damien.hawes@booking.com)
2024-04-03 04:56:13

*Thread Reply:* OK. I've found the bug. I will create an issue for it.

cc @Maciej Obuchowski @Paweł Leszczyński

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 04:59:49

*Thread Reply:* Great. I am also looking into unknown facet. I think this could be something like this -> https://github.com/OpenLineage/OpenLineage/pull/2557/files

Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:00:25

*Thread Reply:* Not quite.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:01:00

*Thread Reply:* The problem is that the UnknownEntryFacetListener accumulates state, even if the spark_unknown facet is disabled.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:01:41

*Thread Reply:* The problem is that the code eagerly calls UnknownEntryFacetListener#apply

🙌 Paweł Leszczyński
Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:01:54

*Thread Reply:* Without checking if the facet is disabled or not.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:02:17

*Thread Reply:* It only checks whether the facet is disabled or not, when it needs to add the details to the event.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 05:03:40

*Thread Reply:* Furthermore, even if the facet is enabled, it never clears its state.
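(A schematic of the pattern being described, in Python pseudocode rather than the actual Java source; method and field names are illustrative: state accumulates on every eager apply() call, the disabled check happens only at event-build time, and nothing ever clears the accumulated list.)
```
class UnknownEntryFacetListenerSketch:
    """Illustrative only - not the real OpenLineage Java class."""

    def __init__(self):
        self._visited_nodes = []  # grows with every logical plan node visited

    def apply(self, plan_node):
        # called eagerly for each node, even when the spark_unknown facet is disabled
        self._visited_nodes.append(plan_node)

    def build_facet(self, facet_disabled):
        if facet_disabled:
            return None  # the check happens only here, after state has already accumulated
        facet = {"unknown": list(self._visited_nodes)}
        # bug shape: self._visited_nodes is never cleared, so it persists across jobs
        return facet
```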

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 05:04:31

*Thread Reply:* yes, and if logical plan is spark.createDataFrame with local data, this can get huge

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:01:10

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2561

👍 Paweł Leszczyński
Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)
2024-04-03 06:20:51

*Thread Reply:* 🙇

Tom Linton (tom.linton@atlan.com)
2024-03-28 21:50:01

Hello All - I've begun my OL journey rather recently and am running into trouble getting lineage going in an airflow job. I spun up a quick flask server to accept and print the OL requests. It appears that there are no Inputs or Outputs. Is that something I have to set in my DAG? Reference code and responses are attached.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 03:38:18

*Thread Reply:* hook-level lineage is not yet supported, you should use SnowflakeOperator instead

Tom Linton (tom.linton@atlan.com)
2024-03-29 08:53:29

*Thread Reply:* Thanks @Jakub Dardziński! I used the hook because it looks like that is the supported operator based on airflow docs

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 09:20:10

*Thread Reply:* you can see this is under SQLExecuteQueryOperator; without going into the details, part of the implementation is on the hook's side there, not the operator

Vinnakota Priyatam Sai (vinnakota.priyatam@walmart.com)
2024-03-29 00:14:17

Hi team, we are collecting OpenLineage events across different jobs where the output datasources are BQ, Cassandra and Postgres. We are mostly interested in the freshness of columns across these different datasources. Using OpenLineage COMPLETE event's dataset.datasource and dataset.schema we want to understand which columns are updated at what time.

We have a few questions related to BQ (as output dataset) events:

  1. How to identify if the output datasource is BQ, Cassandra or Postgres?
  2. Can we rely on dataset.datasource and dataset.schema for BQ table name and column names?
  3. Even if one column is updated, do we get all the column details in dataset.schema?
  4. If dataset.datasource or dataset.schema value is null, can we assume that no column has been updated in that event?
  5. Are there any sample BQ events that we can refer to understand the events?
  6. Is it possible to get columnLineage details for BQ as output datasource?
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-29 10:11:28

*Thread Reply:* > 1. How to identify if the output datasource is BQ, Cassandra or Postgres? The dataset namespace would contain that information: for example, the namespace for BQ would simply be bigquery, and for Postgres it would be postgres://{host}:{port}
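(A sketch of how a consumer might branch on that convention; the prefixes follow the OpenLineage dataset naming doc, so double-check them against the spec for your versions:)
```
def datasource_kind(dataset_namespace: str) -> str:
    # prefixes per the OpenLineage dataset naming conventions (assumed, verify against the spec)
    if dataset_namespace == "bigquery":
        return "bigquery"
    if dataset_namespace.startswith("postgres://"):
        return "postgres"
    if dataset_namespace.startswith("cassandra://"):
        return "cassandra"
    return "unknown"
```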

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-29 10:15:06

*Thread Reply:* > 1. Can we rely on dataset.datasource and dataset.schema for BQ table name and column names? > 2. Even if one column is updated, do we get all the column details in dataset.schema? > 3. If dataset.datasource or dataset.schema value is null, can we assume that no column has been updated in that event? If talking about BigQuery Airflow operators, the known issue is BigQuery query caching. You're guaranteed to get this information if the query is running for the first time, but if the query is just reading from the cache instead of being executed, we don't get that information. That would result in a run without actual input dataset data.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-29 10:15:56

*Thread Reply:* > 1. Is it possible to get columnLineage details for BQ as output datasource? The BigQuery API does not give us this information yet - we could augment the API data with the SQL parser, though. That's a feature that doesn't exist yet.

Vinnakota Priyatam Sai (vinnakota.priyatam@walmart.com)
2024-03-29 10:18:32

*Thread Reply:* This is very helpful, thanks a lot @Maciej Obuchowski

Mark Dunphy (markd@spotify.com)
2024-03-29 11:54:02

Hi all, we are trying to use dbt-ol to capture lineage. We use dbt custom aliases based on the --target flag passed in to dbt-ol run. So for example if using --target dev the model alias might be some_prefix__model_a whereas with --target prod the model alias might be model_a without any prefix. OpenLineage doesn't seem to pick up on this custom alias and sends model_a regardless in the input/output. Is this intended? I'm relatively new to this data world so it is possible I'm missing something basic here.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-29 15:52:17

*Thread Reply:* Welcome and thanks for using OpenLineage! Someone with dbt expertise will reply soon.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 18:21:35

*Thread Reply:* looks like it’s another entry in manifest.json : https://schemas.getdbt.com/dbt/manifest/v10.json

called alias that is not taken into consideration

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 18:22:24

*Thread Reply:* it needs more analysis whether and how this entry is set

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 18:30:06

*Thread Reply:* btw how do you create alias per target? I did this:

-- Use the `ref` function to select from other models
{% if target.name != 'prod' %}
{{ config(materialized='incremental',unique_key='id',
        on_schema_change='sync_all_columns', alias='third_model_dev'
) }}
{% else %}
{{ config(materialized='incremental',unique_key='id',
        on_schema_change='sync_all_columns', alias='third_model_prod'
) }}
{% endif %}

select x.id, lower(y.name)
from {{ ref('my_first_dbt_model') }} as x
left join {{ ref('my_second_dbt_model' )}} as y
ON x.id = y.i

but I’m curious if that’s correct scenario to test

Mark Dunphy (markd@spotify.com)
2024-04-01 09:31:26

*Thread Reply:* thanks for looking into this @Jakub Dardziński! we are using the generate_alias_name macro to control this. our macro looks very similar to this example

Tom Linton (tom.linton@atlan.com)
2024-03-29 12:37:48

Is it possible to configure OL to only send OL Events for certain dags in airflow?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 14:22:30

*Thread Reply:* it will be possible once latest version of OL provider is released with this PR: https://github.com/apache/airflow/pull/37725

✅ Tom Linton
Tom Linton (tom.linton@atlan.com)
2024-03-29 16:09:16

*Thread Reply:* Thanks!

Tom Linton (tom.linton@atlan.com)
2024-03-29 13:10:52

Is it common to see this error?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 17:32:07

*Thread Reply:* seems like trim in select statements causes issues

✅ Tom Linton
Michael Robinson (michael.robinson@astronomer.io)
2024-04-01 10:04:45

@channel I'd like to open a vote to release OpenLineage 1.11.0, including:
• Spark: lineage metadata extraction built-in to Spark extensions
• Spark: change SparkPropertyFacetBuilder to support recording Spark runtime config
• Java client: add metrics-gathering mechanism
• Flink: support Flink 1.19.0
• SQL: show error message when OpenLineageSql cannot find native library
Three +1s from committers will authorize. Thanks!

➕ Harel Shein, Rodrigo Maia, Jakub Dardziński, alexandre bergere, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-04-04 09:44:38

*Thread Reply:* Thanks, all. The release is authorized and will be performed within 2 business days excluding tomorrow.

Michael Robinson (michael.robinson@astronomer.io)
2024-04-01 16:13:24

@channel The latest issue of OpenLineage News is available now, featuring a rundown of upcoming and recent events, recent releases, updates to the Airflow Provider, open proposals, and more. To get the newsletter directly in your inbox each month, sign up here.
Pooja K M (pooja.km@philips.com)
2024-04-02 06:01:39

Hi All, we are trying to transform entities according to the medallion model, where each entity goes through multiple layers of data transformation. The workflow is: data is picked from a Kafka channel and stored as Parquet, then transformed into Hudi tables in the silver layer. Now we are trying to capture the lineage data. So far we have tried transport type console, but we are not seeing the lineage data in the console (we are running this job from AWS Glue). Below is the configuration we have added:
spark = (SparkSession.builder
    .appName('samplelineage')
    .config('spark.jars.packages', 'io.openlineage:openlineagespark:1.8.0')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.namespace', 'LineagePortTest')
    .config('spark.openlineage.parentJobNamespace', 'LineageJobNameSpace')
    .config('spark.openlineage.transport.type', 'console')
    .config('spark.openlineage.parentJobName', 'LineageJobName')
    .getOrCreate())

Damien Hawes (damien.hawes@booking.com)
2024-04-02 07:24:13

*Thread Reply:* Does Spark tell your during startup that it is adding the listener?

The log line should be something like "Adding io.openlineage.spark.agent.OpenLineageSparkListener"

Damien Hawes (damien.hawes@booking.com)
2024-04-02 07:24:58

*Thread Reply:* Additionally, ensure your log4j.properties / log4j2.properties (depending on the version of Spark that you are using) allows io.openlineage at info level

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 08:04:16

*Thread Reply:* I think, as usual, hudi is the problem 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 08:04:35

*Thread Reply:* or are you just not seeing any OL logs/events?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 08:05:31

*Thread Reply:* as @Damien Hawes said, you should see Spark log org.apache.spark.SparkContext - Registered listener io.openlineage.spark.agent.OpenLineageSparkListener

Pooja K M (pooja.km@philips.com)
2024-04-02 09:24:00

*Thread Reply:* yes I could see the mentioned logs in the console while job runs

Pooja K M (pooja.km@philips.com)
2024-04-02 09:30:17

*Thread Reply:* Also we are not seeing OL events

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 08:32:49

*Thread Reply:* do you see any errors or other logs that could be relevant to OpenLineage? also, some simple reproduction might help

Pooja K M (pooja.km@philips.com)
2024-04-03 09:06:18

*Thread Reply:* ya, we could see the below log: INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:04:07

Hi All! Im trying to set up OpenLineage with Managed Flink at AWS. but im getting this error:

`"throwableInformation": "io.openlineage.client.transports.HttpTransportResponseException: code: 400, response: \n\tat io.openlineage.client.transports.HttpTransport.throwOnHttpError(HttpTransport.java:151)\n\tat`

This is what I see in Marquez, where Flink is trying to send the OpenLineage events:

items "message":string"The Job Result cannot be fetch..." "_producer":string"<https://github.com/OpenLineage>..." "_schemaURL":string"<https://openlineage.io/spec/fa>..." "stackTrace":string"org.apache.flink.util.FlinkRuntimeException: The Job Result cannot be fetched through the Job Client when in Web Submission. at org.apache.flink.client.deployment.application.WebSubmissionJobClient.getJobExecutionResult(WebSubmissionJobClient.java:92) at

I'm passing the conf like this:

Properties props = new Properties();
props.put("openlineage.transport.type", "http");
props.put("openlineage.transport.url", "http://<marquez-ip>:5000/api/v1/lineage");
props.put("execution.attached", "true");
Configuration conf = ConfigurationUtils.createConfiguration(props);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

Harel Shein (harel.shein@gmail.com)
2024-04-02 14:26:12

*Thread Reply:* Hey @Francisco Morillo, which version of Marquez are you running? Streaming support was a relatively recent addition to Marquez

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:29:32

*Thread Reply:* So i was able to set it up working locally. Having Flink integrated with open lineage

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:29:43

*Thread Reply:* But once i deployed marquez in an ec2 using docker

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:30:16

*Thread Reply:* and have managed flink trying to emit events to openlineage i just receive the flink job event, but not the kafka source / iceberg sink

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:32:31

*Thread Reply:* I ran this: $ git clone git@github.com:MarquezProject/marquez.git && cd marquez

Harel Shein (harel.shein@gmail.com)
2024-04-02 14:50:41

*Thread Reply:* hmmm. I see. you're probably running the latest version of Marquez then, should be ok. did you try the console transport first to see what the events look like?

Harel Shein (harel.shein@gmail.com)
2024-04-02 14:51:10

*Thread Reply:* kafka source and iceberg sink should be well supported for flink

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 14:54:31

*Thread Reply:* i believe there is an issue with how the conf is passed to flink job in managed flink

Harel Shein (harel.shein@gmail.com)
2024-04-02 14:55:37

*Thread Reply:* ah, that may be the case. what are you seeing in the flink job logs?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 14:59:02

*Thread Reply:* I think setting execution.attached might not work when you set it this way

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 15:05:05

*Thread Reply:* is there an option to use regular flink-conf.yaml?

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 15:48:34

*Thread Reply:* in the Flink logs I'm seeing the io.openlineage.client.transports.HttpTransportResponseException: code: 400, response: \n\tat.

in Marquez I'm seeing "the Job Result cannot be fetched".

we can't modify flink-conf in Managed Flink

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 15:49:54

*Thread Reply:* this is what i see at marquez at ec2

Harel Shein (harel.shein@gmail.com)
2024-04-02 15:50:58

*Thread Reply:* hmmm.. I'm wondering if the issue is with Marquez processing the events or the openlineage events themselves. can you try with: props.put("openlineage.transport.type","console"); ?

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 15:51:08

*Thread Reply:* compared to what I see locally. Locally it's the same job, just writing to localhost Marquez, but I'm passing the OpenLineage conf through env vars

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 15:52:50

*Thread Reply:* @Harel Shein when set to console, where will the events be printed? CloudWatch logs?

Harel Shein (harel.shein@gmail.com)
2024-04-02 15:53:17

*Thread Reply:* I think so, yes

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 15:53:20

*Thread Reply:* let me try

Harel Shein (harel.shein@gmail.com)
2024-04-02 15:53:39

*Thread Reply:* the same place you're seeing your flink logs right now

Harel Shein (harel.shein@gmail.com)
2024-04-02 15:54:12

*Thread Reply:* the same place you found that client exception

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:09:34

*Thread Reply:* I will post the events

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:09:37

*Thread Reply:* "logger": "io.openlineage.flink.OpenLineageFlinkJobListener", "message": "onJobSubmitted event triggered for flink-jobs-prod.kafka-iceberg-prod", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:09:52

*Thread Reply:* "locationInformation": "io.openlineage.flink.TransformationUtils.processLegacySinkTransformation(TransformationUtils.java:90)", "logger": "io.openlineage.flink.TransformationUtils", "message": "Processing legacy sink operator Print to System.out", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:10:08

*Thread Reply:* "locationInformation": "io.openlineage.flink.TransformationUtils.processLegacySinkTransformation(TransformationUtils.java:90)", "logger": "io.openlineage.flink.TransformationUtils", "message": "Processing legacy sink operator org.apache.flink.streaming.api.functions.sink.DiscardingSink@68d0a141", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:10:46

*Thread Reply:* "locationInformation": "io.openlineage.client.transports.ConsoleTransport.emit(ConsoleTransport.java:21)", "logger": "io.openlineage.client.transports.ConsoleTransport", "message": "{\"eventTime\":\"2024_04_02T20:07:03.30108Z\",\"producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"schemaURL\":\"<https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent>\",\"eventType\":\"START\",\"run\":{\"runId\":\"cda9a0d2_6dfd_4db2_b3d0_f11d7b082dc0\"},\"job\":{\"namespace\":\"flink_jobs_prod\",\"name\":\"kafka-iceberg-prod\",\"facets\":{\"jobType\":{\"_producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"_schemaURL\":\"<https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json#/$defs/JobTypeJobFacet>\",\"processingType\":\"STREAMING\",\"integration\":\"FLINK\",\"jobType\":\"JOB\"}}},\"inputs\":[{\"namespace\":\"<kafka://b-1.mskflinkopenlineage>.&lt;&gt;.<http://kafka.us-east-1.amazonaws.com:9092,b_3.mskflinkopenlineage.&lt;&gt;kafka.us_east_1.amazonaws.com:9092,b-2.mskflinkopenlineage.&lt;&gt;.c22.kafka.us-east-1.amazonaws.com:9092\%22,\%22name\%22:\%22temperature-samples\%22,\%22facets\%22:{\%22schema\%22:{\%22_producer\%22:\%22&lt;https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink&gt;\%22,\%22_schemaURL\%22:\%22&lt;https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet&gt;\%22,\%22fields\%22:[{\%22name\%22:\%22sensorId\%22,\%22type\%22:\%22int\%22},{\%22name\%22:\%22room\%22,\%22type\%22:\%22string\%22},{\%22name\%22:\%22temperature\%22,\%22type\%22:\%22float\%22},{\%22name\%22:\%22sampleTime\%22,\%22type\%22:\%22long\%22}]}}|kafka.us_east_1.amazonaws.com:9092,b-3.mskflinkopenlineage.&lt;&gt;kafka.us-east-1.amazonaws.com:9092,b_2.mskflinkopenlineage.&lt;&gt;.c22.kafka.us_east_1.amazonaws.com:9092\",\"name\":\"temperature_samples\",\"facets\":{\"schema\":{\"_producer\":\"&lt;https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink&gt;\",\"_schemaURL\":\"&lt;https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet&gt;\",\"fields\":[{\"name\":\"sensorId\",\"type\":\"int\"},{\"name\":\"room\",\"type\":\"string\"},{\"name\":\"temperature\",\"type\":\"float\"},{\"name\":\"sampleTime\",\"type\":\"long\"}]}}>}],\"outputs\":[{\"namespace\":\"<s3://iceberg-open-lineage-891377161433>\",\"name\":\"/iceberg/open_lineage.db/open_lineage_room_temperature_prod\",\"facets\":{\"schema\":{\"_producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"_schemaURL\":\"<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>\",\"fields\":[{\"name\":\"room\",\"type\":\"STRING\"},{\"name\":\"temperature\",\"type\":\"FLOAT\"},{\"name\":\"sampleCount\",\"type\":\"INTEGER\"},{\"name\":\"lastSampleTime\",\"type\":\"TIMESTAMP\"}]}}}]}",

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:11:12

*Thread Reply:* locationInformation": "io.openlineage.flink.tracker.OpenLineageContinousJobTracker.startTracking(OpenLineageContinousJobTracker.java:100)", "logger": "io.openlineage.flink.tracker.OpenLineageContinousJobTracker", "message": "Starting tracking thread for jobId=de9e0d5b5d19437910975f231d5ed4b5", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:11:25

*Thread Reply:* "locationInformation": "io.openlineage.flink.OpenLineageFlinkJobListener.onJobExecuted(OpenLineageFlinkJobListener.java:191)", "logger": "io.openlineage.flink.OpenLineageFlinkJobListener", "message": "onJobExecuted event triggered for flink-jobs-prod.kafka-iceberg-prod", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:11:41

*Thread Reply:* "locationInformation": "io.openlineage.flink.tracker.OpenLineageContinousJobTracker.stopTracking(OpenLineageContinousJobTracker.java:120)", "logger": "io.openlineage.flink.tracker.OpenLineageContinousJobTracker", "message": "stop tracking", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:12:07

*Thread Reply:* "locationInformation": "io.openlineage.client.transports.ConsoleTransport.emit(ConsoleTransport.java:21)", "logger": "io.openlineage.client.transports.ConsoleTransport", "message": "{\"eventTime\":\"2024_04_02T20:07:04.028017Z\",\"producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"schemaURL\":\"<https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent>\",\"eventType\":\"FAIL\",\"run\":{\"runId\":\"cda9a0d2_6dfd_4db2_b3d0_f11d7b082dc0\",\"facets\":{\"errorMessage\":{\"_producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"_schemaURL\":\"<https://openlineage.io/spec/facets/1-0-0/ErrorMessageRunFacet.json#/$defs/ErrorMessageRunFacet>\",\"message\":\"The Job Result cannot be fetched through the Job Client when in Web Submission.\",\"programmingLanguage\":\"JAVA\",\"stackTrace\":\"org.apache.flink.util.FlinkRuntimeException: The Job Result cannot be fetched through the Job Client when in Web Submission.\\n\\tat org.apache.flink.client.deployment.application.WebSubmissionJobClient.getJobExecutionResult(WebSubmissionJobClient.java:92)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:152)\\n\\tat org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:123)\\n\\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1969)\\n\\tat com.amazonaws.services.msf.KafkaStreamingJob.main(KafkaStreamingJob.java:342)\\n\\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\\n\\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)\\n\\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)\\n\\tat org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:84)\\n\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:70)\\n\\tat org.apache.flink.runtime.webmonitor.handlers.JarRunOverrideHandler.lambda$handleRequest$3(JarRunOverrideHandler.java:239)\\n\\tat java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)\\n\\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\\n\\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\\n\\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\\n\\tat 
java.base/java.lang.Thread.run(Thread.java:829)\\n\"}}},\"job\":{\"namespace\":\"flink_jobs_prod\",\"name\":\"kafka-iceberg-prod\",\"facets\":{\"jobType\":{\"_producer\":\"<https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink>\",\"_schemaURL\":\"<https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json#/$defs/JobTypeJobFacet>\",\"processingType\":\"STREAMING\",\"integration\":\"FLINK\",\"jobType\":\"JOB\"}}}}", "messageSchemaVersion": "1", "messageType": "INFO", "threadName": "Flink-DispatcherRestEndpoint-thread-4" }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:15:35

*Thread Reply:* this is what I see in CloudWatch when set to console

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:17:50

*Thread Reply:* So it's nothing to do with Marquez, but with OpenLineage and Flink

Harel Shein (harel.shein@gmail.com)
2024-04-02 16:22:10

*Thread Reply:* hmmm.. the start event actually looks pretty good to me: { "eventTime": "2024-04-02T20:07:03.30108Z", "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent", "eventType": "START", "run": { "runId": "cda9a0d2-6dfd-4db2-b3d0-f11d7b082dc0" }, "job": { "namespace": "flink-jobs-prod", "name": "kafka-iceberg-prod", "facets": { "jobType": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "_schemaURL": "https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json#/$defs/JobTypeJobFacet", "processingType": "STREAMING", "integration": "FLINK", "jobType": "JOB" } } }, "inputs": [ { "namespace": "kafka://b-1.mskflinkopenlineage.<>.c22.kafka.us-east-1.amazonaws.com:9092,b-3.mskflinkopenlineage.<>.c22.kafka.us-east-1.amazonaws.com:9092,b-2.mskflinkopenlineage.<>.c22.kafka.us-east-1.amazonaws.com:9092", "name": "temperature-samples", "facets": { "schema": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "sensorId", "type": "int" }, { "name": "room", "type": "string" }, { "name": "temperature", "type": "float" }, { "name": "sampleTime", "type": "long" } ] } } } ], "outputs": [ { "namespace": "s3://iceberg-open-lineage-891377161433", "name": "/iceberg/open_lineage.db/open_lineage_room_temperature_prod", "facets": { "schema": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "name": "room", "type": "STRING" }, { "name": "temperature", "type": "FLOAT" }, { "name": "sampleCount", "type": "INTEGER" }, { "name": "lastSampleTime", "type": "TIMESTAMP" } ] } } } ] }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:22:37

*Thread Reply:* so with that START event, should Marquez be able to build the proper lineage?

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:22:57

*Thread Reply:* This is what I would get with Flink + Marquez locally

Harel Shein (harel.shein@gmail.com)
2024-04-02 16:23:33

*Thread Reply:* yes, but then it looks like the flink job is failing and we're seeing this event: { "eventTime": "2024-04-02T20:07:04.028017Z", "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent", "eventType": "FAIL", "run": { "runId": "cda9a0d2-6dfd-4db2-b3d0-f11d7b082dc0", "facets": { "errorMessage": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ErrorMessageRunFacet.json#/$defs/ErrorMessageRunFacet", "message": "The Job Result cannot be fetched through the Job Client when in Web Submission.", "programmingLanguage": "JAVA", "stackTrace": "org.apache.flink.util.FlinkRuntimeException: The Job Result cannot be fetched through the Job Client when in Web Submission.\n\tat org.apache.flink.client.deployment.application.WebSubmissionJobClient.getJobExecutionResult(WebSubmissionJobClient.java:92)\n\tat org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:152)\n\tat org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:123)\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1969)\n\tat com.amazonaws.services.msf.KafkaStreamingJob.main(KafkaStreamingJob.java:342)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)\n\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)\n\tat org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:84)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:70)\n\tat org.apache.flink.runtime.webmonitor.handlers.JarRunOverrideHandler.lambda$handleRequest$3(JarRunOverrideHandler.java:239)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n" } } }, "job": { "namespace": "flink-jobs-prod", "name": "kafka-iceberg-prod", "facets": { "jobType": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink", "_schemaURL": "https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json#/$defs/JobTypeJobFacet", "processingType": "STREAMING", "integration": "FLINK", "jobType": "JOB" } } } }

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:24:11

*Thread Reply:* But the thing is that the Flink job is not really failing

Harel Shein (harel.shein@gmail.com)
2024-04-02 16:25:03

*Thread Reply:* interesting, would love to see what @Paweł Leszczyński / @Maciej Obuchowski / @Peter Huang think. This is beyond my depth on the flink integration 🙂

Francisco Morillo (fmorillo@amazon.es)
2024-04-02 16:34:51

*Thread Reply:* Thanks Harel!! Yes please, it would be great to see how OpenLineage can work with AWS Managed Flink

➕ Harel Shein
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 02:43:12

*Thread Reply:* Just to clarify - is this setup working with the OpenLineage Flink integration turned off? From what I understand, your job emits a cool START event, then the job fails and emits a FAIL event with the error stacktrace The Job Result cannot be fetched through the Job Client when in Web Submission, which is cool as well.

The question is: does it fail because of the OpenLineage integration, or is it just OpenLineage carrying the stacktrace of a failed job? I couldn't see anything OpenLineage-related in the stacktrace.

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:43:34

*Thread Reply:* What do you mean by "Flink integration turned off"?

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:44:28

*Thread Reply:* the Flink job is not failing, but we are receiving an OpenLineage event that says FAIL, after which we don't see the proper DAG in Marquez

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:45:18

*Thread Reply:* does OpenLineage work if the job is submitted through web submission?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:47:44

*Thread Reply:* the answer is "probably not unless you can set up execution.attached beforehand"

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:48:49

*Thread Reply:* execution.attached doesn't seem to work with a job submitted through web submission.

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:54:51

*Thread Reply:* When setting execution.attached to false, I only get the START event, but it doesn't build the DAG in the job space in Marquez

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 09:57:40

*Thread Reply:* I still see this in cloudwatch logs: locationInformation": "io.openlineage.flink.client.EventEmitter.emit(EventEmitter.java:50)", "logger": "io.openlineage.flink.client.EventEmitter", "message": "Failed to emit OpenLineage event: ", "messageSchemaVersion": "1", "messageType": "ERROR", "threadName": "Flink-DispatcherRestEndpoint-thread-1", "throwableInformation": "io.openlineage.client.transports.HttpTransportResponseException: code: 400, response: \n\tat io.openlineage.client.transports.HttpTransport.throwOnHttpError(HttpTransport.java:151)\n\tat io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:128)\n\tat io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:115)\n\tat io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:60)\n\tat io.openlineage.flink.client.EventEmitter.emit(EventEmitter.java:48)\n\tat io.openlineage.flink.visitor.lifecycle.FlinkExecutionContext.lambda$onJobSubmitted$0(FlinkExecutionContext.java:66)\n\tat io.openlineage.client.circuitBreaker.NoOpCircuitBreaker.run(NoOpCircuitBreaker.java:27)\n\tat io.openlineage.flink.visitor.lifecycle.FlinkExecutionContext.onJobSubmitted(FlinkExecutionContext.java:59)\n\tat io.openlineage.flink.OpenLineageFlinkJobListener.start(OpenLineageFlinkJobListener.java:180)\n\tat io.openlineage.flink.OpenLineageFlinkJobListener.onJobSubmitted(OpenLineageFlinkJobListener.java:156)\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.lambda$executeAsync$12(StreamExecutionEnvironment.java:2099)\n\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1541)\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2099)\n\tat org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:188)\n\tat org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:119)\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1969)\n\tat com.amazonaws.services.msf.KafkaStreamingJob.main(KafkaStreamingJob.java:345)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)\n\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)\n\tat org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:84)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:70)\n\tat org.apache.flink.runtime.webmonitor.handlers.JarRunOverrideHandler.lambda$handleRequest$3(JarRunOverrideHandler.java:239)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 10:01:52

*Thread Reply:* I think it will be a limitation of our integration then, at least until https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener - the way we're integrating with Flink requires it to be able to access execution results https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/app/src/main/java/io/openlineage/flink/OpenLineageFlinkJobListener.java#L[…]6

not sure if we can somehow work around this
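For context, the dependency looks roughly like this (an illustrative sketch, not the actual listener code - see the linked OpenLineageFlinkJobListener for the real implementation):

```
import javax.annotation.Nullable;
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.core.execution.JobListener;

// A JobListener-based integration needs the JobClient to observe the job's lifecycle.
// Under web submission, jobClient is a WebSubmissionJobClient, whose
// getJobExecutionResult() throws FlinkRuntimeException by design, so the
// tracking below can never complete.
class LineageJobListener implements JobListener {
  @Override
  public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
    if (jobClient != null) {
      jobClient
          .getJobExecutionResult() // throws under web submission
          .thenAccept(result -> System.out.println("job finished: " + result));
    }
  }

  @Override
  public void onJobExecuted(
      @Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
    // under web submission this is invoked with a throwable instead of a result
  }
}
```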

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:04:09

*Thread Reply:* with that FLIP we wouldn't need execution.attached?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 10:04:58

*Thread Reply:* Nope - it would add a different mechanism to integrate with Flink, other than the JobListener

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:09:38

*Thread Reply:* Could a workaround be, instead of using the HTTP transport, sending to Kafka and having a Java/Python client write the events to Marquez?
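For reference, pointing the integration at Kafka instead of HTTP should be a config-only change, something like the sketch below (the topic name and broker address are placeholders; the `openlineage.transport.**` keys are the standard OpenLineage Java client Kafka transport config):

```
Properties props = new Properties();
props.put("openlineage.transport.type", "kafka");
props.put("openlineage.transport.topicName", "openlineage-events"); // placeholder topic
props.put("openlineage.transport.properties.bootstrap.servers", "<broker-1>:9092");
props.put("openlineage.transport.properties.key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("openlineage.transport.properties.value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
Configuration conf = ConfigurationUtils.createConfiguration(props);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
```

A separate consumer would then read the topic and POST each event to Marquez's /api/v1/lineage endpoint. Note this only changes where the events go; it would not fix the failing getJobExecutionResult call discussed above.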

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:10:30

*Thread Reply:* because I just tried with execution.attached set to false and with the console transport - I just receive the START event but no errors. Not sure if that's the only event needed in Marquez to build a DAG

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:16:42

*Thread Reply:* also, wondering: if the event actually reached Marquez, why wouldn't the job DAG be shown?

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:16:52

*Thread Reply:* it's the same START event I received when running locally

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:17:15
Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:25:47

*Thread Reply:* comparison of Marquez receiving the event from Managed Flink on AWS (left) to localhost Marquez receiving the event from local Flink. It's the same event; however, Marquez on EC2 is not building the DAG

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 10:26:14

*Thread Reply:* @Maciej Obuchowski is there any other event needed for the DAG?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 10:38:48

*Thread Reply:* > Could a workaround be, instead of having the http tranport, sending to kafka and have a java/python client writing the events to marquez? I think there are two problems, and the 400 is probably just the followup from the original one - maybe too long stacktrace makes Marquez reject the event? The original one, the attached one, is the cause why the integration tries to send the FAIL event at the first place

Peter Huang (huangzhenqiu0825@gmail.com)
2024-04-03 10:45:35

*Thread Reply:* For the error described in the message "The Job Result cannot be fetched through the Job Client when in Web Submission.", I feel it is a bug in Flink. Which version of Flink are you using? @Francisco Morillo

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:02:46

*Thread Reply:* looking at implementation, it seems to be by design:

```
/**
 * A {@link JobClient} that only allows asking for the job id of the job it is attached to.
 *
 * <p>This is used in web submission, where we do not want the Web UI to have jobs blocking threads
 * while waiting for their completion.
 */
```

Peter Huang (huangzhenqiu0825@gmail.com)
2024-04-03 11:32:51

*Thread Reply:* Yes, looks like the Flink code tries to fetch the Job Result for the web submission job, thus the exception is raised.

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:27:05

*Thread Reply:* Flink 1.15.2

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:28:00

*Thread Reply:* But still, wouldn't Marquez be able to build the DAG with the START event?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 12:28:50

*Thread Reply:* In Marquez, new dataset version is created when the run completes

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:29:14

*Thread Reply:* but that doesn't show as events in Marquez, right?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 12:29:33

*Thread Reply:* I think that was going to be changed for streaming jobs - right @Paweł Leszczyński? - but not sure if that's already merged

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:33:34

*Thread Reply:* in the latest Marquez version?

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:41:52

*Thread Reply:* is this the right transport url? `props.put("openlineage.transport.url","http://localhost:5000/api/v1/lineage");`

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 12:42:36

*Thread Reply:* because I was able to see streaming jobs in Marquez when running locally, as well as having a local Flink job writing to the Marquez on EC2. It's as if the dataset and job don't get created in Marquez from the event

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 13:05:28

*Thread Reply:* I tried with Flink 1.18 and it's the same: I receive the START event but the job and dataset are not created in Marquez

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 13:15:59

*Thread Reply:* If I try locally and set execution.attached to false it does work. So it seems that the main issue is that OpenLineage doesn't work with Flink job submission through the web UI

👀 Maciej Obuchowski
Peter Huang (huangzhenqiu0825@gmail.com)
2024-04-03 16:54:20

*Thread Reply:* From my understanding until now, setting execution.attached = false mitigates the exception in Flink (at least from the Flink code, that is the logic). On the other hand, the question is when to build the DAG upon receiving events. @Paweł Leszczyński From our org, we changed the default behavior: the Flink listener periodically sends RUNNING events out. Once the lineage backend receives a RUNNING event, a new DAG is created.
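For illustration, emitting such a RUNNING heartbeat with the plain OpenLineage Java client looks roughly like this (a sketch: the run ID, namespace and job name are taken from the events above, and the scheduling around it is up to the integration):

```
import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpTransport;
import java.net.URI;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.UUID;

OpenLineageClient client = OpenLineageClient.builder()
    .transport(HttpTransport.builder().uri("http://<marquez-ip>:5000").build())
    .build();

OpenLineage ol = new OpenLineage(
    URI.create("https://github.com/OpenLineage/OpenLineage/tree/1.10.2/integration/flink"));

// one heartbeat; a tracker thread would emit this periodically while the job runs
OpenLineage.RunEvent running = ol.newRunEventBuilder()
    .eventType(OpenLineage.RunEvent.EventType.RUNNING)
    .eventTime(ZonedDateTime.now(ZoneOffset.UTC))
    .run(ol.newRunBuilder()
        .runId(UUID.fromString("cda9a0d2-6dfd-4db2-b3d0-f11d7b082dc0"))
        .build())
    .job(ol.newJobBuilder().namespace("flink-jobs-prod").name("kafka-iceberg-prod").build())
    .build();
client.emit(running);
```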

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 17:00:26

*Thread Reply:* How can I configure that?

Peter Huang (huangzhenqiu0825@gmail.com)
2024-04-03 17:02:00

*Thread Reply:* To send periodic RUNNING events, some changes are needed in the OpenLineage Flink lib. Let's wait for @Paweł Leszczyński for a concrete plan. I am glad to create a PR for this.

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 17:05:28

*Thread Reply:* I'm still wondering why the DAG was not created in Marquez - unless there are some other events that OpenLineage sends to build the job and dataset that don't work when the job is submitted through the web UI. I will try to replicate in EMR

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 20:37:03

*Thread Reply:* Looking at the Marquez logs, I'm seeing this:

marquez.api.OpenLineageResource: Unexpected error while processing request ! java.lang.IllegalArgumentException: namespace 'kafka://b-1.mskflinkopenlineage.fdz2z7.c22.kafka.us-east-1.amazonaws.com:9092,b-3.mskflinkopenlineage.fdz2z7.c22.kafka.us-east-1.amazonaws.com:9092,b-2.mskflinkopenlineage.fdz2z7.c22.kafka.us-east-1.amazonaws.com:9092' must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), at (@), plus (+), dashes (-), colons (:), equals (=), semicolons (;), slashes (/) or dots (.) with a maximum length of 1024 characters.

Francisco Morillo (fmorillo@amazon.es)
2024-04-03 20:37:38

*Thread Reply:* can Marquez work with MSK?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-04 02:43:06

*Thread Reply:* The graph on the Marquez side should be present just after sending the START event, since the START contains information about the input/output datasets. Commas are the problem here, and we should modify the Flink integration to separate the broker list with semicolons.

✅ Francisco Morillo
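Concretely, a namespace like `kafka://b-1...:9092;b-2...:9092;b-3...:9092` would pass Marquez's validation: per the error message above, semicolons are in the allowed character set while commas are not.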
Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-03 05:50:05

Hi all, I've opened a PR for the dbt-ol script. We've noticed that the script doesn't transparently return/exit with the exit code of the child dbt process. This makes it hard for the parent process to tell if the underlying workflow succeeded or failed - in the case of Airflow, the parent DAG will mark the job as succeeded even if it actually failed. Let me know if you have thoughts/comments (cc @Arnab Bhattacharyya)

❤️ Harel Shein
Tristan GUEZENNEC -CROIX- (tristan.guezennec@decathlon.com)
2024-04-04 04:41:36

*Thread Reply:* @Sophie LY FYI

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-03 06:33:34

Is there a timeline for the 1.11.0 release? Now that the dbt-ol fix has been merged we may either wait for the release or temporarily point to main

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 06:34:09

*Thread Reply:* I think it’s going to be today or really soon. cc: @Michael Robinson

🎉 Fabio Manganiello
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:37:45

*Thread Reply:* would be great if we could fix the unknown facet memory issue in this release, I think @Paweł Leszczyński @Damien Hawes are working on it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:38:02

*Thread Reply:* I think this is a critical kind of bug

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:39:27

*Thread Reply:* Yeah, it's a tough-to-figure-out-where-the-fix-should-be kind of bug.

😨 Jakub Dardziński
Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:39:56

*Thread Reply:* The solution is simple, at least in my mind. If spark_unknown is disabled, don't accumulate state.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:40:11

*Thread Reply:* I think we should go first with the unknown entry facet, as it has the bigger impact

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:40:12

*Thread Reply:* if there's no better fast idea, just disable that facet for now?

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:40:26

*Thread Reply:* It doesn't matter if the facet is disabled or not

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:40:38

*Thread Reply:* The UnknownEntryFacetListener still accumulates state

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:40:48

*Thread Reply:* @Damien Hawes will you be able to prepare this today/tomorrow?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:40:58

*Thread Reply:* disable == comment/remove code related to it, together with UnknownEntryFacetListener 🙂

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:40:59

*Thread Reply:* I'm working on it today

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:41:01

*Thread Reply:* in this case 🙂

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:41:31

*Thread Reply:* You're proposing to rip the code out completely?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:42:02

*Thread Reply:* at least for this release - I think it's better to release code without it and without memory bug, rather than having it bugged as it is

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:42:06

*Thread Reply:* The only place where I see it being applied is here:

```
private <L extends LogicalPlan> QueryPlanVisitor<L, D> asQueryPlanVisitor(T event) {
  AbstractQueryPlanDatasetBuilder<T, P, D> builder = this;
  return new QueryPlanVisitor<L, D>(context) {
    @Override
    public boolean isDefinedAt(LogicalPlan x) {
      return builder.isDefinedAt(event) && isDefinedAtLogicalPlan(x);
    }

    @Override
    public List<D> apply(LogicalPlan x) {
      unknownEntryFacetListener.accept(x);
      return builder.apply(event, (P) x);
    }
  };
}
```

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:42:11

*Thread Reply:* come on, this should be a few lines of change

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:42:17

*Thread Reply:* Inside: AbstractQueryPlanDatasetBuilder

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:42:21

*Thread Reply:* once we know what it is

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:42:32

*Thread Reply:* it's useful in some narrow debug cases, but the memory bug potentially impacts all

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:43:15

*Thread Reply:*

```
openLineageContext
    .getQueryExecution()
    .filter(qe -> !FacetUtils.isFacetDisabled(openLineageContext, "spark_unknown"))
    .flatMap(qe -> unknownEntryFacetListener.build(qe.optimizedPlan()))
    .ifPresent(facet -> runFacetsBuilder.put("spark_unknown", facet));
```

this should always clean the listener

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:43:19

*Thread Reply:* @Paweł Leszczyński - every time AbstractQueryPlanDatasetBuilder#apply is called, the UnknownEntryFacetListener is invoked

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:43:38

*Thread Reply:* the code is within OpenLineageRunEventBuilder

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:43:50

*Thread Reply:* @Paweł Leszczyński - it will only clean the listener, if spark_unknown is enabled

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:43:56

*Thread Reply:* because of that filter step

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:44:11

*Thread Reply:* but the listener still accumulates state, regardless of that snippet you shared

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:44:12

*Thread Reply:* yes, and we need to modify it to always clean

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:45:45

*Thread Reply:* We have a difference in understanding here, I think.

  1. If spark_unknown is disabled, the UnknownEntryFacetListener still accumulates state. Your proposed change will not clean that state.
  2. If spark_unknown is enabled, well, sometimes we get StackOverflow errors due to infinite recursion during serialisation.
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:46:35

*Thread Reply:* just to get a bit away from the particular solution: I would love if we could release with either

  1. a proper fix that won't accumulate memory if the facet is disabled, and cleans it up if it's not, or
  2. that facet removed for now.

I don't want to have a release now that will contain this bug, because we're trying to do a "good" solution but have no time to do it properly for the release
👍 Damien Hawes
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:46:57

*Thread Reply:* I think the impact of this bug is big

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:47:24

*Thread Reply:* My opinion is that perhaps the OpenLineageContext object needs to be extended to hold which facets are enabled / disabled.

➕ Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:47:52

*Thread Reply:* This way, things that inherit from AbstractQueryPlanDatasetBuilder can check, should they be a no-op or not

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:48:36

*Thread Reply:* Or, this needs to be changed:

```
private <L extends LogicalPlan> QueryPlanVisitor<L, D> asQueryPlanVisitor(T event) {
  AbstractQueryPlanDatasetBuilder<T, P, D> builder = this;
  return new QueryPlanVisitor<L, D>(context) {
    @Override
    public boolean isDefinedAt(LogicalPlan x) {
      return builder.isDefinedAt(event) && isDefinedAtLogicalPlan(x);
    }

    @Override
    public List<D> apply(LogicalPlan x) {
      unknownEntryFacetListener.accept(x);
      return builder.apply(event, (P) x);
    }
  };
}
```
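One possible shape of that change, as a sketch (reusing the `FacetUtils.isFacetDisabled` check quoted above; `context` here is assumed to be the surrounding OpenLineageContext):

```
@Override
public List<D> apply(LogicalPlan x) {
  // only accumulate state if the spark_unknown facet will actually be emitted
  if (!FacetUtils.isFacetDisabled(context, "spark_unknown")) {
    unknownEntryFacetListener.accept(x);
  }
  return builder.apply(event, (P) x);
}
```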

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:48:40

*Thread Reply:* @Damien Hawes could you look at this again: https://github.com/OpenLineage/OpenLineage/pull/2557/files ?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:49:27

*Thread Reply:* I think clearing visitedNodes within populateRun should solve this

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:51:01

*Thread Reply:* the solution is (1) don't store logical plans, but their string representation, and (2) clear what you collected after populating the facet

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:51:18

*Thread Reply:* even if it works, I still don't really like it because we accumulate state in asQueryPlanVisitor just to clear it later

Damien Hawes (damien.hawes@booking.com)
2024-04-03 06:51:19

*Thread Reply:* It works, but I'm still annoyed that UnknownEntryFacetListener is being called in the first place

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:51:46

*Thread Reply:* also i think in case of really large plans it could be an issue still?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 06:53:06

*Thread Reply:* why @Maciej Obuchowski?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:55:47

*Thread Reply:* we've seen >20MB serialized logical plans, and that's what essentially treeString does if I understand it correctly

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 06:56:56

*Thread Reply:* and then the serialization can potentially still take some time...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 07:01:19

*Thread Reply:* where did you find that treeString serializes a plan?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 07:05:44

*Thread Reply:* treeString is used by the default toString method of TreeNode, so it would be super weird if they serialized the entire object within it. I couldn't find any such code within the Spark implementation

Damien Hawes (damien.hawes@booking.com)
2024-04-03 07:19:02

*Thread Reply:* I'll also remind you that there is the problem with the job metrics holder as well

Damien Hawes (damien.hawes@booking.com)
2024-04-03 07:19:17

*Thread Reply:* That will also, eventually, cause an OOM crash

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 07:27:41

*Thread Reply:* So, I agree the UnknownEntryFacetListener code should not be called if the facet is disabled. I agree we should have another PR and a fix for job metrics.

The question is: what do we want to have shipped within the next release? Do we want to get rid of the static member that accumulates all the logical plans (which is the cleaner approach) or just clear it once it's not needed anymore? I think we'll need to clear it anyway in case someone turns the unknown facet feature on.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 07:39:09

*Thread Reply:* In my opinion, the approach for the immediate release is to clear the plans. Though, I'd like tests that prove it works.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 08:10:02

*Thread Reply:* @Damien Hawes so let's go with Paweł's PR?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 08:24:04

*Thread Reply:* So, proving this helps would be great. One option would be to prepare an integration test that runs something and verifies later on that the private static map is empty. Another, way nicer, would be to write code that generates a few-MB dataset, reads it into memory and saves it into a file, and then within the integration test runs something like https://github.com/jerolba/jmnemohistosyne to see the memory consumption of the classes we're interested in (not sure how difficult such a thing is to write).

This could also be beneficial to prevent similar issues in the future and to solve the job metrics issue.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:02:37

*Thread Reply:* @Damien Hawes @Paweł Leszczyński would be great to clarify if you're working on it now

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:02:43

*Thread Reply:* as this blocks release

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:02:47

*Thread Reply:* fyi @Michael Robinson

👍 Michael Robinson
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-03 09:48:12

*Thread Reply:* I can try to prove that the PR I proposed brings improvement. However, if Damien wants to work on his approach targeting this release, I am happy to hand it over.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 10:20:24

*Thread Reply:* I'm not working on it at the moment. I think Pawel's approach is fine for the time being.

Damien Hawes (damien.hawes@booking.com)
2024-04-03 10:20:31

*Thread Reply:* I'll focus on the JobMetricsHolder problem

Damien Hawes (damien.hawes@booking.com)
2024-04-03 10:24:54

*Thread Reply:* Side note: @Paweł Leszczyński @Maciej Obuchowski - are you able to give any guidance why the UnknownEntryFacetListener was implemented that way, as opposed to just examining the event in a stateless manner?

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:18:28

*Thread Reply:* OK. @Paweł Leszczyński @Maciej Obuchowski - I think I found the memory leak with JobMetricsHolder. If we receive an event like SparkListenerJobStart, but there isn't any dataset in it, it looks like we're storing the metrics, but we never get rid of them.

😬 Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:21:50

*Thread Reply:* Here's the logs

🙌 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:36:50

*Thread Reply:* > Side note: @Paweł Leszczyński @Maciej Obuchowski - are you able to give any guidance why the UnknownEntryFacetListener was implemented that way, as opposed to just examining the event in a stateless manner? It's one of the older parts of codebase, implemented mostly in 2021 by person no longer associated with the project... hard to tell to be honest 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:37:52

*Thread Reply:* but I think we have much more freedom to modify it, as it's not a standardized or user-facing feature

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:47:02

*Thread Reply:* to solve the stageMetrics issue - should they always be in a separate Map per job, associated with the jobId, allowing it to be easily cleaned... but there's no jobId on SparkListenerTaskEnd

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:47:16

*Thread Reply:* Nah

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:47:17

*Thread Reply:* Actually

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:47:21

*Thread Reply:* Its simpler than that

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:47:36

*Thread Reply:* The bug is here:

```
public void cleanUp(int jobId) {
  Set<Integer> stages = jobStages.remove(jobId);
  stages = stages == null ? Collections.emptySet() : stages;
  stages.forEach(jobStages::remove);
}
```

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:47:51

*Thread Reply:* We remove from jobStages N + 1 times

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:48:14

*Thread Reply:* JobStages is supposed to carry a mapping from Job -> Stage

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:48:30

*Thread Reply:* and stageMetrics a mapping from Stage -> TaskMetrics

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:49:00

*Thread Reply:* ah yes

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:49:03

*Thread Reply:* Here, we remove the job from jobStages, and obtain the associated stages, and then we use those stages to remove from jobStages again

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:49:11

*Thread Reply:* It's a "huh?" moment

😂 Jakub Dardziński
Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:49:53

*Thread Reply:* The amount of logging I added, just to see this, was crazy

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:50:46

*Thread Reply:*

```
public void cleanUp(int jobId) {
  Set<Integer> stages = jobStages.remove(jobId);
  stages = stages == null ? Collections.emptySet() : stages;
  stages.forEach(stageMetrics::remove);
}
```

so it's just jobStages -> stageMetrics here, right?

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:50:57

*Thread Reply:* Yup

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:51:09

*Thread Reply:* yeah it looks so obvious after seeing that 😄

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:51:40

*Thread Reply:* I even wrote a separate method to clear the stageMetrics map

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:51:41

*Thread Reply:* it was there since 2021 in that form 🙂

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:52:00

*Thread Reply:* and placed it in the same locations as the cleanUp method in the OpenLineageSparkListener

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:52:09

*Thread Reply:* Wrote a unit test

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:52:12

*Thread Reply:* It fails

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:52:17

*Thread Reply:* and I was like, "why?"

Damien Hawes (damien.hawes@booking.com)
2024-04-03 11:52:25

*Thread Reply:* Investigate further, and then I noticed this method

😄 Maciej Obuchowski
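A sketch of the kind of test that catches this (the mutator names here are hypothetical and may differ from the real JobMetricsHolder API; the private field name `stageMetrics` is the one discussed above):

```
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.mock;

import java.lang.reflect.Field;
import java.util.Collections;
import java.util.Map;
import org.apache.spark.executor.TaskMetrics;
import org.junit.jupiter.api.Test;

class JobMetricsHolderTest {
  @Test
  void cleanUpRemovesStageMetricsForTheJob() throws Exception {
    JobMetricsHolder holder = new JobMetricsHolder();
    holder.addJobStages(1, Collections.singleton(100)); // hypothetical mutators
    holder.addMetrics(100, mock(TaskMetrics.class));

    holder.cleanUp(1);

    // peek at the private map; with the buggy cleanUp it still contains stage 100
    Field field = JobMetricsHolder.class.getDeclaredField("stageMetrics");
    field.setAccessible(true);
    assertThat((Map<?, ?>) field.get(holder)).isEmpty();
  }
}
```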
Damien Hawes (damien.hawes@booking.com)
2024-04-03 12:33:42
Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 14:39:06

*Thread Reply:* Has Damien's PR unblocked the release?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 14:39:33

*Thread Reply:* No, we need one more from Paweł

:gratitude_thank_you: Michael Robinson
Damien Hawes (damien.hawes@booking.com)
2024-04-04 10:37:42

*Thread Reply:* OK. Pawel's PR has been merged @Michael Robinson

👍 Michael Robinson
Damien Hawes (damien.hawes@booking.com)
2024-04-04 12:12:28

*Thread Reply:* Given these developments, I'd like to call for a release of 1.11.0 to happen today, unless there are any objections.

➕ Harel Shein, Jakub Dardziński
👀 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-04-04 12:28:38

*Thread Reply:* Changelog PR is RFR: https://github.com/OpenLineage/OpenLineage/pull/2574

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 14:29:04

*Thread Reply:* CircleCI has problems

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:12:27

*Thread Reply:* ```
self = <tests.conftest.DagsterRunLatestProvider object at 0x7fcd84faed60>
repository_name = 'testrepo'

    def get_instance(self, repository_name: str) -> DagsterRun:
>       from dagster.core.remote_representation.origin import (
            ExternalJobOrigin,
            ExternalRepositoryOrigin,
            InProcessCodeLocationOrigin,
        )
E       ImportError: cannot import name 'ExternalJobOrigin' from 'dagster.core.remote_representation.origin' (/home/circleci/.pyenv/versions/3.8.19/lib/python3.8/site-packages/dagster/core/remote_representation/origin.py)

tests/conftest.py:140: ImportError
```

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:12:39

*Thread Reply:* ```
>>> from dagster.core.remote_representation.origin import (
...     ExternalJobOrigin,
...     ExternalRepositoryOrigin,
...     InProcessCodeLocationOrigin,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1138, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1078, in _find_spec
  File "/home/blacklight/git_tree/OpenLineage/venv/lib/python3.11/site-packages/dagster/_module_alias_map.py", line 36, in find_spec
    assert base_spec, f"Could not find module spec for {base_name}."
AssertionError: Could not find module spec for dagster._core.remote_representation.
>>> from dagster.core.host_representation.origin import (
...     ExternalJobOrigin,
...     ExternalRepositoryOrigin,
...     InProcessCodeLocationOrigin,
... )
>>> ExternalJobOrigin
<class 'dagster._core.host_representation.origin.ExternalJobOrigin'>
```

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:13:07

*Thread Reply:* It seems that the parent module should be dagster.core.host_representation.origin, not dagster.core.remote_representation.origin

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:14:55

*Thread Reply:* did you rebase? for >=1.6.9 it’s dagster.core.remote_representation.origin, should be ok

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:18:06

*Thread Reply:* Indeed, I was just looking at https://github.com/dagster-io/dagster/pull/20323 (merged 4 weeks ago)

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:18:43

*Thread Reply:* I did a pip install of the integration from main and it seems to install a previous version though:

```
>>> dagster.__version__
'1.6.5'
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:18:59

*Thread Reply:* try --force-reinstall maybe

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:19:08

*Thread Reply:* it works fine for me, CI doesn’t crash either

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:20:09
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:20:53

*Thread Reply:* huh, how didn’t I see this

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:21:30

*Thread Reply:* I think we should limit the upper version of dagster, it's not even really maintained

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:28:14

*Thread Reply:* I've also just noticed that ExternalJobOrigin and ExternalRepositoryOrigin have been renamed to RemoteJobOrigin and RemoteRepositoryOrigin in 1.7.0 - and that's apparently the version the CI installed

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:28:32

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2579

👍 Fabio Manganiello
Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 07:24:26

Hey 👋 When I am running TrinoOperator on Airflow 2.7 I am getting this: [2024-04-03, 11:10:44 UTC] {base.py:162} WARNING - OpenLineage provider method failed to extract data from provider. [2024-04-03, 11:10:44 UTC] {manager.py:276} WARNING - Extractor returns non-valid metadata: None I've upgraded apache-airflow-providers-openlineage to 1.6.0 (maybe it is too new for Airflow 2.7?). Due to the warning I am ending up with empty input/output facets... It seems that it is not capable of connecting to Trino and extracting the table structure. When I tried on our prod Airflow version (2.6.3) with openlineage-airflow, it was capable of connecting and extracting the table structure, but not of doing the column-level lineage mapping.

Any input would be very helpful. Thanks

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 07:28:29

*Thread Reply:* Tried with the default version of the OL provider that comes with Airflow 2.7 (1.0.1), and the result was the same

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 07:31:55

*Thread Reply:* Could you please enable DEBUG logs in Airflow and provide them?

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 07:42:14

*Thread Reply:*

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 07:50:30

*Thread Reply:* thanks, it seems like only the beginning of the logs. I'm assuming it fails on the COMPLETE event

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 07:56:00

*Thread Reply:* I am sorry! This is the full log

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 08:00:03

*Thread Reply:* What I also just realised is that we have our own TrinoOperator implementation, which inherits from SQLExecuteQueryOperator (same as the original TrinoOperator)... So maybe inlets and outlets aren't being set due to that

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 08:00:52

*Thread Reply:* yeah, it could interfere

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 08:01:04

*Thread Reply:* But the task was rather simple:

```
create_table_apps_log_test = TrinoOperator(
    task_id=f"create_table_test",
    sql="""
        CREATE TABLE if not exists mytable as
        SELECT app_id, msid, instance_id from table limit 1
    """,
)
```

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 08:01:26

*Thread Reply:* do you use some other hook to connect to Trino?

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 08:03:12

*Thread Reply:* Just checked. So we have our own hook to connect to Trino... that inherits from TrinoHook 🙄

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 08:06:05

*Thread Reply:* hard to say; you could check https://github.com/apache/airflow/blob/main/airflow/providers/trino/hooks/trino.py#L252 to see how the integration collects basic information on how to retrieve the connection

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 08:10:24

*Thread Reply:* Just thinking about why it worked with Airflow 2.6.3 and the openlineage-airflow package - it seems that it was accessing Trino differently

Mantas Mykolaitis (mantasmy@wix.com)
2024-04-03 08:10:40

*Thread Reply:* But anyways, will try to look more into it. Thanks for tips!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 08:12:13

*Thread Reply:* please let me know your findings, it might be some bug introduced in provider package

👍 Mantas Mykolaitis
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 08:29:07

Looking for some help with Spark and the "UNCLASSIFIED_ERROR; An error occurred while calling o110.load. Cannot call methods on a stopped SparkContext." We are not getting any OpenLineage data in CloudWatch nor in sparkHistoryLogs. (more details in thread - should I be making this into a GitHub issue instead?)

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 08:29:29

*Thread Reply:* The python code:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job

conf = SparkConf()
conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .set("spark.jars.packages", "io.openlineage:openlineage-spark:1.10.2") \
    .set("spark.openlineage.version", "v1") \
    .set("spark.openlineage.namespace", "OL_EXAMPLE_DN") \
    .set("spark.openlineage.transport.type", "console")

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.read.format("csv").option("header", "true").load("<s3-folder-path>")
df.write.format("csv").option("header", "true").save("<s3-folder-path>", mode='overwrite')
job.commit()
```

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 08:29:32

*Thread Reply:* Nothing appears in cloudwatch, or in the sparkHistoryLogs. Here's the jr_runid file from sparkHistoryLogs - it shows that the work was done, but nothing about openlineage or where the spark session was stopped before OL could do anything: { "Event": "SparkListenerApplicationStart", "App Name": "nativespark-check_python_-jr_<jrid>", "App ID": "spark-application-0", "Timestamp": 0, "User": "spark" } { "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", "executionId": 0, "description": "load at NativeMethodAccessorImpl.java:0", "details": "org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:282)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\npy4j.ClientServerConnection.run(ClientServerConnection.java:106)\njava.lang.Thread.run(Thread.java:750)", "physicalPlanDescription": "== Parsed Logical Plan ==\nGlobalLimit 1\n+- LocalLimit 1\n +- Filter (length(trim(value#7, None)) > 0)\n +- Project [value#0 AS value#7]\n +- Project [value#0]\n +- Relation [value#0] text\n\n== Analyzed Logical Plan ==\nvalue: string\nGlobalLimit 1\n+- LocalLimit 1\n +- Filter (length(trim(value#7, None)) > 0)\n +- Project [value#0 AS value#7]\n +- Project [value#0]\n +- Relation [value#0] text\n\n== Optimized Logical Plan ==\nGlobalLimit 1\n+- LocalLimit 1\n +- Filter (length(trim(value#0, None)) > 0)\n +- Relation [value#0] text\n\n== Physical Plan ==\nCollectLimit 1\n+- **(1) Filter (length(trim(value#0, None)) > 0)\n +- FileScan text [value#0] Batched: false, DataFilters: [(length(trim(value#0, None)) > 0)], Format: Text, Location: InMemoryFileIndex(1 paths)[<s3-csv-file>], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>\n", "sparkPlanInfo": { "nodeName": "CollectLimit", "simpleString": "CollectLimit 1", "children": [ { "nodeName": "WholeStageCodegen (1)", "simpleString": "WholeStageCodegen (1)", "children": [ { "nodeName": "Filter", "simpleString": "Filter (length(trim(value#0, None)) > 0)", "children": [ { "nodeName": "InputAdapter", "simpleString": "InputAdapter", "children": [ { "nodeName": "Scan text ", "simpleString": "FileScan text [value#0] Batched: false, DataFilters: [(length(trim(value#0, None)) > 0)], Format: Text, Location: InMemoryFileIndex(1 paths)[<s3-csv-file>], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>", "children": [], "metadata": { "Location": "InMemoryFileIndex(1 paths)[<s3-csv-file>]", "ReadSchema": "struct<value:string>", "Format": "Text", "Batched": "false", "PartitionFilters": "[]", "PushedFilters": "[]", "DataFilters": "[(length(trim(value#0, None)) > 0)]" }, "metrics": [ { "name": "number of output rows from cache", "accumulatorId": 14, "metricType": "sum" }, { "name": "number of files read", "accumulatorId": 15, "metricType": "sum" }, { "name": "metadata time", "accumulatorId": 16, "metricType": "timing" }, { "name": "size of files read", "accumulatorId": 17, "metricType": "size" }, { 
"name": "max size of file split", "accumulatorId": 18, "metricType": "size" }, { "name": "number of output rows", "accumulatorId": 13, "metricType": "sum" } ] } ], "metadata": {}, "metrics": [] } ], "metadata": {}, "metrics": [ { "name": "number of output rows", "accumulatorId": 12, "metricType": "sum" } ] } ], "metadata": {}, "metrics": [ { "name": "duration", "accumulatorId": 11, "metricType": "timing" } ] } ], "metadata": {}, "metrics": [ { "name": "shuffle records written", "accumulatorId": 9, "metricType": "sum" }, { "name": "shuffle write time", "accumulatorId": 10, "metricType": "nsTiming" }, { "name": "records read", "accumulatorId": 7, "metricType": "sum" }, { "name": "local bytes read", "accumulatorId": 5, "metricType": "size" }, { "name": "fetch wait time", "accumulatorId": 6, "metricType": "timing" }, { "name": "remote bytes read", "accumulatorId": 3, "metricType": "size" }, { "name": "local blocks read", "accumulatorId": 2, "metricType": "sum" }, { "name": "remote blocks read", "accumulatorId": 1, "metricType": "sum" }, { "name": "remote bytes read to disk", "accumulatorId": 4, "metricType": "size" }, { "name": "shuffle bytes written", "accumulatorId": 8, "metricType": "size" } ] }, "time": 0, "modifiedConfigs": {} } { "Event": "SparkListenerApplicationEnd", "Timestamp": 0 }

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:06:04

*Thread Reply:* I think this is related to job.commit(), which probably stops the context underneath
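For context on where that commit happens, a minimal sketch of the Glue pattern under discussion - paths, job args, and bucket names are placeholders, and it assumes the openlineage-spark jar is already on the job classpath (e.g. via --extra-jars):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Register the OpenLineage listener and log events to the driver console.
conf = SparkConf()
conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
conf.set("spark.openlineage.transport.type", "console")

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

df = spark.read.format("text").load("s3://<bucket>/<file>.csv")  # placeholder input
df.show(1)

job.commit()  # the suspect: if this stops the SparkContext, the listener never emits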

✅ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:06:33

*Thread Reply:* This is probably the same bug: https://github.com/OpenLineage/OpenLineage/issues/2513 but manifests differently

Rodrigo Maia (rodrigo.maia@manta.io)
2024-04-03 09:45:59

*Thread Reply:* can you try without the job.commit()?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 09:54:39

*Thread Reply:* Sure!

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 09:56:31

*Thread Reply:* BTW it makes sense that if the Spark listener is disabled, the OpenLineage integration shouldn’t even try. (If we removed that line, it doesn’t feel like the integration would actually work…)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 09:57:51

*Thread Reply:* you mean removing this? conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
If you don't set it, none of our code is actually loaded

Rodrigo Maia (rodrigo.maia@manta.io)
2024-04-03 09:59:25

*Thread Reply:* I meant removing the job.init and job.commit for testing purposes. Glue should work without them.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 12:47:03

*Thread Reply:* We removed job.commit, same error. Should we also remove job.init?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 12:48:06

*Thread Reply:* Won’t removing this change the functionality? job.init(args['JOB_NAME'], args)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 13:22:11

*Thread Reply:* interesting - maybe something else on Glue stops the job explicitly underneath?

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-03 13:38:02

*Thread Reply:* Will have a look.

DEEVITH NAGRAJ (deevithraj435@gmail.com)
2024-04-03 23:09:10

*Thread Reply:* Hi all, I'm working with Sheeri on this, so a couple of queries:

  1. Tried set("spark.openlineage.transport.location", "s3://<s3bucket>/sample.txt"); the job succeeds but there is no output in the sample.txt file. (There are some files created in /sparkHistoryLogs and /sparkHistoryLogs/output, but I don't see the OL output file there.)
  2. With set("spark.openlineage.transport.type", "console") the job fails with “UNCLASSIFIED_ERROR; An error occurred while calling o110.load. Cannot call methods on a stopped SparkContext.”
  3. If we are using http as transport.type, can we use basic auth instead of api_key?

❤️ Sheeri Cabral (Collibra)
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 05:32:05

*Thread Reply:* > 3. if we are using http as transport.type, then can we use basic auth instead of api_key? Would be good to add that to HttpTransport 🙂
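Until then, a small sketch of what basic auth would have to put on the wire, for comparison with the existing api_key support (which, as far as I can tell, is sent as a bearer-style Authorization header); the credentials are made up:

import base64

# Hypothetical credentials - a basic-auth provider would emit this header.
user, password = "marquez_user", "s3cret"
token = base64.b64encode(f"{user}:{password}".encode()).decode()
print(f"Authorization: Basic {token}")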

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 05:33:16

*Thread Reply:* > 1. Tried set("spark.openlineage.transport.location", "s3://<s3bucket>/sample.txt"); the job succeeds but there is no output in the sample.txt file.
Yeah, FileTransport does not work with object storage - it needs to be a regular filesystem. I don't know if we can make it work without pulling in a lot of dependencies and making it significantly more complex - but of course we'd like to see such a contribution
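A workaround sketch (not OpenLineage functionality): let the file transport write to the driver's local disk and copy the result to S3 yourself before the job exits, since the Glue driver filesystem is ephemeral. Bucket and key names are placeholders:

import boto3
from pyspark.conf import SparkConf

# Point the file transport at a local path instead of S3.
conf = SparkConf()
conf.set("spark.openlineage.transport.type", "file")
conf.set("spark.openlineage.transport.location", "/tmp/ol_events.json")

# ... build the SparkContext/session with this conf and run the job ...

# Ship the collected events to S3 at the very end of the job.
boto3.client("s3").upload_file("/tmp/ol_events.json", "<s3bucket>", "openlineage/ol_events.json")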

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-04 08:11:44

*Thread Reply:* @DEEVITH NAGRAJ yes, that’s why the PoC is to have the Spark lineage use the transport type of “console” - we can’t save to files in S3.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-04 08:12:54

*Thread Reply:* @DEEVITH NAGRAJ if we can get it to work in console, and CloudWatch shows us openlineage data, then we can change the transport type to an API and set up fluentd to collect the data.

BTW yesterday another customer got it working in console, and Rodrigo from this thread also saw it working in console, so we know it does work in general 😄

🙌 DEEVITH NAGRAJ
DEEVITH NAGRAJ (deevithraj435@gmail.com)
2024-04-04 11:47:20

*Thread Reply:* yes Sheeri, I agree we need to get it to work in the console. I don't see anything in CloudWatch, and the error is thrown when we set("spark.openlineage.transport.type", "console"): the job fails with “UNCLASSIFIED_ERROR; An error occurred while calling o110.load. Cannot call methods on a stopped SparkContext.”

Do we need to specify the Scala version in .set("spark.jars.packages", "io.openlineage:openlineage-spark:1.10.2"), like .set("spark.jars.packages", "io.openlineage:openlineage-spark_2.13:1.10.2")? Is that causing the issue?
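For reference, the Maven artifact name is hyphenated and recent releases publish Scala-suffixed artifacts; Glue 3.0/4.0 run Spark built against Scala 2.12, so a plausible setting (hypothetical for this job - verify against your runtime) would be:

conf.set("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.10.2")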

❤️ Sheeri Cabral (Collibra)
Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2024-04-04 14:03:37

*Thread Reply:* Awesome! We’ve got it so the job succeeds when we set the transport type to “console”. Anyone have any tips on where to find it in CloudWatch? The job itself has a dozen or so different logs and we’re clicking through all of them, but maybe there’s an easier way?

Mark de Groot (mdegroot@ilionx.com)
2024-04-03 10:15:27

Hi everyone, I started implementing OpenLineage in our solution 2 weeks ago, but I've run into some problems and quite frankly I don't understand what I'm doing wrong. The situation is: we are using Azure Synapse with notebooks and we want to pick up the data lineage. I have found a lot of documentation about Databricks in combination with OpenLineage, but there is not much documentation on Synapse in combination with OpenLineage. I've installed the newest library "openlineage-1.10.2" in the Synapse Apache Spark packages (so far so good). The next step was to configure the Apache Spark configuration; based on a blog I found, I filled in the following properties:
spark.extraListeners - io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host - <https://functionapp.azurewebsites.net/api/function>
spark.openlineage.namespace - synapse name
spark.openlineage.url.param.code - XXXX
spark.openlineage.version - 1

I’m not sure if the namespace is right - I think it's the name of the Synapse? But the moment I run the Synapse notebook (creating a simple dataframe) it shows me an error:

Py4JJavaError                             Traceback (most recent call last)
Cell In [5], line 1
----> 1 df = spark.read.load('abfss://bronsedomein1@xxxxxxxx.dfs.core.windows.net/adventureworks/vendors.parquet', format='parquet')
      2 display(df)

Py4JJavaError: An error occurred while calling o4060.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

I can’t figure out what I’m doing wrong, does somebody have a clue?

Thanks, Mark

Harel Shein (harel.shein@gmail.com)
2024-04-03 10:35:46

*Thread Reply:* this error seems unrelated to OpenLineage to me. Can you try removing all the openlineage-related properties from the config and testing this out, just to rule that out?

Mark de Groot (mdegroot@ilionx.com)
2024-04-03 10:39:30

*Thread Reply:* Hey Harel,

Mark de Groot (mdegroot@ilionx.com)
2024-04-03 10:40:49

*Thread Reply:* Yes, I removed all the related openlineage properties. And (of course 😉) it's working fine. But the moment I fill in the properties as mentioned above, it gives me the error.

Harel Shein (harel.shein@gmail.com)
2024-04-03 10:45:41

*Thread Reply:* thanks for checking, wanted to make sure. 🙂

👍 Mark de Groot
Harel Shein (harel.shein@gmail.com)
2024-04-03 10:48:03

*Thread Reply:* can you try only setting
spark.extraListeners = io.openlineage.spark.agent.OpenLineageSparkListener
spark.jars.packages = io.openlineage:openlineage-spark_2.12:1.10.2
spark.openlineage.transport.type = console
?
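For Synapse specifically, session-level settings can also be supplied from the notebook itself with the %%configure magic before the Spark session starts; a sketch of the same three settings (the Scala suffix is an assumption to verify against the pool's runtime):

%%configure -f
{
    "conf": {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.jars.packages": "io.openlineage:openlineage-spark_2.12:1.10.2",
        "spark.openlineage.transport.type": "console"
    }
}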

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 12:01:20

*Thread Reply:* @Mark de Groot are you stopping the job using spark.stop() or similar command?
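To illustrate the question, this is the class of failure that produces that message; the path is a placeholder:

# Sketch: something stops the context before a later operation runs.
df = spark.read.load('abfss://<container>@<account>.dfs.core.windows.net/some.parquet', format='parquet')
spark.stop()   # anything that stops the context here (explicitly or underneath)...
display(df)    # ...makes subsequent calls fail with "Cannot call methods on a stopped SparkContext"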

👍 Mark de Groot
Mark de Groot (mdegroot@ilionx.com)
2024-04-03 12:18:21

*Thread Reply:* So when I run with the default values in Synapse

Mark de Groot (mdegroot@ilionx.com)
2024-04-03 12:19:49

*Thread Reply:* Everything is working fine, but when I use the properties above I'm getting an error, e.g. when trying to create a DataFrame.

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 11:23:31

@channel Accenture+Confluent's Open Standards for Data Lineage roundtable is happening on April 25th, featuring:
• Kai Waehner (Confluent)
• @Mandy Chessell (Egeria)
• @Julien Le Dem (OpenLineage)
• @Jens Pfau (Google Cloud)
• @Ernie Ostic (Manta/IBM)
• @Sheeri Cabral (Collibra)
• Austin Kronz (Atlan)
• @Luigi Scorzato (moderator, Accenture)
Not to be missed! Register at the link.

events.confluent.io
🔥 Maciej Obuchowski
Bassim EL Baroudi (bassim.elbaroudi@gmail.com)
2024-04-03 12:58:12

Hi everyone, I'm trying to launch a Spark job with OpenLineage integration. The Spark version is 3.5.0. The configuration used:

spark.jars.packages=io.openlineage:openlineage-spark_2.12:1.10.2
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.url=http://marquez.dcp.svc.cluster.local:8087
spark.openlineage.namespace=pyspark
spark.openlineage.transport.type=http
spark.openlineage.facets.disabled="[spark.logicalPlan;]"
spark.openlineage.debugFacet=enabled

the spark job exits with the following error:

java.lang.NoSuchMethodError: 'org.apache.spark.sql.SQLContext org.apache.spark.sql.execution.SparkPlan.sqlContext()'
    at io.openlineage.spark.agent.lifecycle.ContextFactory.createSparkSQLExecutionContext(ContextFactory.java:32)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$getSparkSQLExecutionContext$4(OpenLineageSparkListener.java:172)
    at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1220)
    at java.base/java.util.Collections$SynchronizedMap.computeIfAbsent(Collections.java:2760)
    at io.openlineage.spark.agent.OpenLineageSparkListener.getSparkSQLExecutionContext(OpenLineageSparkListener.java:171)
    at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecStart(OpenLineageSparkListener.java:125)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:117)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
24/04/03 13:23:39 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/04/03 13:23:39 ERROR Utils: throw uncaught fatal error in thread spark-listener-group-shared
java.lang.NoSuchMethodError: 'org.apache.spark.sql.SQLContext org.apache.spark.sql.execution.SparkPlan.sqlContext()'
    (stack trace identical to the one above)
Exception in thread "spark-listener-group-shared" java.lang.NoSuchMethodError: 'org.apache.spark.sql.SQLContext org.apache.spark.sql.execution.SparkPlan.sqlContext()'
    (stack trace identical to the one above)
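A NoSuchMethodError like this usually points at a binary mismatch between the integration jar and the Spark/Scala runtime. A quick diagnostic sketch to run in the same session (the py4j gateway call is an assumption about the environment):

# Check the runtime versions so the artifact suffix (_2.12 vs _2.13) and the
# OpenLineage release can be matched against them.
print(spark.version)                                                   # Spark runtime version
print(spark.sparkContext._jvm.scala.util.Properties.versionString())   # Scala build of the runtime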

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-04 02:29:34

*Thread Reply:* Hey @Bassim EL Baroudi, what environment are you running the Spark job in? Is this a real-life production job, or are you able to provide a code snippet which reproduces it?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-04 03:31:29

*Thread Reply:* Do you get any OpenLineage events, like START events, and see this exception at the end of the job, or does it occur at the beginning, resulting in no events emitted?

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 16:16:41

@channel This month’s TSC meeting is next Wednesday the 10th at 9:30am PT. On the tentative agenda (additional items TBA):
• announcements
  ◦ upcoming events including the Accenture+Confluent roundtable on 4/25
• recent release highlights
• discussion items
  ◦ supporting job-to-job, as opposed to job-dataset-job, dependencies in the spec
  ◦ improving naming
• open discussion
More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? Reply here or DM me to be added to the agenda.

openlineage.io
👍 Paweł Leszczyński, Sheeri Cabral (Collibra), Maciej Obuchowski
Francisco Morillo (fmorillo@amazon.es)
2024-04-03 22:19:15

Hi! How can I pass multiple Kafka brokers when using OpenLineage with Flink? It appears Marquez doesn't allow namespaces with commas:

namespace 'broker1,broker2,broker3' must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), at (@), plus (+), dashes (-), colons (:), equals (=), semicolons (;), slashes (/) or dots (.) with a maximum length of 1024 characters.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-04 02:36:19

*Thread Reply:* Kafka dataset naming already has an open issue -> https://github.com/OpenLineage/OpenLineage/issues/560

I think the problem you raised deserves a separate one. Feel free to create it. I think we can still change the broker separator to a semicolon.
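As a stopgap while that issue is open, the validation message itself lists semicolons as allowed, so the substitution is mechanical (the kafka:// prefix follows the naming spec, but treat it as an assumption here):

# Join multiple brokers with ";" - commas fail Marquez's namespace validation.
brokers = ["broker1:9092", "broker2:9092", "broker3:9092"]
namespace = "kafka://" + ";".join(brokers)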

Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 17:46:31

FYI I've moved https://github.com/OpenLineage/OpenLineage/pull/2489 to https://github.com/OpenLineage/OpenLineage/pull/2578 - I mistakenly included a couple of merge commits during git rebase --signoff. Hopefully the tests should pass now (there were a couple of macro templates that still reported the old arguments). Is it still in time to be squeezed into 1.11.0? It's not super-crucial (for us at least), since we have already copied the code of those macros into our operators implementation, but since the same fix has already been merged on the Airflow side it'd be good to keep things in sync (cc @Maciej Obuchowski @Kacper Muda)

👀 Maciej Obuchowski
Fabio Manganiello (fabio.manganiello@booking.com)
2024-04-04 18:43:05

*Thread Reply:* The tests are passing now

Francisco Morillo (fmorillo@amazon.es)
2024-04-05 01:37:57

I wanted to ask if there is any roadmap for adding support for more Flink sources and sinks to OpenLineage, for example:
• Kinesis
• Hudi
• Iceberg SQL
• Flink CDC
• Opensearch
or how one can contribute to those?

Kacper Muda (kacper.muda@getindata.com)
2024-04-05 02:48:41

*Thread Reply:* Hey, if you feel like contributing, take a look at our contributors guide 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 07:14:55

*Thread Reply:* I think the most important thing on the Flink side is working with the Flink community on implementing https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener - as this allows us to move the implementation to the dedicated connectors

dolfinus (martinov_m_s_@mail.ru)
2024-04-05 09:47:22

👋 Hi everyone!

👋 Michael Robinson, Jakub Dardziński, Harel Shein, Damien Hawes
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 09:56:35

*Thread Reply:* Hello 👋

Michael Robinson (michael.robinson@astronomer.io)
2024-04-05 11:30:01

@channel We released OpenLineage 1.11.3, featuring a new package to support built-in lineage in Spark extensions and a telemetry mechanism in the Spark integration, among many other additions and fixes.
Additions:
• Common: add support for SCRIPT-type jobs in BigQuery #2564 @kacpermuda
• Spark: support for built-in lineage extraction #2272 @pawel-big-lebowski
• Spark/Java: add support for Micrometer metrics #2496 @mobuchowski
• Spark: add support for telemetry mechanism #2528 @mobuchowski
• Spark: support query option on table read #2556 @mobuchowski
• Spark: change SparkPropertyFacetBuilder to support recording Spark runtime #2523 @Ruihua98
• Spec: add fileCount to dataset stat facets #2562 @dolfinus
There were also many bug fixes -- please see the release notes for details. Thanks to all the contributors, with a shout out to new contributor @dolfinus (who contributed 5 PRs to the release and already has 4 more open!) and @Maciej Obuchowski and @Jakub Dardziński for the after-hours CI fixes!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.11.3
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.10.2...1.11.3
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

🔥 Maciej Obuchowski, Jorge, taosheng shi, Ricardo Gaspar
🚀 Maciej Obuchowski, taosheng shi
taosheng shi (taoshengshi01@gmail.com)
2024-04-05 12:21:34

👋 Hi everyone!

taosheng shi (taoshengshi01@gmail.com)
2024-04-05 12:22:10

*Thread Reply:* This is Taosheng from GitData Labs (https://gitdata.ai/), and we are building a data versioning tool for responsible AI/ML:

A Git-like version control file system for data lineage & data collaboration. https://github.com/GitDataAI/jiaozifs

<https://jiaozifs.com>
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:23:38

*Thread Reply:* hello 👋

👋 taosheng shi
taosheng shi (taoshengshi01@gmail.com)
2024-04-05 12:26:56

*Thread Reply:* I came across OpenLineage on Google, and I think we would be able to contribute with our products & skills. I was thinking we could start sharing some of them here, and see if there is something that feels like it could be interesting to co-build on/through OpenLineage and co-market together.

❤️ Sheeri Cabral (Collibra)
taosheng shi (taoshengshi01@gmail.com)
2024-04-05 12:27:06

*Thread Reply:* Would somebody be open to discussing any open opportunities for us together?

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-04-05 14:55:20

*Thread Reply:* 👋 welcome and thanks for joining!

Francisco Morillo (fmorillo@amazon.es)
2024-04-08 03:02:10

Hi Everyone! Wanted to implement cross-stack data lineage across Flink and Spark, but it seems that the Iceberg table gets registered as different datasets in each (Spark at the top, Flink at the bottom), so it doesn't get added to the same DAG. In Spark, the Iceberg table gets the database added to the name. I'm seeing that @Paweł Leszczyński committed Spark/Flink Unify Dataset naming from URI objects (https://github.com/OpenLineage/OpenLineage/pull/2083/files#), so not sure what could be going on

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-08 04:53:53

*Thread Reply:* Looks like this method https://github.com/OpenLineage/OpenLineage/blob/1.11.3/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/PathUtils.java#L164 creates the name from (table + database).

In general, I would say we should add a naming convention here -> https://openlineage.io/docs/spec/naming/ . I think the db.table format is fine, as we're using it for other sources.

IcebergSinkVisitor in the Flink integration does not seem to add a symlink facet pointing to the Iceberg table with the schema included. You could try extending it with a dataset symlink facet, as done for Spark.

openlineage.io
Francisco Morillo (fmorillo@amazon.es)
2024-04-08 06:35:59

*Thread Reply:* How do you suggest we do so? Creating a PR extending IcebergSinkVisitor, or doing it manually through Spark as in this example: https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb

Francisco Morillo (fmorillo@amazon.es)
2024-04-08 07:26:35

*Thread Reply:* is there any way to create a symlink via the Marquez API?

Francisco Morillo (fmorillo@amazon.es)
2024-04-08 07:26:44

*Thread Reply:* trying to figure out what's the easiest approach

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-08 07:44:54

*Thread Reply:* there are two possible conventions for pointing to an Iceberg dataset:
• its physical location
• namespace pointing to the Iceberg catalog, name pointing to schema+table
The Flink integration uses the physical location only. IcebergSinkVisitor should add an additional facet - the dataset symlink facet.
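For anyone who wants to add the symlink by hand in the meantime, a minimal sketch with the openlineage-python client (facet class names as I recall them from the client; every namespace and table name below is a placeholder):

from openlineage.client.facet import (
    SymlinksDatasetFacet,
    SymlinksDatasetFacetIdentifiers,
)
from openlineage.client.run import Dataset

# Identify the dataset by its physical location, and attach a symlink to the
# catalog-style (schema + table) identity, mirroring what the Spark
# integration emits for Iceberg.
dataset = Dataset(
    namespace="s3://warehouse-bucket",
    name="warehouse/db/table",
    facets={
        "symlinks": SymlinksDatasetFacet(
            identifiers=[
                SymlinksDatasetFacetIdentifiers(
                    namespace="iceberg://catalog",
                    name="db.table",
                    type="TABLE",
                )
            ]
        )
    },
)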

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-08 07:46:37
Francisco Morillo (fmorillo@amazon.es)
2024-04-08 15:01:10

*Thread Reply:* I have been testing by first modifying the event that gets emitted, but in the lineage I am seeing duplicate datasets, as the physical location for Flink is also different from the one Spark uses.