Harel Shein (harel.shein@gmail.com)
2023-11-14 12:13:06

@Harel Shein has joined the channel

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-14 12:13:10

@Maciej Obuchowski has joined the channel

Julien Le Dem (julien@apache.org)
2023-11-14 12:13:46

@Julien Le Dem has joined the channel

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-14 12:13:46

@Paweł Leszczyński has joined the channel

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-14 12:13:46

@Jakub Dardziński has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2023-11-14 12:13:46

@Michael Robinson has joined the channel

Willy Lulciuc (willy@datakin.com)
2023-11-14 12:13:46

@Willy Lulciuc has joined the channel

Peter Hicks (peter.hicks@astronomer.io)
2023-11-14 12:13:46

@Peter Hicks has joined the channel

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-14 12:13:57

👋

Ross Turk (ross@rossturk.com)
2023-11-14 12:14:02

@Ross Turk has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2023-11-14 12:16:19

👋

Julien Le Dem (julien@apache.org)
2023-11-14 12:18:42

👋

Willy Lulciuc (willy@datakin.com)
2023-11-14 12:18:53

👋

Ross Turk (ross@rossturk.com)
2023-11-14 12:29:47

🌊

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-14 13:53:08

👋

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-14 18:30:48
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-15 04:35:37

*Thread Reply:* hey look, more fun https://github.com/OpenLineage/OpenLineage/pull/2263

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-15 05:03:58

*Thread Reply:* nice to have fun with you Jakub

🙂 Jakub Dardziński, Harel Shein, Willy Lulciuc
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 05:42:34

*Thread Reply:* Can't wait to see it on the 1st January.

Harel Shein (harel.shein@gmail.com)
2023-11-15 06:56:03

*Thread Reply:* Ain’t no party like a dev ex improvement party

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-15 11:45:53

*Thread Reply:* Gentoo installation party is in similar category of fun

Willy Lulciuc (willy@datakin.com)
2023-11-15 03:32:27

@Paweł Leszczyński approved PR #2661 with minor comments, I think the enum defined in the db layer is one comment we’ll need to address before merging; otherwise solid work dude 👌

🙌 Paweł Leszczyński, Harel Shein
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:34:42

_Minor_: We can consider defining a _run_state column and eventually dropping the event_type. That is, we can consider columns prefixed with _ to be "remappings" of OL properties to Marquez. -> didn't get this one. Is it for now or some future plans?

Willy Lulciuc (willy@datakin.com)
2023-11-15 03:36:02

*Thread Reply:* future 😉

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:36:10

*Thread Reply:* ok

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:36:23

*Thread Reply:* I will then replace enum with string

Willy Lulciuc (willy@datakin.com)
2023-11-15 03:36:10

also, what about this PR? https://github.com/MarquezProject/marquez/pull/2654

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:36:33

*Thread Reply:* this is the next to go

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:36:38

*Thread Reply:* and i consider it ready

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:37:31

*Thread Reply:* Then we have a draft one with streaming support https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of lineage endpoint working for streaming jobs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-15 03:38:32

*Thread Reply:* I still need to work on #2682 but you can review #2654. once you get some sleep, of course 😉

❤️ Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-15 11:44:44

Got the doc + poc for hook-level coverage: https://docs.google.com/document/d/1q0shiUxopASO8glgMqjDn89xigJnGrQuBMbcRdolUdk/edit?usp=sharing

👀 Jakub Dardziński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-15 12:24:27

*Thread Reply:* did you check if LineageCollector is instantiated once per process?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-15 12:26:37

*Thread Reply:* Using it only via get_hook_lineage_collector
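
A minimal sketch of the "once per process" concern raised above, assuming a module-level accessor that lazily creates a single collector. The names `LineageCollector` and `get_hook_lineage_collector` are taken from the thread; the bodies below are hypothetical and are not the code from the linked doc or POC.

```python
from typing import List, Optional


class LineageCollector:
    """Hypothetical stand-in: gathers dataset URIs reported by hooks."""

    def __init__(self) -> None:
        self.datasets: List[str] = []

    def add(self, uri: str) -> None:
        self.datasets.append(uri)


# One shared instance per Python process, created on first use.
_collector: Optional[LineageCollector] = None


def get_hook_lineage_collector() -> LineageCollector:
    """Return the process-wide collector, creating it if needed."""
    global _collector
    if _collector is None:
        _collector = LineageCollector()
    return _collector


# Two hooks asking for the collector end up sharing the same object.
get_hook_lineage_collector().add("postgres://db/public.orders")
get_hook_lineage_collector().add("s3://bucket/exports/orders.csv")
```

If every caller goes through the accessor, as described in the reply, the class itself does not need to enforce single instantiation. Separate worker processes would still each get their own instance.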

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-15 12:17:31

is it time to support hudi?

😂 Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-11-15 14:57:10

Anyone have thoughts about how to address the question about “pain points” here? https://openlineage.slack.com/archives/C01CK9T7HKR/p1700064564825909. (Listing pros is easy — it’s the cons we don’t have boilerplate for)

Naresh reddy (https://openlineage.slack.com/team/U066HKFCHUG)
Michael Robinson (michael.robinson@astronomer.io)
2023-11-15 14:58:08

*Thread Reply:* Maybe something like “OL has many desirable integrations, including a best-in-class Spark integration, but it’s like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time.”

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-15 16:04:51

*Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"

Michael Robinson (michael.robinson@astronomer.io)
2023-11-15 14:59:19

Apparently an admin can view a Slack archive at any time at this URL: https://openlineage.slack.com/services/export. Only public channels are available, though.

Julien Le Dem (julien@apache.org)
2023-11-15 16:53:09

*Thread Reply:* you are now admin

👍 Michael Robinson
Willy Lulciuc (willy@datakin.com)
2023-11-15 17:32:26
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-15 17:33:19

*Thread Reply:* we have it in SQL operators

Willy Lulciuc (willy@datakin.com)
2023-11-15 17:34:25

*Thread Reply:* OOh any docs / code? or if you’d like to respond in the MQZ slack 🙏

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-15 17:35:19

*Thread Reply:* I’ll reply there

❤️ Willy Lulciuc, Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-11-15 17:50:23

Any opinions about a free task management alternative to the free version of Notion (10-person limit)? Looking at Trello for keeping track of talks.

Harel Shein (harel.shein@gmail.com)
2023-11-15 19:32:17

*Thread Reply:* What about GitHub projects?

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 09:27:46

*Thread Reply:* Projects is the way to go, thanks

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 10:23:34

*Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that’s missing that we could use is a built-in date field for alerting about upcoming deadlines…

🙌 Harel Shein, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 09:31:24

worlds are colliding: 6point6 has been acquired by Accenture

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 09:31:59

*Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-16 10:03:27

*Thread Reply:* We should sell OL to governments

🙃 Harel Shein
Harel Shein (harel.shein@gmail.com)
2023-11-16 10:20:36

*Thread Reply:* we may have to rebrand to ClosedLineage

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-16 10:23:37

*Thread Reply:* not in this way; just emit any event second time to secret NSA endpoint

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 11:13:17

*Thread Reply:* we would need to improve our stock photo game

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-16 12:17:22

CFP for Berlin Buzzwords went up: https://2024.berlinbuzzwords.de/call-for-papers/ Still over 3 months to submit 🙂

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 12:42:56

*Thread Reply:* thanks, updated the talks board

Michael Robinson (michael.robinson@astronomer.io)
2023-11-16 12:43:10
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-16 15:19:53

*Thread Reply:* I'm in, will think what to talk about and appreciate any advice 🙂

Julien Le Dem (julien@apache.org)
2023-11-17 13:42:19

just searching for OpenLineage in the Datahub code base. They have an “interesting” approach? https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd[…]odules/airflow-plugin/src/datahubairflowplugin/_extractors.py

Julien Le Dem (julien@apache.org)
2023-11-17 13:47:21

*Thread Reply:* It looks like the datahub airflow plugin uses OL, but turns it off: https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
disable_openlineage_plugin | true | Disable the OpenLineage plugin to avoid duplicative processing.
They reuse the extractors but then “patch” the behavior.

Julien Le Dem (julien@apache.org)
2023-11-17 13:48:52

*Thread Reply:* Of course this approach will need changing again with AF 2.7

Julien Le Dem (julien@apache.org)
2023-11-17 13:49:02

*Thread Reply:* It’s their choice 🤷

Julien Le Dem (julien@apache.org)
2023-11-17 13:51:23

*Thread Reply:* It looks like we can possibly learn from their approach in SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-17 16:42:51

*Thread Reply:* what's that approach? I only know they have been claiming best SQL parsing capabilities

Julien Le Dem (julien@apache.org)
2023-11-17 20:54:48

*Thread Reply:* I haven’t looked in the details but I’m assuming it is in this repo. (my comment is entirely based on the claim here)

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-20 02:58:07

*Thread Reply:* https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql -> The interesting difference is that, to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.

My understanding by example is: if you do create table x as select * from y, you need to resolve * to know column-level lineage. Our approach is to do that on the client side, probably with an extra call to the database. Their approach is to do that based on the data catalog information.
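
A rough sketch of the `select *` case above, assuming the schema for `y` is already available as a plain dictionary. Whether that dictionary is filled by an extra database call on the client or by a data catalog is exactly the difference being discussed. The function names and the `CATALOG` contents are hypothetical, and no real SQL parser is involved.

```python
from typing import Dict, List

# Hypothetical schema source: table name -> ordered column names.
CATALOG: Dict[str, List[str]] = {"y": ["id", "name", "created_at"]}


def expand_star(select_columns: List[str], source_table: str,
                catalog: Dict[str, List[str]]) -> List[str]:
    """Replace a bare `*` with the source table's columns from the catalog."""
    expanded: List[str] = []
    for col in select_columns:
        if col == "*":
            expanded.extend(catalog[source_table])
        else:
            expanded.append(col)
    return expanded


def column_lineage(target_table: str, select_columns: List[str],
                   source_table: str,
                   catalog: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each target column to the source column it is copied from."""
    return {
        f"{target_table}.{col}": [f"{source_table}.{col}"]
        for col in expand_star(select_columns, source_table, catalog)
    }


# create table x as select * from y
lineage = column_lineage("x", ["*"], "y", CATALOG)
```

Without the schema lookup, the only thing known is that `x` depends on `y`. The per-column mapping exists only once `*` has been resolved.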

Julien Le Dem (julien@apache.org)
2023-11-17 20:56:54

I’m off on vacation. See you in a week

❤️ Jakub Dardziński, Maciej Obuchowski, Paweł Leszczyński, Harel Shein, Ross Turk, Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 05:23:31

Maybe move today's meeting earlier, since no one from west coast is joining? @Harel Shein

Harel Shein (harel.shein@gmail.com)
2023-11-21 09:27:22

*Thread Reply:* Ah! That would have been a good idea, but I can’t :(

Harel Shein (harel.shein@gmail.com)
2023-11-21 09:27:44

*Thread Reply:* Do you prefer an earlier meeting tomorrow?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-21 09:28:54

*Thread Reply:* maybe let's keep today's meeting then

👍 Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-11-22 09:23:31

The full project history is now available at https://openlineage.github.io/slack-archives/. Check it out!

🙌 Harel Shein, Maciej Obuchowski, Paweł Leszczyński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-22 09:25:38

*Thread Reply:* nice one! 👏

Harel Shein (harel.shein@gmail.com)
2023-11-22 09:49:12

*Thread Reply:* very cool!

Ross Turk (ross@rossturk.com)
2023-11-22 12:08:24

*Thread Reply:* tfw you thought the scrollback was gone 😳

Harel Shein (harel.shein@gmail.com)
2023-11-22 12:09:41

*Thread Reply:* slack has a good activation story, I wonder how much longer they can keep this up for

Ross Turk (ross@rossturk.com)
2023-11-22 12:10:08

*Thread Reply:* always nice to be reminded that there are no actual incremental costs on their end

Harel Shein (harel.shein@gmail.com)
2023-11-22 12:11:23

*Thread Reply:* I guess it’s the difference between storing your data in memory vs. on a glacier 🧊

Ross Turk (ross@rossturk.com)
2023-11-22 12:36:12

*Thread Reply:* ah yes surely there is some tiering going on

Harel Shein (harel.shein@gmail.com)
2023-11-27 11:44:20

anyone seen this PR from decathlon?

Willy Lulciuc (willy@datakin.com)
2023-11-27 14:32:03

*Thread Reply:* i might get to marquez slack/PRs today, but most likely tmr morning

Harel Shein (harel.shein@gmail.com)
2023-11-27 14:32:52

*Thread Reply:* If you’re looking for priorities, it would be really great if you could give feedback on one of @Paweł Leszczyński streaming support PRs today

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-11-27 14:34:01

*Thread Reply:* ok, I’ll get to the streaming PR first

Willy Lulciuc (willy@datakin.com)
2023-11-27 14:34:29

*Thread Reply:* FYI, the namespace filtering is a good idea, just needs some feedback on impl / naming

Michael Robinson (michael.robinson@astronomer.io)
2023-11-27 14:31:11

Jens would like to know if there’s anything we want included in the welcome portion of the slide deck. Suggestions? (Aside from the usual links)

Willy Lulciuc (willy@datakin.com)
2023-11-28 02:35:56

@Paweł Leszczyński I reviewed your PR today (mainly the logic on versioning for streaming jobs); here is the main versioning limitation for jobs: a new JobVersion is created only when a job run completes or fails (or is in the done state); that is, we don’t know if we have received all the input/output datasets so we hold off on creating a new job version until we do.

For streaming, we’ll need to create a job version on start. Do we assume we have all input/output datasets associated with the streaming job? Does OpenLineage guarantee this to be the case for streaming jobs? Having versioning logic for batch vs streaming is a reasonable solution, just want to clarify

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-28 02:38:00

*Thread Reply:* yes, the logic adds a distinction in how to create a job version per processing type. For streaming, I think it makes more sense to create it at the beginning. Then, within other events of the same run, we need to check if the version has changed, and create a new version in that case

Willy Lulciuc (willy@datakin.com)
2023-11-28 02:48:40

*Thread Reply:* would we want to use the same versioning func Utils.newJobVersionFor() for streaming? That is, should we assume the input/output datasets contained within the OL event to be the “current” set for the streaming job?

Willy Lulciuc (willy@datakin.com)
2023-11-28 02:50:19

*Thread Reply:* that is, 2 input streams, 1 output stream (version 1), then 1 input stream, 2 output streams (version 2) ...

Willy Lulciuc (willy@datakin.com)
2023-11-28 02:51:16

*Thread Reply:* but what about the case when the in/out streams are not present: 1 input stream, 2 output streams (version 2), then 1 input stream, 0 output streams (version 3) ...

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-28 03:08:36

*Thread Reply:* The meaning for the streaming events should be slightly different.

For batch, input and output datasets are cumulative from all the events. If we have an event with output datasets A + B, then another event with output datasets B + C, then we assume the job has output datasets A + B + C.

For streaming, we may have a streaming job that for a week was reading data from topics A + B, and then in the next week it was reading from B + C. I think this should be mimicked in different job versions. Making it cumulative for jobs that run for several weeks does not make that much sense to me. The problem here is: what happens if a producer emits some extra events with no input/output datasets specified, like amount of bytes read? Shall we treat it as a new version? If not, why not?

This part is missing in the PR, and our Flink integration always sends all the input & output datasets. I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets. However, I can't see any clean and generic solution to this.
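
To make the batch vs. streaming distinction above concrete, here is a small sketch under stated assumptions: batch datasets accumulate across events, while a streaming job gets a new version whenever the reported dataset set changes, and events with no datasets at all are skipped. `BatchJob` and `StreamingJob` are hypothetical illustration classes, not Marquez code, and this is not what `Utils.newJobVersionFor()` does.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Set, Tuple

# A job version is identified here by its (inputs, outputs) pair.
IO = Tuple[FrozenSet[str], FrozenSet[str]]


@dataclass
class BatchJob:
    inputs: Set[str] = field(default_factory=set)
    outputs: Set[str] = field(default_factory=set)

    def on_event(self, inputs: Iterable[str], outputs: Iterable[str]) -> None:
        # Batch: datasets are cumulative over all events of the run.
        self.inputs |= set(inputs)
        self.outputs |= set(outputs)


@dataclass
class StreamingJob:
    versions: List[IO] = field(default_factory=list)

    def on_event(self, inputs: Iterable[str], outputs: Iterable[str]) -> bool:
        """Return True if this event created a new job version."""
        current = (frozenset(inputs), frozenset(outputs))
        if not current[0] and not current[1]:
            # e.g. a metrics-only event (bytes read): keep the current version.
            return False
        if self.versions and self.versions[-1] == current:
            return False
        self.versions.append(current)
        return True


batch = BatchJob()
batch.on_event([], ["A", "B"])
batch.on_event([], ["B", "C"])

stream = StreamingJob()
week_one = stream.on_event(["A", "B"], ["sink"])
metrics_only = stream.on_event([], [])
week_two = stream.on_event(["B", "C"], ["sink"])
```

The sketch relies on the producer always sending the full dataset set with every dataset-bearing event, which the thread notes is true of the Flink integration but is not a general guarantee.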

Willy Lulciuc (willy@datakin.com)
2023-11-28 14:07:01

*Thread Reply:* > The problem here is: what happens if a producer emits some extra events with no input/output datasets specified, like amount of bytes read? Shall we treat it as a new version? If not, why not?
We can view the bytes read as additional metadata about the job's inputs/outputs that wouldn’t trigger a new version (for the job or dataset). I would associate the bytes with the current dataset version and sum them up (I’ve read X bytes from dataset version D); you can also view tags in a similar way. In our current versioning logic for datasets, we create a new dataset version when a job completes. I think we’ll want to do something similar for streaming jobs; that is, when X bytes are written to a given dataset, that would trigger a new version

Willy Lulciuc (willy@datakin.com)
2023-11-28 14:11:17

*Thread Reply:* > I can add extra logic that will prevent creating new job version if event has no input nor output datasets
Yes, if no in/out datasets are present, then I wouldn’t create a new job version. @Julien Le Dem opened an issue a while back about this: https://github.com/MarquezProject/marquez/issues/1513. That is, there’s a difference between an empty set [ ] and null
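
One way to read the empty set vs. null distinction, sketched with plain JSON parsing: a missing or null `inputs`/`outputs` field means the event says nothing about datasets, while `[]` is an explicit statement that there are none. The helper name `reports_datasets` is hypothetical, and what an explicit empty list should do to the job version is left open here, as it is in the thread.

```python
import json
from typing import Any, Dict


def reports_datasets(event: Dict[str, Any]) -> bool:
    """True if the event explicitly carries inputs or outputs, even empty ones."""
    return event.get("inputs") is not None or event.get("outputs") is not None


# A metrics-only event: the dataset fields are absent.
metrics_only = json.loads('{"eventType": "RUNNING"}')
# The fields are present but null: treated the same as absent.
explicit_null = json.loads('{"eventType": "RUNNING", "inputs": null, "outputs": null}')
# The producer states that the job currently has no datasets.
explicit_empty = json.loads('{"eventType": "RUNNING", "inputs": [], "outputs": []}')
```

Under this reading, only events for which `reports_datasets` returns True would be candidates for changing the job version.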

Willy Lulciuc (willy@datakin.com)
2023-11-28 14:11:51

*Thread Reply:* > This part is missing in PR and our Flink integration always sends all the input & datasets
This is very important to note in the code and/or API docs

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-29 04:48:36

*Thread Reply:* Sure we should. Just wanted to make sure this is the way we want to go.

🙏 Willy Lulciuc
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-30 09:53:22

*Thread Reply:* @Willy Lulciuc did you have a chance to look at this as well https://github.com/MarquezProject/marquez/pull/2654 ? This should be merged before streaming support I believe.

👀 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-11-30 14:13:27

*Thread Reply:* ahh sorry, I hadn’t realized they were related / dependent on one another. sure I’ll give the PR a pass

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 09:15:30

*Thread Reply:* I looked into your comments and found them, as always, really useful. I introduced changes based on most of them. Please take a look at my responses within the Job model class. I think there is one issue we still need to discuss.

What to do with the existing type field? I would opt for deprecating it, since within the introduced job facet jobType stands for QUERY|COMMAND|DAG|TASK|JOB|MODEL, while processingType determines if a job is batch or streaming.

1. One solution I see is deprecating type and introducing a JobLabels class as a property within Job, with fields like jobType, processingType, integration.

2. Another would be to send processingType within the existing type field. This would mimic the existing API, but require further work. The disadvantage is that we still have a mismatch between job type in Marquez and the OpenLineage spec.

I would opt for (2), but (1) works for me as well.
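
A sketch of what option (1) could look like, written in Python for brevity even though the Marquez model is Java. The field names come from the message above; the enum values for `jobType` are the ones listed there, the two `processingType` values are assumed from "batch or streaming", and the class shapes themselves are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobType(Enum):
    QUERY = "QUERY"
    COMMAND = "COMMAND"
    DAG = "DAG"
    TASK = "TASK"
    JOB = "JOB"
    MODEL = "MODEL"


class ProcessingType(Enum):
    BATCH = "BATCH"
    STREAMING = "STREAMING"


@dataclass(frozen=True)
class JobLabels:
    """Hypothetical grouping of the three labels from the proposal."""
    job_type: Optional[JobType] = None
    processing_type: Optional[ProcessingType] = None
    integration: Optional[str] = None


@dataclass
class Job:
    name: str
    labels: JobLabels


flink_job = Job(
    name="orders_enrichment",
    labels=JobLabels(
        job_type=JobType.JOB,
        processing_type=ProcessingType.STREAMING,
        integration="FLINK",
    ),
)
```

Keeping the three labels in their own object leaves the existing `type` field untouched while it is deprecated, which is the trade-off against option (2).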

Michael Robinson (michael.robinson@astronomer.io)
2023-11-28 16:57:50

I’m working on a redesign of the Ecosystem page for a more modern, user-friendly layout. It’s a work in progress, and feedback is welcome: https://github.com/OpenLineage/docs/pull/258.

Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 11:37:41

Can someone count the folks in the room please? Can’t see anyone other than the speaker

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-29 11:39:11
❤️ Willy Lulciuc
🔥 Willy Lulciuc
🚀 Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-29 12:08:38

@Michael Robinson can you hear the questions?

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 12:16:11

*Thread Reply:* I could hear all but one of the questions after the first talk

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-29 12:16:39

*Thread Reply:* Oh then it's better than I thought

Ross Turk (ross@rossturk.com)
2023-11-29 12:43:44

I just had a lovely conversation at reinvent with the CTO of dbt, Connor, and didn’t even know it was him until the end 🤯

🙃 Harel Shein, Willy Lulciuc, Jakub Dardziński, Paweł Leszczyński, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 14:38:01

Congrats on a great event!

🙌 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 05:35:30

*Thread Reply:* Yeah it was pretty nice 🙂 A lot of good discussions with Google people. Also Jarek Potiuk was there

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 05:36:29

*Thread Reply:* I think it won't be the last Warsaw OpenLineage meetup

🎉 Paweł Leszczyński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-30 07:17:49

https://openlineage.slack.com/archives/C01CK9T7HKR/p1701288000527449 putting it here. I don’t feel like I’m the best person to answer but I feel like operational lineage which we’re trying to provide is the thing

Stefan Krawczyk (https://openlineage.slack.com/team/U065SAYCS5C)
Harel Shein (harel.shein@gmail.com)
2023-11-30 09:26:25

created a project for Blog post ideas: https://github.com/orgs/OpenLineage/projects/5/views/1

:gratitude_thank_you: Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-11-30 11:15:28

Release update: we’re due for an OpenLineage release and overdue for a Marquez release. As tomorrow, the first, is a Friday, we should wait until Monday at the earliest. I’m planning to open a vote for an OL release then, but Marquez is red so I’m holding off on a Marquez release for the time being.

Willy Lulciuc (willy@datakin.com)
2023-11-30 19:50:45

*Thread Reply:* I can address the red CI status, it’s bc we’re seeing issues publishing our snapshots

Willy Lulciuc (willy@datakin.com)
2023-11-30 19:51:09

*Thread Reply:* I think we should release Marquez on Mon. as well

Michael Robinson (michael.robinson@astronomer.io)
2023-11-30 19:52:12

*Thread Reply:* 👍

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-01 05:04:26

*Thread Reply:* I want to get this https://github.com/OpenLineage/OpenLineage/pull/2284 into OL release

Willy Lulciuc (willy@datakin.com)
2023-11-30 19:51:43

would be interesting if we can use this comparison as a learning (improve docs, etc): https://blog.singleorigin.tech/race-to-the-finish-line-age/

➕ Harel Shein, Paweł Leszczyński
Willy Lulciuc (willy@datakin.com)
2023-11-30 19:52:10

or rather, use the format for comparing OL with other tools 😉

👀 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 10:12:21

*Thread Reply:* It would be nice to have something like this (I would want it to be a little more even-handed, though). It will be interesting to see if they will ever update this now that there’s automated lineage from Airflow supported by OL

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 08:58:19

Review needed of the newsletter section on Airflow Provider progress @Jakub Dardziński @Maciej Obuchowski when you have a moment. It will ship by 5 PM ET today, fyi. Already shared it with you. Thanks!

👍 Maciej Obuchowski
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-01 10:45:56

*Thread Reply:* LGTM 👍

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 11:02:04

*Thread Reply:* Thanks @Jakub Dardziński

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 10:21:23

They finally uploaded the OpenLineage Airflow Summit videos to the Airflow channel on YT: https://www.youtube.com/@ApacheAirflow/videos

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 11:13:14

On Monday I’m meeting with someone at Confluent about organizing a meetup in London in January. I’m thinking I’ll suggest Jan. 24 or 31 as mid-week days work better and folks need time to come back from the vacation. If you have thoughts on this, would you please let me know by 10:00 am ET on Monday? Also, standup will be happening before the meeting — perhaps we can discuss it then. @Harel Shein

👍 Harel Shein, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-12-04 10:20:59

*Thread Reply:* Confluent says January 31st will work for them for a London meetup, and they’ll be providing a speaker as well. Is it safe to firm this up with them?

🎉 Harel Shein, Maciej Obuchowski
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 10:37:42

*Thread Reply:* I'd say yes; if Maciej doesn't get a new passport by then, I can speak

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-04 10:44:17

*Thread Reply:* I already got the photos 😂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 10:44:41

*Thread Reply:* you gotta share them

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-04 10:45:36

*Thread Reply:* Also apparently it's possible to get temporary passport at airport in 15 minutes

Michael Robinson (michael.robinson@astronomer.io)
2023-12-04 10:57:40

*Thread Reply:* How civilized...

Harel Shein (harel.shein@gmail.com)
2023-12-04 17:07:10

*Thread Reply:* if the price is right 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 17:08:55

*Thread Reply:* you can get it at the Warsaw airport as a last-minute passport, costs next to nothing (30 PLN which is ~7/8 USD)

Harel Shein (harel.shein@gmail.com)
2023-12-04 17:09:21

*Thread Reply:* wow, that’s impressive!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 17:10:19

*Thread Reply:* yeah, many people are surprised how developed our public service may be

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 06:44:04

*Thread Reply:* tbh it's always random, can be good can be shit 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 06:44:21

*Thread Reply:* lately it's definitely been better than 10 years ago tho

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 15:34:09
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 14:38:30

https://blog.datahubproject.io/extracting-column-level-lineage-from-sql-779b8ce17567 https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 14:40:02

*Thread Reply:* [6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information. 🙂

Harel Shein (harel.shein@gmail.com)
2023-12-04 17:00:33

*Thread Reply:* it’s open source, should we consider testing it out?

Harel Shein (harel.shein@gmail.com)
2023-12-04 17:00:53

*Thread Reply:* I’m not sure about the methodology, but these numbers are pretty significant

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-04 17:06:43

*Thread Reply:* We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 04:09:18

*Thread Reply:* More doctors smoke camels than any other cigarette 😉 If you test on BigQuery, you will not get comparable results for Snowflake, for example.

Wondering if we can do anything about this. We could write a blog post on lineage extraction from Snowflake SQL queries. This is something we spent time on and possibly we support dialect specific queries that others don't.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 04:12:45

*Thread Reply:* it all comes down to the question of whether we should start publishing comparisons

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 04:18:46

*Thread Reply:* We can also accept schema information in our SQL lineage parser. Actually, this would be a good idea, I believe.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 04:24:59

*Thread Reply:* for the select * use case?

👍 Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2023-12-04 15:42:36

Release vote is here when you get a moment: https://openlineage.slack.com/archives/C01CK9T7HKR/p1701722066253149

Michael Robinson (https://openlineage.slack.com/team/U02LXF3HUN7)
👍 Harel Shein
Kacper Muda (kacper.muda@getindata.com)
2023-12-05 05:15:18

@Kacper Muda has joined the channel

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 05:15:30

Should we disable openlineage-airflow on Airflow 2.8 to force people to use provider?

👍 Kacper Muda, Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:12:52

*Thread Reply:* it sounds like maybe something about this should be included in the 2.8 docs. The dev rel team is talking about the marketing around 2.8 right now…

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:14:48

*Thread Reply:* also, the release will evidently be out next Thursday

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 11:15:38

*Thread Reply:* I mean, openlineage-airflow is not part of Airflow

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 11:15:50

*Thread Reply:* We'd have provider for 2.8

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:16:40

*Thread Reply:* so maybe the airflow newsletter would be better

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:18:29

*Thread Reply:* is there anything about the provider that should be in the 2.8 marketing?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 11:19:04

*Thread Reply:* I don't think so

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:25:09

*Thread Reply:* Kenten wants to mention that it will be turned off in the 2.8 docs, so please lmk if anything about this changes

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 06:20:00
Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 14:32:00
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 14:36:27

*Thread Reply:* that’s weird ruff-lint found issues, especially when it has ruff version pinned

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 14:36:46

*Thread Reply:* CHANGELOG.md:10: acccording ==> according
this change is accurate though 🙂

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 14:45:42

*Thread Reply:* I tried to sneak in a fix in dev but the linter didn’t like it so I changed it back. All set now

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 14:45:58

*Thread Reply:* The release is in progress

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 14:47:05

*Thread Reply:* ah, gotcha
dev/get_changes.py:49:17: E722 Do not use bare `except`
dev/get_changes.py:49:17: S112 `try`-`except`-`continue` detected, consider logging the exception
for next time just add except Exception: instead of except: 🙂

👍 Michael Robinson
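
For reference, a small sketch of the pattern the two lint findings above point at: catch `Exception` rather than using a bare `except`, and log before `continue`. The function and data are made up for illustration and are not the contents of dev/get_changes.py.

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def parse_numbers(raw_values: Iterable[str]) -> List[int]:
    """Parse what can be parsed and skip the rest, logging each skip."""
    parsed: List[int] = []
    for raw in raw_values:
        try:
            parsed.append(int(raw))
        except Exception:  # a bare `except:` here would trigger E722
            # Logging before `continue` is what S112 asks for.
            logger.warning("Skipping unparsable value %r", raw, exc_info=True)
            continue
    return parsed


numbers = parse_numbers(["1", "x", "3"])
```

Unlike a bare `except:`, `except Exception:` lets `KeyboardInterrupt` and `SystemExit` propagate, so the loop can still be interrupted.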
Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 14:50:29

*Thread Reply:* GTK, thank you

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 14:50:02

The release-integration-flink job failed with this error message:
Execution failed for task ':examples:stateful:compileJava'.
> Could not resolve all files for configuration ':examples:stateful:compileClasspath'.
> Could not find io.**********************:**********************_java:1.6.0-SNAPSHOT.
Required by: project :examples:stateful

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 15:06:17

*Thread Reply:* No cache is found for key: v1-release-client-java--rOhZzScpK7x+jzwfqkQVwOVgqXO91M7VEEtzYHNvSmY=
Found a cache from build 155811 at v1-release-client-java-
is this standard behaviour?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 15:08:03

*Thread Reply:* well, same happened for 1.5.0 and it worked

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 15:09:11

*Thread Reply:* we gotta wait for Maciej/Pawel :<

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 18:07:20

*Thread Reply:* Looks like Gradle version got bumped and gives some problems

Michael Robinson (michael.robinson@astronomer.io)
2023-12-06 12:47:10

*Thread Reply:* Think we can release by midday tomorrow?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-06 12:57:31

*Thread Reply:* oh forgot about this totally

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 16:29:52

Feedback sought on a redesign of the ecosystem page that (hopefully) freshens and modernizes the page: https://github.com/OpenLineage/docs/pull/258

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 10:49:06
Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 11:19:29

1.6.1 release is in progress

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 11:23:41

*Thread Reply:* @Maciej Obuchowski the flink job failed again

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:34:24

*Thread Reply:* 😢

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:35:09

*Thread Reply:* well, at least it's a different error

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:39:27

*Thread Reply:* one more try? https://github.com/OpenLineage/OpenLineage/pull/2302 @Michael Robinson

✅ Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:22:32

*Thread Reply:* 1.6.2 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2304

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:22:55

*Thread Reply:* @Maciej Obuchowski 👆

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 12:24:45

*Thread Reply:* merged 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 12:25:05

*Thread Reply:* going out for a few hours, so next try would be tomorrow if it fails again...

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:14:14

*Thread Reply:* Thanks, Maciej. That worked, and 1.6.2 is out.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:20:12

*Thread Reply:* Great

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:04:15

Starting a thread for collaboration on the community meeting next week

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:05:27

*Thread Reply:* Releases: 1.6.2

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:09:51

*Thread Reply:* Open proposals: 2186, 2187, 2218, 2243, 2273, 2281, 2289, 2163, 2162, 2161

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:10:13

*Thread Reply:* 2023 recap/“best-of”?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:13:28

*Thread Reply:* @Harel Shein any thoughts? Also, does anyone know if Julien will be back from vacation?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:20:39

*Thread Reply:* We should probably try to do something with the Google proposal

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:21:00

*Thread Reply:* Not sure if it needs additional discussion, maybe just implementation?

Harel Shein (harel.shein@gmail.com)
2023-12-07 15:28:13

*Thread Reply:* I can ask him, but it would probably be good if you could facilitate next week @Michael Robinson?

👍 Michael Robinson
Harel Shein (harel.shein@gmail.com)
2023-12-07 15:29:43

*Thread Reply:* I agree that we need to address those Google proposals, we should ask Jens if he’s up for presenting and discussing them first?

✅ Michael Robinson
Harel Shein (harel.shein@gmail.com)
2023-12-07 15:30:48

*Thread Reply:* maybe Pawel wants to present progress with https://github.com/OpenLineage/OpenLineage/issues/2162?

👍 Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2023-12-12 12:14:40

*Thread Reply:* Still waiting on a response from Jens

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-12 12:22:11

*Thread Reply:* I think Jens does not have a lot of time now

Michael Robinson (michael.robinson@astronomer.io)
2023-12-12 12:25:30

*Thread Reply:* Emailed him in case he didn’t see the message

Michael Robinson (michael.robinson@astronomer.io)
2023-12-13 11:22:56

*Thread Reply:* Jens confirmed

🎉 Harel Shein, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-12-13 13:28:47

*Thread Reply:* He will have to join about 15 minutes late

Harel Shein (harel.shein@gmail.com)
2023-12-10 12:13:00

Hey, I have a meeting scheduled with a few Ray committers from Anyscale for Tuesday December 12th at 8pm CET / 2pm ET / 11am PT. Would anyone like to join? I think this would be a good entryway to lineage tracking for AI/LLM workflows. JFYI Ray is in Python.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-11 04:59:03

*Thread Reply:* would love to come but I'm at friend's birthday at that time 😐

Willy Lulciuc (willy@datakin.com)
2023-12-11 20:26:40

*Thread Reply:* I’d love to as well, but have dinner plans 😕

Harel Shein (harel.shein@gmail.com)
2023-12-11 20:27:02

*Thread Reply:* It’s 11am PT..

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 09:45:23

*Thread Reply:* count me in if not too late

Harel Shein (harel.shein@gmail.com)
2023-12-12 09:46:36

*Thread Reply:* not too late, adding you

Willy Lulciuc (willy@datakin.com)
2023-12-11 20:25:50

@Paweł Leszczyński mind giving this PR a quick look? https://github.com/MarquezProject/marquez/pull/2700 … it’s a dep on https://github.com/MarquezProject/marquez/pull/2698

👀 Paweł Leszczyński
Willy Lulciuc (willy@datakin.com)
2023-12-12 04:39:08

*Thread Reply:* thanks @Paweł Leszczyński for the +1 ❤️

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:36:32

@Jakub Dardziński: In Marquez, metrics are exposed via the endpoint /metrics using prometheus (most of the custom metrics defined are here). Oddly enough, the prometheus roadmap states that they have yet to adopt OpenMetrics! But, you can backfill the metrics into prometheus. So, knowing this, I would move to using metrics core by Dropwizard and use an exporter to export metrics to datadog using metrics-datadog. The one major benefit here is that we can define a framework around defining custom metrics internally within Marquez using core Dropwizard libraries, and then enable the reporter via configuration to emit metrics in marquez.yml. For example:
metrics:
  frequency: 1 minute # Default is 1 second.
  reporters:
    - type: datadog
      . .

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:41:26

*Thread Reply:* I tested this actually and it works the only thing is traces, I found it very poor to just have metrics around function name

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:43:07

*Thread Reply:* I totally agree, although I feel metrics and tracing are two separate things here

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:44:39

*Thread Reply:* I really appreciate your help and advice! 🙂

🙏 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-12-12 12:46:00

*Thread Reply:* Of course, happy to chime in here

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:47:02

*Thread Reply:* I’m just happy this is getting some much needed love 😄

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:51:15

*Thread Reply:* Also, it seems like datadog uses OpenTelemetry:
> Datadog Distributed Tracing allows you easily ingest traces via the Datadog libraries and agent or via OpenTelemetry
And looks like OpenTelemetry has support for Dropwizard

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:52:24

*Thread Reply:* yep, that's why I liked otel idea

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:54:11

*Thread Reply:* Also, here are the docs for DD + OpenTelemetry … so enabling OpenTelemetry in Marquez would be doable

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:54:40

*Thread Reply:* and we can make all of the configurable via marquez.yml

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:55:00

*Thread Reply:* hit me up with any questions! (just know, there will be a delay)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-03 16:11:56

*Thread Reply:* > and we can make all of the configurable via marquez.yml
it ain’t that easy - we would need to build extended jar with OTEL agent which I think is way too much work compared to benefits. you can still configure via env vars or system properties

Willy Lulciuc (willy@datakin.com)
2023-12-12 18:47:37

I’ve been looking into partitioning for psql, think there’s potential here for huge perf gains. Anyone have experience?

Willy Lulciuc (willy@datakin.com)
2023-12-12 18:50:50

*Thread Reply:* partition ranges will give a boost by default

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-13 05:05:12

*Thread Reply:* Which tables do you want to partition? Event ones?

Willy Lulciuc (willy@datakin.com)
2023-12-13 05:14:18

*Thread Reply:*
  • runs
  • job_versions
  • dataset_versions
  • lineage_events
  • and all the facets tables

👍 Maciej Obuchowski
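To make the range-partitioning idea concrete, here is a rough Python sketch that only generates PostgreSQL DDL strings for monthly partitions. It assumes the parent table (for example lineage_events) has already been declared with `PARTITION BY RANGE` on a timestamp column; the monthly granularity, the partition naming scheme, and the helper names are illustrative assumptions, not anything decided in this thread or implemented in Marquez.

```python
from datetime import date


def next_month(d):
    """Return the first day of the month following `d`."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_partition_ddl(parent, first, count):
    """Build CREATE TABLE ... PARTITION OF statements for `count` consecutive months.

    `parent` must already exist as a range-partitioned table (assumption).
    """
    statements = []
    start = first
    for _ in range(count):
        end = next_month(start)
        name = f"{parent}_{start:%Y_%m}"  # hypothetical naming scheme
        statements.append(
            f"CREATE TABLE {name} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
        start = end
    return statements
```

For example, `monthly_partition_ddl("lineage_events", date(2023, 12, 1), 2)` yields one statement for December 2023 and one for January 2024. Whether partitioning actually helps would still depend on queries filtering on the partition key.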
Willy Lulciuc (willy@datakin.com)
2023-12-12 20:03:11

@Paweł Leszczyński
  • PR 2682 approved with minor comments on stream versioning logic / suggestions ✅
  • PR 2654 approved with minor comment (we’ll want to do a follow up analysis on the query perf improvements) ✅

❤️ Harel Shein, Paweł Leszczyński
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-13 05:03:23

*Thread Reply:* Thanks @Willy Lulciuc. I applied all the recent comments and merged 2654.

There is one discussion left in 2682, which I would like to resolve before merging. I added extra comment on the implemented approach and I am open to get to know if this is the approach we can go with.

@Julien Le Dem @Maciej Obuchowski discussion is about when to create a new job version for a streaming job. No deep dive in the code is required to take part in it. https://github.com/MarquezProject/marquez/pull/2682#discussion_r1425108745

Willy Lulciuc (willy@datakin.com)
2023-12-13 18:08:11

*Thread Reply:* awesome, left my final thoughts 👍

Harel Shein (harel.shein@gmail.com)
2023-12-13 06:52:57

Maybe we should clarify the documentation on adding custom facets at the integration level? Wdyt? https://openlineage.slack.com/archives/C01CK9T7HKR/p1702446541936589?threadts=1702033180.635339&channel=C01CK9T7HKR&messagets=1702446541.936589

} Simran Suri (https://openlineage.slack.com/team/U069R6P724Q)
Kacper Muda (kacper.muda@getindata.com)
2023-12-13 08:23:49

Hey, i think it would help some people using Airflow integration (with Airflow 2.6) if we release a patch version of OL package with this PR included #2305. I am not sure what is the release cycle here, but maybe there is already an ETA on next patch release? If so, please let me know 🙂 Thanks !

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-13 08:25:13

*Thread Reply:* you gotta ask for the release in #general, 3 votes from committers approve immediate release 🙂

👍 Kacper Muda
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 05:14:10

*Thread Reply:* @Michael Robinson 3 votes are in 🙂

Kacper Muda (kacper.muda@getindata.com)
2023-12-14 05:22:04

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1702474416084989

} Kacper Muda (https://openlineage.slack.com/team/U062WKNK3LK)
Michael Robinson (michael.robinson@astronomer.io)
2023-12-14 12:42:10

*Thread Reply:* Thanks for the ping. I replied in #general and will initiate the release as soon as possible.

Harel Shein (harel.shein@gmail.com)
2023-12-14 14:16:21

seems that we don’t output the correct namespace as in the naming doc for Kafka. we output the kafka server/broker URL as namespace (in the Flink integration specifically) https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#kafka

Harel Shein (harel.shein@gmail.com)
2023-12-14 18:01:15

*Thread Reply:* @Paweł Leszczyński, would you be able to add the Kafka: prefix to the Kafka visitors in the flink integration tomorrow?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 02:51:13

*Thread Reply:* I am happy to do this. Just to make sure: the docs are correct, and the flink implementation is missing the kafka:// prefix, right?

Harel Shein (harel.shein@gmail.com)
2023-12-15 05:46:57

*Thread Reply:* Exactly

👍 Paweł Leszczyński
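A small Python sketch of the naming rule being discussed: the dataset namespace carries the kafka:// prefix in front of the broker address, and the dataset name is the topic. The function names are hypothetical and this is only an illustration; the actual fix belongs in the Java Kafka visitors of the Flink integration.

```python
def kafka_namespace(bootstrap_server):
    """Return the dataset namespace for a broker, adding kafka:// if it is missing."""
    server = bootstrap_server.strip()
    if server.startswith("kafka://"):
        return server
    return f"kafka://{server}"


def kafka_dataset(bootstrap_server, topic):
    """Build a namespace/name pair for a Kafka topic."""
    return {"namespace": kafka_namespace(bootstrap_server), "name": topic}
```

With this, a broker given as `broker:9092` and one already written as `kafka://broker:9092` resolve to the same namespace, which is what keeps datasets from different integrations lined up.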
Harel Shein (harel.shein@gmail.com)
2023-12-15 08:24:47

*Thread Reply:* Thanks @Paweł Leszczyński. made a couple of suggestions, but we can def merge without

Harel Shein (harel.shein@gmail.com)
2023-12-14 14:16:29

we’re also missing Iceberg naming schema

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 02:53:21

*Thread Reply:* would love to discuss this first. If a user stores an iceberg table in S3, then should it conform to S3 naming or iceberg naming?

it's S3 location which defines a dataset. iceberg is a format for accessing data but not identifier as such.

Harel Shein (harel.shein@gmail.com)
2023-12-15 05:57:19

*Thread Reply:* No rush, just something we noticed and that some people in the community are implementing their own patch for it.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 14:26:24

my next year goal is to have programmatic way of using naming convention

👍 Julien Le Dem, Harel Shein
❤️ Willy Lulciuc, Paweł Leszczyński
Harel Shein (harel.shein@gmail.com)
2023-12-14 15:59:26

anyone heard of https://bitol.io/?

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:17:55

*Thread Reply:* nope but would be worth reaching out to them to see how we could collaborate? they’re part of the LFAI (sandbox): https://github.com/bitol-io/open-data-contract-standard

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:18:39

*Thread Reply:* background https://medium.com/profitoptics/data-contract-101-568a9adbf9a9

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:20:17

*Thread Reply:* ugh it’s yaml-based

Harel Shein (harel.shein@gmail.com)
2023-12-14 17:54:05

*Thread Reply:* We should still have a conversation:)

Willy Lulciuc (willy@datakin.com)
2023-12-14 19:20:29

*Thread Reply:* @Michael Robinson soft ping 😉

Harel Shein (harel.shein@gmail.com)
2023-12-14 17:53:38
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 18:10:49

*Thread Reply:* is there any specific channel/conversation?

Harel Shein (harel.shein@gmail.com)
2023-12-14 19:50:05

*Thread Reply:* Yeah, but it’s private. Added you. For everyone else, Ping me on slack when you join and I’ll add you.

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 12:07:39

A vote to release Marquez 0.43.0 is open. We need one more: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1702657403267769

🙌 Willy Lulciuc, Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 14:28:28

*Thread Reply:* the changelog PR is RFR

Willy Lulciuc (willy@datakin.com)
2023-12-15 13:19:49

AWS is making moves! https://github.com/aws-samples/aws-mwaa-openlineage

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 13:27:59

*Thread Reply:* the repo itself is pretty old, last updated 2mo ago and used OL package not provider (1.4.1)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 13:28:13

*Thread Reply:* still it's nice they're doing this :)

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 15:27:26

*Thread Reply:* since they’re using MWAA they won’t be affected by the turn-off coming with Airflow 2.8 for a while. Otherwise that would be a good excuse to get in touch with them

🙏 Willy Lulciuc
Shubham Mehta (shubhammehta.93@gmail.com)
2023-12-23 03:54:50

*Thread Reply:* I think this repo was related to the blog which was authored a while back - https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/ No other moves from our end so far, at least MWAA team : )

👍 Jakub Dardziński, Willy Lulciuc, Maciej Obuchowski
Paul Wilson Villena (pgvillena@gmail.com)
2024-03-06 23:53:51

*Thread Reply:* Hi All, I am one of the owners of this repo and working to update this to work with MWAA 2.8.1, with apache-airflow-providers-openlineage==1.4.0. I am facing an issue with my set-up. I am using Redshift SQL as a sample use-case for this and getting an error relating to the Default Extractor. Haven't really looked at this in much detail yet but wondering if you have thoughts? I just updated the env variables to use AIRFLOW__OPENLINEAGE__TRANSPORT and AIRFLOW__OPENLINEAGE__NAMESPACE and changed the operator from PostgresOperator to SQLExecuteQueryOperator.
[2024-03-07 03:52:55,496] Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.base.DefaultExtractor object at 0x7fc4aa1e3950> - section/key [openlineage/disabled_for_operators] not found in config task_type=SQLExecuteQueryOperator airflow_dag_id=rs_source_to_staging task_id=task_insert_event_data airflow_run_id=manual__2024-03-07T03:52:11.634313+00:00
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,499] Executing: insert into event SELECT eventid, venueid, catid, dateid, eventname, starttime::TIMESTAMP FROM s3_datalake.event;

🙌 Shubham Mehta
Kacper Muda (kacper.muda@getindata.com)
2024-03-08 06:04:50

*Thread Reply:* I'll look into it 🙂

Kacper Muda (kacper.muda@getindata.com)
2024-03-08 06:57:39

*Thread Reply:* @Paul Wilson Villena It looks like a small mistake in the OL, that I'll fix in the next version - we missed adding a fallback there, and getting the airflow configuration raises an error when disabled_for_operators is not defined in the airflow.cfg file / the env variable. For now it should help to simply add the [openlineage] section (https://airflow.apache.org/docs/apache-airflow-providers-openlineage/1.4.0/configurations-ref.html#id1) to airflow.cfg and set disabled_for_operators="", or just export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""

🙌 Paul Wilson Villena
Kacper Muda (kacper.muda@getindata.com)
2024-03-08 07:56:15

*Thread Reply:* Will be released in the next provider version: https://github.com/apache/airflow/pull/37994

🙌 Jakub Dardziński, Paul Wilson Villena
🙏 Shubham Mehta, Paul Wilson Villena
Paul Wilson Villena (pgvillena@gmail.com)
2024-03-09 07:56:31

*Thread Reply:* Hi @Kacper Muda it seems I also need to set this:
os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"]=""
Otherwise this error persists:
section/key [openlineage/config_path] not found in config

Kacper Muda (kacper.muda@getindata.com)
2024-03-11 03:57:07

*Thread Reply:* Yes, sorry for missing that. I fixed in the code and forgot to mention it. If You were to not use AIRFLOW__OPENLINEAGE__TRANSPORT You'd have to set it to empty string as well, as it's missing the same fallback 🙂
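Putting the workaround from this thread together as a Python sketch: it sets each of the three options named above to an empty string unless it is already defined. The variable names come from the messages in this thread; the helper name and the idea of applying them through one function are assumptions, and it would have to run before Airflow reads its configuration.

```python
import os

# Options reported in this thread as lacking a fallback in
# apache-airflow-providers-openlineage 1.4.0; an empty string stands in for "unset".
WORKAROUND_DEFAULTS = {
    "AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS": "",
    "AIRFLOW__OPENLINEAGE__CONFIG_PATH": "",
    "AIRFLOW__OPENLINEAGE__TRANSPORT": "",
}


def apply_openlineage_workaround(environ=os.environ):
    """Define the missing options without overwriting values that are already set."""
    for key, value in WORKAROUND_DEFAULTS.items():
        environ.setdefault(key, value)
    return environ
```

Using `setdefault` means an existing AIRFLOW__OPENLINEAGE__TRANSPORT value is left alone, so only the genuinely missing keys get the empty-string placeholder.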

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 14:47:24

The release is finished. Slack post, etc., coming soon

❤️ Harel Shein, Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2023-12-15 16:24:52

have we thought of making the SQL parser pluggable?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 16:35:47

*Thread Reply:* what do you mean by that?

Willy Lulciuc (willy@datakin.com)
2023-12-15 16:43:51

*Thread Reply:* (this is coming from apple) like what if a user wanted to provide their own parser for SQL in place of the one shipped with our integrations

Willy Lulciuc (willy@datakin.com)
2023-12-15 16:44:27

*Thread Reply:* for example, if/when we integrate with DataHub, can they use their parser instead of the one provided

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 17:33:30

*Thread Reply:* that would be difficult, we would need strong motivation for that 🫥

👍 Willy Lulciuc
Harel Shein (harel.shein@gmail.com)
2023-12-18 12:15:59

This question fits with what we said we would try to document more, can someone help them out with it this week? https://openlineage.slack.com/archives/C063PLL312R/p1702683569726449

} Alia Nabawy (https://openlineage.slack.com/team/U05HTTNJY8Z)
Michael Robinson (michael.robinson@astronomer.io)
2023-12-18 14:15:48

Airflow 2.8 has been released. Are we still “turning off” the external Airflow integration with this one? What do Airflow users need to know to avoid unpleasant surprises? Kenten is open to including a note in the 2.8 blog post.

Kacper Muda (kacper.muda@getindata.com)
2023-12-19 08:52:59

*Thread Reply:* As a newcomer here, I believe it would be wise to avoid supporting Airflow 2.8+ in the openlineage-airflow package. This approach would encourage users to transition to the provider package. It's important to clearly communicate that ongoing development and enhancements will be focused on the apache-airflow-providers-openlineage package, while the openlineage-airflow will primarily be updated for bug fixes. I'll look into whether this strategy is already noted in the documentation. If not, I will propose a documentation update.

➕ Harel Shein, Jakub Dardziński
:gratitude_thank_you: Michael Robinson
Kacper Muda (kacper.muda@getindata.com)
2023-12-21 08:10:17

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2330 Please let me know if some changes are required, i was not sure how to properly implement it.

Harel Shein (harel.shein@gmail.com)
2023-12-19 09:51:18

This looks cool, might be useful for us? https://github.com/aklivity/zilla

Harel Shein (harel.shein@gmail.com)
2023-12-19 09:54:26

*Thread Reply:* the license is a bit weird, but should be ok for us. it’s apache, unless you directly compete with the company that built it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:32:34

*Thread Reply:* tbh not sure how

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:33:43

*Thread Reply:* I think we should be focused on 1) being compatible with most popular solutions (kafka...) 2) being easy to integrate with (custom transports)

rather than forcing our opinionated way on how OpenLineage events should flow in customer architecture

Michael Robinson (michael.robinson@astronomer.io)
2023-12-19 10:09:42

Apologies for having to miss today’s committer sync — I’ll be picking up my daughter from school

Harel Shein (harel.shein@gmail.com)
2023-12-20 14:13:00

WDYT about starting to add integration specific channels and adding a little welcome bot for people when they join?

  • spark-openlineage-dev
  • spark-openlineage-users
  • airflow-openlineage-dev
  • airflow-openlineage-users
  • flink-openlineage-dev
  • flink-openlineage-users etc…
👀 Michael Robinson
Willy Lulciuc (willy@datakin.com)
2023-12-20 15:10:22

*Thread Reply:* the -dev and -users seems like overkill, but also understand that we may want to split user questions from development

Willy Lulciuc (willy@datakin.com)
2023-12-20 15:11:55

*Thread Reply:* maybe just shorten to spark-integration , flink-integration , etc. Or integrations-spark etc

Willy Lulciuc (willy@datakin.com)
2023-12-20 15:37:24

*Thread Reply:* we probably should consider a development and welcome channel

Harel Shein (harel.shein@gmail.com)
2023-12-20 17:19:00

*Thread Reply:* yeah.. makes sense to me. let’s leave this thread open for a few days so more people can chime in and then I’ll make a proposal based on that.

👍 Willy Lulciuc
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-21 01:57:27

*Thread Reply:* makes sense to me

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 05:18:53

*Thread Reply:* I think there is not enough discussion for it to make sense

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 05:19:01

*Thread Reply:* and empty channels do not invite to discussion

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 09:01:22

*Thread Reply:* Maybe worth it for spark questions alone? And then for equal coverage we need the others. It’s getting easy to overlook questions in general due to the increased volume and long code snippets, IMO.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:23:22

*Thread Reply:* yeah I think the volume is still quite low

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:24:15

*Thread Reply:* something like Airflow's #troubleshooting channel easily has order of magnitude more messages

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:24:41

*Thread Reply:* and even then, I'd split between something like #troubleshooting and #development rather than between integrations

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:25:13

*Thread Reply:* not only because it's too granular, but also there's a development that isn't strictly integration related or touches multiple ones

👍 Willy Lulciuc
Michael Robinson (michael.robinson@astronomer.io)
2023-12-20 14:45:33

Link to the vote to release the hot fix in Marquez: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1703101476368589

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 09:04:15

For the newsletter this time around, I’m thinking that a year-end review issue might be nice in mid-January when folks are back from vacation. And then a “double issue” at the end of January with the usual updates. We’ve still got a rather, um, “select” readership, so the stakes are low. If you have an opinion, please lmk.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 09:06:17

*Thread Reply:* I’m for mid-January option

👍 Michael Robinson, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 11:33:40

1.7.0 changelog PR needs a review: https://github.com/OpenLineage/OpenLineage/pull/2331

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 12:48:58

Notice for the release notes (WDYT?):
COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0. Please use the OpenLineage Airflow Provider instead.
It includes a link to here: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html

👍 Kacper Muda
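As a rough Python illustration of the cutoff in that notice: Airflow 2.8.0 and later map to the provider package, older versions to the external integration. The helper names are hypothetical and the sketch encodes only the 2.8.0 boundary stated in the notice; it says nothing about which earlier Airflow versions the provider itself supports.

```python
def parse_version(version):
    """Parse 'X.Y.Z' into a 3-tuple of ints, ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '2.8' to (2, 8, 0) so tuple comparison behaves
        parts.append(0)
    return tuple(parts)


def recommended_package(airflow_version):
    """Pick the OpenLineage package name based on the 2.8.0 cutoff in the notice."""
    if parse_version(airflow_version) >= (2, 8, 0):
        return "apache-airflow-providers-openlineage"
    return "openlineage-airflow"
```

For instance, `recommended_package("2.6.3")` returns the external integration while `recommended_package("2.8.0")` returns the provider.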
Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-22 12:46:44

https://eu.communityovercode.org/ is that proper conference to talk about OL?

Willy Lulciuc (willy@datakin.com)
2023-12-22 16:06:32

*Thread Reply:* Yes! At least to a broader audience

Shubham Mehta (shubhammehta.93@gmail.com)
2023-12-23 03:52:59

@Shubham Mehta has joined the channel

Harel Shein (harel.shein@gmail.com)
2024-01-02 08:50:13

Happy new year all!

🎉 Kacper Muda, Maciej Obuchowski, Jakub Dardziński, Paweł Leszczyński, Willy Lulciuc
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-02 12:28:58

Am I the only one that sees Free trial in progress here in OL slack?

Kacper Muda (kacper.muda@getindata.com)
2024-01-02 12:32:25

*Thread Reply:* I see it too:

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:04:34

*Thread Reply:* same for me, I think Slack initiated that

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:04:37

*Thread Reply:* or @Michael Robinson?

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:05:01

*Thread Reply:* we’re on the Slack Pro trial (we were on the free plan before)

Michael Robinson (michael.robinson@astronomer.io)
2024-01-02 13:07:18

*Thread Reply:* I think Slack initiated it

Harel Shein (harel.shein@gmail.com)
2024-01-03 15:37:11

Spotted!

🔥 Shubham Mehta
❤️ Willy Lulciuc
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-03 15:44:21

*Thread Reply:* a moment earlier it makes more context

😂 Harel Shein, Maciej Obuchowski
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-04 08:29:30

https://github.com/OpenLineage/OpenLineage/issues/2349 - this issue is really interesting. I am hoping to see follow-up from David.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 10:27:12

So who wants to speak at our meetup with Confluent in London on Jan. 31st?

🙂 Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-01-04 11:08:20

*Thread Reply:* do we have sponsorship to fly over?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:09:39

*Thread Reply:* Not currently but I can look into it

Harel Shein (harel.shein@gmail.com)
2024-01-04 11:11:17

*Thread Reply:* do we have other active community members based in the UK?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:12:00

*Thread Reply:* Yes

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:12:07

*Thread Reply:* I’ll ask around

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 12:42:16

*Thread Reply:* Abdallah Terrab at Decathlon has volunteered

Harel Shein (harel.shein@gmail.com)
2024-01-04 12:43:34

*Thread Reply:* woohoo!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:16:55

*Thread Reply:* does it mean we're still looking for someone?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:17:19

*Thread Reply:* I already said last month that I'd go, but not sure if it's still needed

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:17:43

*Thread Reply:* and does Confluent have a talk there?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 11:52:04

*Thread Reply:* Sorry about that, Maciej. I’ll ask Viraj if Astronomer would cover your ticket. There will be a Confluent speaker.

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:53:54

*Thread Reply:* if we need to choose between kafka summit and a meetup - I think we should go for kafka summit 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:58:21

*Thread Reply:* I think so too

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 13:12:51

*Thread Reply:* Viraj has requested approval for summit, and we can expect to hear from finance soon

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 13:18:01

*Thread Reply:* Also, question from him about the talk: what does “streaming” refer to in the title — Kafka only?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 13:20:26

*Thread Reply:* Kafka, Spark, Flink

👍 Michael Robinson, Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2024-01-08 15:13:41

*Thread Reply:* If someone wants to do a joint talk let me know 😉

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 15:55:57

*Thread Reply:* @Willy Lulciuc will you be in UK then?

Willy Lulciuc (willy@datakin.com)
2024-01-08 16:02:10

*Thread Reply:* I can be, if astronomer approves? but also realizing it’s Jan 31st, so a bit tight

Harel Shein (harel.shein@gmail.com)
2024-01-08 16:16:30

*Thread Reply:* yeah, that sounds… unlikely 🙂

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2024-01-08 16:17:16

*Thread Reply:* also thinking about all the traveling I've been doing recently and things I need to work on. would be great to have some focus time

Harel Shein (harel.shein@gmail.com)
2024-01-04 12:42:33

did anyone submit a talk to https://www.databricks.com/dataaisummit/call-for-presentations/?

Harel Shein (harel.shein@gmail.com)
2024-01-04 13:27:05

*Thread Reply:* tagging @Julien Le Dem on this one. since the deadline is tomorrow.

Julien Le Dem (julien@apache.org)
2024-01-04 22:16:01

*Thread Reply:* I don’t think I did.

Julien Le Dem (julien@apache.org)
2024-01-04 22:17:07

*Thread Reply:* I don’t have my computer with me. ⛷️

Julien Le Dem (julien@apache.org)
2024-01-04 22:18:06

*Thread Reply:* Does @Willy Lulciuc want to submit? Happy to be co-speaker (if you want. But not necessary)

Harel Shein (harel.shein@gmail.com)
2024-01-05 08:39:14

*Thread Reply:* Willy is also on vacation, I’m happy to submit for the both of us. I’ll try to get something out today

👍 Julien Le Dem
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:23:52

*Thread Reply:* would be great to have something at the Databricks conference 🙂

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:25:50

*Thread Reply:* agreed. ideas welcome!

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:26:34

*Thread Reply:* what I’m currently thinking. learnings from openlineage adoption in Airflow and Flink, and what can be learned / applied on Spark lineage.

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:26:37

*Thread Reply:* catchy title!

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 13:22:12

This month’s TSC meeting is next Thursday. Anyone have any items to add to the agenda?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 14:10:20

*Thread Reply:* @Kacper Muda would you want to talk about doc changes in Airflow provider maybe?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 14:13:36

*Thread Reply:* no pressure if it's too late for you 🙂

Kacper Muda (kacper.muda@getindata.com)
2024-01-04 14:16:08

*Thread Reply:* It's fine, I could probably mention something about it - the problem is that I have a standing commitment every Thursday from 5:30 to 7:30 PM (Polish time, GMT+1) which means I'm unable to participate. 😞

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 15:46:35

*Thread Reply:* @Kacper Muda would you be open to recording something? We could play it during the meeting. Something to consider if you’d like to participate but the time doesn’t work.

Kacper Muda (kacper.muda@getindata.com)
2024-01-05 10:04:33

*Thread Reply:* Let me see how much time I'll have during the weekend and get back to you 🙂

👍 Michael Robinson
Kacper Muda (kacper.muda@getindata.com)
2024-01-08 02:48:35

*Thread Reply:* Sorry, I got sick and won't be able to do it. Maybe I'll try to make it in person to the next meeting; by then the docs should already be released 🙂

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-01-10 13:56:46

*Thread Reply:* @Julien Le Dem are there updates on open proposals that you would like to cover?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-10 13:58:54

*Thread Reply:* @Paweł Leszczyński as you authored about half of the changes in 1.7.0, would you be willing to talk about the Flink fixes? No slides necessary

Julien Le Dem (julien@apache.org)
2024-01-10 19:21:39

*Thread Reply:* @Michael Robinson no updates from me

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-11 07:56:57

*Thread Reply:* sry @Michael Robinson, I won't be able to join this time. My changes in 1.7.0 are rather small fixes. Perhaps someone else can present them briefly.

👍 Michael Robinson
Harel Shein (harel.shein@gmail.com)
2024-01-04 16:28:04

I saw some weird behavior with openlineage-airflow where it does not respect the transport config for the client, even when OPENLINEAGE_CONFIG is set to point to a config file. the workaround is to also set the OPENLINEAGE_URL env var; then it reads the config. this bug doesn’t seem to exist in the airflow provider since the loading method is completely different.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-05 10:08:32

*Thread Reply:* would you mind creating an issue?

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:23:58

*Thread Reply:* will do. let me repro on the latest version of openlineage-airflow and see if I can repro on the provider

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:25:05

*Thread Reply:* config is definitely more complex than it needs to be...

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:29:44

*Thread Reply:* hmm.. so in 1.7.0 if you define OPENLINEAGE_URL then it completely ignores whatever is in OPENLINEAGE_CONFIG yaml

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:30:47

*Thread Reply:* but….

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:31:09

*Thread Reply:* if you don’t define OPENLINEAGE_URL, and you do define OPENLINEAGE_CONFIG: then openlineage is disabled 😂

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:33:31

*Thread Reply:* ok, I found the bug

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:33:38

*Thread Reply:* are you sure OPENLINEAGE_CONFIG points to something valid?

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:33:41

*Thread Reply:* it’s in transport factory

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:34:44

*Thread Reply:* ~look at the code flow for create:~ ~the default factory doesn’t supply config, so it tried to set http config from env vars. if that doesn’t work, it just returns the console transport~

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:35:11

*Thread Reply:* oh, no nvm. it’s a different flow.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:35:14

*Thread Reply:* not exactly

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:35:31

*Thread Reply:* but yes, OPENLINEAGE_CONFIG works for sure

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:36:00

*Thread Reply:* on 0.21.1, it was working when the OPENLINEAGE_URL was supplied

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:36:16

*Thread Reply:* transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
this self.config actually looks at
@property
def config(self) -> dict[str, Any]:
    if self._config is None:
        self._config = load_config()
    return self._config
which then uses this:
```
def load_config() -> dict[str, Any]:
    file = _find_yaml()
    if file:
        try:
            with open(file) as f:
                config: dict[str, Any] = yaml.safe_load(f)
                return config
        except Exception:  # noqa: BLE001, S110
            # Just move to read env vars
            pass
    return defaultdict(dict)


def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system

    # Check current working directory:
    try:
        cwd = os.getcwd()
        if "openlineage.yml" in os.listdir(cwd):
            return os.path.join(cwd, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        pass  # We can get different errors depending on system

    # Check $HOME/.openlineage dir
    try:
        path = os.path.expanduser("~/.openlineage")
        if "openlineage.yml" in os.listdir(path):
            return os.path.join(path, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        # We can get different errors depending on system
        pass
    return None
```
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:37:16

*Thread Reply:* oh I think I see

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:37:40

*Thread Reply:* so this isn't passed if you have config but there is no transport field in this config transport_config = None if "transport" not in self.config else self.config["transport"] self.transport = factory.create(transport_config)
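
A minimal sketch of the fallback discussed in this thread, assuming the loader swallows every read or parse error the way the quoted load_config does. It is a simplified stand-in and not the client's code: `load_config_sketch`, `pick_transport_config` and `write_temp` are hypothetical names, and JSON replaces YAML only so the example runs on the standard library alone.

```python
import json
import os
import tempfile
from collections import defaultdict
from typing import Any, Optional


def load_config_sketch(path: Optional[str]) -> Any:
    """Hypothetical stand-in for the quoted load_config: any error is swallowed."""
    if path:
        try:
            with open(path) as f:
                return json.load(f)  # JSON here; the real client parses YAML
        except Exception:
            pass  # a broken or unreadable file silently becomes "no config"
    return defaultdict(dict)


def pick_transport_config(config: Any) -> Optional[Any]:
    """Mirrors the quoted expression: None when there is no 'transport' key."""
    return None if "transport" not in config else config["transport"]


def write_temp(text: str) -> str:
    """Demo helper: write text to a temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    return path


good = write_temp('{"transport": {"type": "console"}}')
bad = write_temp('{"transport": {"type": "console"')  # truncated, fails to parse

print(pick_transport_config(load_config_sketch(good)))  # {'type': 'console'}
print(pick_transport_config(load_config_sketch(bad)))   # None
print(pick_transport_config(load_config_sketch(None)))  # None
```

In this sketch a file that fails to parse and a file that was never found look the same to the caller, so the factory would receive None in both cases without anything being logged.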

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:38:22

*Thread Reply:* here’s the config I’m using:
```
transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    security.protocol: "SSL"
    # CA certificate file for verifying the broker's certificate.
    ssl.ca.location=ca-cert
    # Client's certificate
    ssl.certificate.location=client_?????_client.pem
    # Client's key
    ssl.key.location=client_?????_client.key
    # Key password, if any.
    ssl.key.password=abcdefgh
  topic: "SOF0002248-afaas-lineage-DEV-airflow-lineage"
  flush: True
```

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:38:35

*Thread Reply:* it should load, but fail when actually trying to emit to kafka

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:38:47

*Thread Reply:* but it should still init the transport

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:39:22

*Thread Reply:* 🤔

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:40:09

*Thread Reply:* no logs?

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:40:24

*Thread Reply:* I’m testing on this image: ```FROM quay.io/astronomer/astro-runtime:6.4.0

COPY openlineage.yml /usr/local/airflow/

ENV OPENLINEAGE_CONFIG=/usr/local/airflow/openlineage.yml

ENV AIRFLOW__CORE__LOGGING_LEVEL=DEBUG

ENV OPENLINEAGE_URL=http://foo.bar/```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:40:39

*Thread Reply:* are you sure there are no permission errors?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:41:02

*Thread Reply:*
def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system
it checks stuff like os.access(path, os.R_OK)
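
One way to rule out the permission question is to repeat the same checks by hand inside the container. This is only a sketch: `explain_config_path` is a hypothetical helper, not part of the client, and it covers just the `os.path.isfile` and `os.access` checks quoted above.

```python
import os
from typing import List, Optional


def explain_config_path(path: Optional[str]) -> List[str]:
    """Return the reasons a path would be rejected by the quoted checks."""
    if not path:
        return ["OPENLINEAGE_CONFIG is unset or empty"]
    problems = []
    if not os.path.isfile(path):
        problems.append(f"{path} is not an existing regular file")
    elif not os.access(path, os.R_OK):
        problems.append(f"{path} exists but is not readable by this user")
    return problems


issues = explain_config_path(os.getenv("OPENLINEAGE_CONFIG"))
print(issues or "path passes the isfile and R_OK checks")
```

An empty result only says the file would be picked up; it says nothing about whether its contents parse.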

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:41:02

*Thread Reply:* I’m sure, because if I uncomment ENV OPENLINEAGE_URL=http://foo.bar/ on 0.21.1, it works

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:41:29

*Thread Reply:* ah, I can add a permissive chmod to the dockerfile to see if it helps

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:41:40

*Thread Reply:* but I’m also not seeing any log/exception in the task logs

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:47:49

*Thread Reply:* will look at this later if you don't find a solution 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:48:09

*Thread Reply:* one more thing, can you try with just transport: type: console

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:52:49

*Thread Reply:* I tried that too

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:52:51

*Thread Reply:* didn’t work

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:53:27

*Thread Reply:* but yeah, there’s something not great about separation of concerns between client config and adapter config

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 12:04:00

*Thread Reply:* adapter should not care, unless you're using MARQUEZ_URL... which is backwards compatibility for when it was still marquez airflow integration

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 11:50:11

I’m starting to put together the year-in-review issue of the newsletter and wonder if anyone has thoughts on the “big stories” of 2023 in OpenLineage. So far I’ve got:
• Launched the Airflow Provider
• Added static AKA design lineage
• Welcomed new ecosystem partners (Google, Metaphor, Grai, Datahub)
• Started meeting up and held events with Metaphor, Google, Collibra, etc.
• Graduated from the LFAI
What am I missing? Wondering in particular about features. Is iceberg support in Flink a “big” enough story? Custom transport types? SQL parser improvements?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-08 06:37:27

Blog post content about contributing to openlineage-spark and code internals. The content comes from the November meetup at Google, and I split it into two posts: https://docs.google.com/document/d/1Hu6clFckse1J_M1w2MMaTTJS0wUihtFsxbDQchtTVtA/edit?usp=sharing

❤️ Jakub Dardziński
Kacper Muda (kacper.muda@getindata.com)
2024-01-08 11:22:32

Does anyone remember why execution_date was chosen as part of the run_id for an Airflow task, instead of, for example, start_date? Due to this decision, we can encounter duplicate run_id if we delete the DagRun from the database, because the execution_date remains the same. If I run a backfill job for yesterday, then delete it and run it again, I get the same ids. I'm trying to understand the rationale behind this choice so we can determine whether it's a bug or a feature. 😉

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 11:40:45

*Thread Reply:* start_date is unreliable AFAIK, there can be no start date sometimes

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 11:40:54

*Thread Reply:* this might be true for only some version of airflow

Kacper Muda (kacper.muda@getindata.com)
2024-01-08 11:45:02

*Thread Reply:* Also here where they define combination of some deterministic attributes, execution_date is used and not start_date, so there might be something to it. That still leaves us with the behaviour I described.

Willy Lulciuc (willy@datakin.com)
2024-01-08 16:08:01

*Thread Reply:* > Due to this decision, we can encounter duplicate run_id if we delete the DagRun from the database, because the execution_date remains the same.
Hmm so given that OL runID uses the same params Airflow uses to generate the hash, this seems more like a limitation. Better question would be: if a user runs, deletes, then runs the same DAG again, is that an expected scenario we should handle? tl;dr yes, but Airflow hasn’t felt it important enough to address.
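
The duplicate ids described in this thread follow from any deterministic id scheme. A minimal sketch, assuming the run id is a name-based UUID over the DAG id, task id, execution_date and try number; the function name, the exact fields and the namespace are assumptions for illustration and not the integration's actual code.

```python
import uuid
from datetime import datetime, timezone


def sketch_run_id(dag_id: str, task_id: str, execution_date: datetime, try_number: int) -> str:
    """Hypothetical name-based id: the same inputs always give the same UUID."""
    name = f"{dag_id}.{task_id}.{execution_date.isoformat()}.{try_number}"
    return str(uuid.uuid3(uuid.NAMESPACE_URL, name))


logical_date = datetime(2024, 1, 7, tzinfo=timezone.utc)

first = sketch_run_id("my_dag", "my_task", logical_date, 1)
# DagRun deleted, backfill re-run for the same day: every input is unchanged.
second = sketch_run_id("my_dag", "my_task", logical_date, 1)
# A start_date-like value differs per attempt, so the id would differ too.
other = sketch_run_id(
    "my_dag", "my_task", datetime(2024, 1, 8, 9, 30, tzinfo=timezone.utc), 1
)

print(first == second)  # True
print(first == other)   # False
```

The trade-off is that a deterministic id can be recomputed anywhere from task metadata without storing it, at the cost of collisions when the same logical run is recreated.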

Harel Shein (harel.shein@gmail.com)
2024-01-08 12:28:50

https://www.snowflake.com/summit/call-for-papers/

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 12:36:09

*Thread Reply:* not concurrent with Databricks conference this year? 😂

Harel Shein (harel.shein@gmail.com)
2024-01-08 12:38:34

*Thread Reply:* nope, a week before so that everyone goes there and gets sick and can’t attend the databricks conf on the following week

😅 Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 12:40:34

*Thread Reply:* outstanding business move again

Harel Shein (harel.shein@gmail.com)
2024-01-08 15:55:12

*Thread Reply:* @Willy Lulciuc wanna submit to this?

Willy Lulciuc (willy@datakin.com)
2024-01-08 15:56:19

*Thread Reply:* yeah I’d love to: maybe something like “Detecting Snowflake table schema changes with OpenLineage events” + use cases + using lineage to detect impact?

Harel Shein (harel.shein@gmail.com)
2024-01-08 16:24:27

*Thread Reply:* yeah! that sounds like a fun talk!

Harel Shein (harel.shein@gmail.com)
2024-01-08 16:25:18

*Thread Reply:* idk if @Julien Le Dem will already be in 🇫🇷 that week? but I’d be happy to co-present if not.

🔥 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2024-01-08 16:26:52

*Thread Reply:* I don’t know when I’m flying out yet but it will be in the middle of that time frame.

Julien Le Dem (julien@apache.org)
2024-01-08 16:28:18

*Thread Reply:* +1 on Harel co-presenting :)

🙏 Willy Lulciuc
Julien Le Dem (julien@apache.org)
2024-01-08 16:29:09

*Thread Reply:* School last day is the 4th. I need to be in France (not jet lagged) before the 8th

Willy Lulciuc (willy@datakin.com)
2024-01-08 16:45:37

*Thread Reply:* ok, @Harel Shein I’ll work on getting a rough draft ready before the deadline (added to my TODO of tasks)

👍 Harel Shein
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-09 10:01:59

hey, I'm not feeling well, will probably skip today's meeting

Harel Shein (harel.shein@gmail.com)
2024-01-09 10:04:17

*Thread Reply:* get well soon ❤️‍🩹

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-09 12:55:04

Added comment to discussion about Spark parent job issue: https://github.com/OpenLineage/OpenLineage/issues/1672#issuecomment-1883524216 I think we have the consensus so I'll work on it.

Harel Shein (harel.shein@gmail.com)
2024-01-10 15:41:39

*Thread Reply:* @Maciej Obuchowski should we give an update on that at the TSC meeting tomorrow?

Harel Shein (harel.shein@gmail.com)
2024-01-10 15:52:33

*Thread Reply:* @Michael Robinson CC ^

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-09 12:56:51

Another issue: do you think we should somehow handle partitioning in the OpenLineage standard and integrations? I'm thinking of a situation where we somehow know how the dataset is partitioned, not about how to automagically detect that. Some example situations:

  1. Job reads particular partitions of a dataset. Should we indicate this, possibly as InputFacet?
  2. Run writes a particular partition to a dataset. Would it be useful information that a particular partition was written only by one run, while others were written by different runs, rather than treating the output dataset as a "global" modification that possibly changes all the data?
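
To make the first situation concrete, here is a sketch of what such an input facet could look like. Everything in it is hypothetical: no partition facet is defined in the discussion above, the class and field names are invented for illustration, and required facet metadata such as `_producer` and `_schemaURL` is left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class PartitionsInputFacetSketch:
    """Hypothetical facet: which partitions of a dataset a run actually read."""

    partitionColumns: List[str]
    partitions: List[Dict[str, str]] = field(default_factory=list)


def input_dataset_with_partitions(
    namespace: str, name: str, facet: PartitionsInputFacetSketch
) -> Dict[str, Any]:
    """Attach the hypothetical facet under the dataset's inputFacets."""
    return {
        "namespace": namespace,
        "name": name,
        "inputFacets": {"partitions": asdict(facet)},
    }


facet = PartitionsInputFacetSketch(
    partitionColumns=["ds"],
    partitions=[{"ds": "2024-01-08"}, {"ds": "2024-01-09"}],
)
dataset = input_dataset_with_partitions("warehouse", "public.orders", facet)
print(json.dumps(dataset, indent=2))
```

The second situation could reuse the same shape on the output side, so a consumer could tell which run wrote which partition.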
Harel Shein (harel.shein@gmail.com)
2024-01-10 15:53:30

*Thread Reply:* interesting. did you hear anyone with this usecase/requirement?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-12 10:04:24

Hey, there’s a Windows user getting this error when trying to run Marquez: org-apache-tomcat-jdbc-pool-connectionpool-unable-to-create-initial-connection. Is it a driver issue? I’ll try to get more details and reproduce it, but if you know what this probably is related to, please lmk

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-12 10:05:16

*Thread Reply:* do we even support Windows? 😱

Michael Robinson (michael.robinson@astronomer.io)
2024-01-12 10:50:00

*Thread Reply:* Here’s more info about the use case: "Thanks Michael, this is really helpful. So I am working on a project wherein I need to run Marquez and OpenLineage on top of Airflow DAGs which run dbt commands internally through BashOperator. I need to present to my org whether we are going to benefit from bringing in Marquez metadata lineage, so I was taking this approach of setting up Marquez first, then will explore how it integrates with Airflow using BashOperator."

Michael Robinson (michael.robinson@astronomer.io)
2024-01-12 10:50:16

*Thread Reply:* We don’t support this operator, right? What kind of graph can they expect?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-12 10:54:56

*Thread Reply:* They can use dbt integration directly maybe?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-16 05:30:24
Harel Shein (harel.shein@gmail.com)
2024-01-16 06:13:54

*Thread Reply:* Nice!! ❤️

Harel Shein (harel.shein@gmail.com)
2024-01-16 06:14:43

*Thread Reply:* We now have a former astronomer as engineering director at DataHub

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-16 06:43:02

*Thread Reply:* ?

Harel Shein (harel.shein@gmail.com)
2024-01-16 11:12:26

*Thread Reply:* https://www.linkedin.com/in/samantha-clark-engineer/

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:21:22

taking on letting users run integration tests

❤️ Harel Shein
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:22:07

*Thread Reply:* so there are two issues I think

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:22:37

*Thread Reply:* in Airflow workflows there are: filters: branches: ignore: /pull\/[0-9]+/ which are set only for PRs with forked repos

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:23:34

*Thread Reply:* in Spark there are tests that strictly require env vars (that contain credentials to various external systems like databricks or bigquery). if there are no such env vars tests fail which is confusing new committers

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:24:27

*Thread Reply:* first behaviour is silent - which I think is bad because it’s easy to skip integration tests, build is green (but should it be? we don’t know, integration tests didn’t run and someone needs to know and remember that before approving and merging)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:25:51

*Thread Reply:* second is misleading because it hints there’s something wrong in the code while there doesn’t necessarily need to be. on the other hand you shouldn’t approve and merge a failing build, so you see there’s some action required

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:27:21

*Thread Reply:* reg. action required: for now we’ve been running some manual step (using https://github.com/jklukas/git-push-fork-to-upstream-branch) which is a workaround but it’s not straightforward and requires manual work. it also doesn’t solve two issues I mentioned above

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:28:46

*Thread Reply:* what I propose is to simply add approval step before integration tests: https://circleci.com/docs/workflows/#holding-a-workflow-for-a-manual-approval

it’s a circleCI-only thing, so you need to log in to circleCI, check if there’s any pending task to approve and then approve or not

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:30:14

*Thread Reply:* it doesn’t allow for much configuration but I think it would work. you also can’t integrate it in any way with github UI (e.g. there’s no option to click something in PR’s UI to approve)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:31:39

*Thread Reply:* but that would let project maintainers manage when the code is safe to run and it’s still visible and (I think) readable for everyone

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:34:00

*Thread Reply:* the only thing I’m not sure about is who can approve

Anyone who has push access to the repository can click the **Approval** button to continue the workflow

but I’m not sure to which repo. if someone runs on fork and he/she has push access for fork - can he/she approve? it wouldn’t make sense..

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 14:37:34

*Thread Reply:* https://circleci.com/blog/access-control-cicd/

that’s the best I could find from circleCI on this subject

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:01:00

*Thread Reply:* so I think the best solution would be to:

  1. add approval steps only before integration tests
  2. enable Pass secrets to builds from forked pull requests (requires careful review of the CI process)
  3. make sure release and integration tests contexts are set properly for tasks
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:01:31

*Thread Reply:* contexts make it possible to let users run e.g. unit tests in CI without exposing credentials

Harel Shein (harel.shein@gmail.com)
2024-01-16 15:06:09

*Thread Reply:* this approach makes sense to me, assuming the permission model is how you outlined it

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:07:21

*Thread Reply:* one thing to add and test: approval steps could have condition to always run if it’s not from fork. not sure if that’s possible

Harel Shein (harel.shein@gmail.com)
2024-01-16 15:08:50

*Thread Reply:* that sounds like it should be doable within what's available in circle. GHA can definitely do that

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:09:17

*Thread Reply:* GHA can do things that circleCI can’t 😂

😂 Harel Shein, Paweł Leszczyński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-17 09:46:38

*Thread Reply:* > approval steps could have condition to always run if it’s not from fork. not sure if that’s possible
ffs it’s not that easy to set up

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-17 16:10:22

*Thread Reply:* whenever I touch circleCI base changes I feel like a magician. yet, here goes the PR with the magic (just look at the side effect, it bothered me for so long 🪄 ) 🚨 🚨 🚨 https://github.com/OpenLineage/OpenLineage/pull/2374

🙌 Paweł Leszczyński
Harel Shein (harel.shein@gmail.com)
2024-01-17 16:29:18

*Thread Reply:* I'm assuming the reason for the speedup in determine_changed_modules is that we don't go install yq every time this runs?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-17 16:35:08

*Thread Reply:* correct

Michael Robinson (michael.robinson@astronomer.io)
2024-01-16 15:52:55

What are your nominees/candidates for the most important releases of 2023? I’ll start (feel free to disagree with my choices, btw):
• 1.0.0
• 1.7.0 (disabled the external Airflow integration for 2.8+)
• 0.26.0 (Fluentd)
• 0.19.2 (column lineage in SQL parser)
• …

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:53:54

*Thread Reply:* > 1.7.0 (disabled the external Airflow integration for 2.8+)
that doesn’t sound like one of the most important

:gratitude_thank_you: Michael Robinson
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-16 15:55:49

*Thread Reply:* 1.0.0 had this: https://openlineage.io/docs/releases/1_0_0 which actually fixed spec to match JSON schema spec

👍 Michael Robinson
Gowthaman Chinnathambi (gowthamancdev@gmail.com)
2024-01-18 23:26:45

@Gowthaman Chinnathambi has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-01-19 16:18:50
Harel Shein (harel.shein@gmail.com)
2024-01-19 21:06:26

FYI: I'm trying to set up our meetings with the new LFAI tooling, you may get some emails. you can ignore for now.

👍 Michael Robinson, Maciej Obuchowski
Krishnaraj Raol (krishnaraj.raol@datavizz.in)
2024-01-22 00:46:57

@Krishnaraj Raol has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-01-22 10:45:06

@Harel Shein @Julien Le Dem @tati Python client releases on both Marquez and OpenLineage are failing because PyPI no longer supports password authentication. We need to configure the projects for Trusted Publishers or use an API token. I’ve looked and can’t find OpenLineage credentials for PyPI, but even if I had them we’d still need to set up 2FA in order to make the change. How should we proceed here? Should we sync to sort this out? Thanks (btw I reached out to Willy separately when this came up during a Marquez release attempt last week)

Harel Shein (harel.shein@gmail.com)
2024-01-22 12:15:48

*Thread Reply:* ah! I can look into it now.

:gratitude_thank_you: Michael Robinson
Harel Shein (harel.shein@gmail.com)
2024-01-22 12:45:31

*Thread Reply:* I'm failing to find the credentials for PyPI anywhere.

Harel Shein (harel.shein@gmail.com)
2024-01-22 12:45:38

*Thread Reply:* @Maciej Obuchowski any ideas? (git blame shows you wrote that part, and @Michael Collado did some circleCI setup at some point)

Harel Shein (harel.shein@gmail.com)
2024-01-22 12:46:09

*Thread Reply:* I just sent a reset password email to whoever registered for the openlineage user

Michael Robinson (michael.robinson@astronomer.io)
2024-01-22 13:15:34

*Thread Reply:* Thanks @Harel Shein. Can confirm I didn’t get it

Julien Le Dem (julien@apache.org)
2024-01-22 14:21:06

*Thread Reply:* The password should be in the CircleCI context right?

Harel Shein (harel.shein@gmail.com)
2024-01-22 14:23:23

*Thread Reply:* yes, it's there. I was trying to avoid writing a job that prints it out

👍 Julien Le Dem
Harel Shein (harel.shein@gmail.com)
2024-01-22 14:23:57

*Thread Reply:* that will be the fallback if no one responds 🙂

Harel Shein (harel.shein@gmail.com)
2024-01-22 14:38:29

*Thread Reply:* alright, I've set up 2FA and added a few more emails to the PyPI account as fallback.

Harel Shein (harel.shein@gmail.com)
2024-01-22 14:39:09

*Thread Reply:* unfortunately, there's only one Trusted Publisher for PyPI, which is GH Actions. so we'll have to use the API token route. PR incoming soon

Harel Shein (harel.shein@gmail.com)
2024-01-22 14:43:39

*Thread Reply:* didn't need to make any changes. I updated the circle context and re-ran the PyPI release - we're back to :large_green_circle:

Harel Shein (harel.shein@gmail.com)
2024-01-22 14:43:46

*Thread Reply:* ^ @Michael Robinson FYI

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-01-22 15:11:54
Michael Robinson (michael.robinson@astronomer.io)
2024-01-22 15:14:47

*Thread Reply:* thank you @Harel Shein. Releasing the jars now. Everything looks good

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-22 17:16:55

*Thread Reply:* Great!

tati (tatiana.alchueyr@astronomer.io)
2024-01-22 10:45:15

@tati has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-01-23 11:03:11

It looks like my laptop bit the dust today so might miss the sync

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-23 11:29:38

Do I have to create an account to join the meeting?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-23 11:31:23

*Thread Reply:* turns out you can just pass your mail

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-23 11:31:57

*Thread Reply:* same on my side

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-23 11:32:06

*Thread Reply:* i don't remember if i have one

Harel Shein (harel.shein@gmail.com)
2024-01-23 11:34:22
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-23 11:35:09

*Thread Reply:* use the same e-mail you were invited with

Harel Shein (harel.shein@gmail.com)
2024-01-23 13:21:41

there's a data warehouse (https://www.firebolt.io/) and a streaming platform (https://memphis.dev/) written in Go. so I guess it's not futile to write a Go client? 🙂

🤔 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-01-23 15:40:21

Any potential issues with scheduling a meetup on Tuesday, March 19th in Boston that you know of? The Astronomer all-hands is the preceding week

👍 Harel Shein, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-01-23 16:06:35

PR to add 1.8 release notes to the docs needs a review: https://github.com/OpenLineage/docs/pull/274

Michael Robinson (michael.robinson@astronomer.io)
2024-01-23 16:18:19

*Thread Reply:* thanks @Jakub Dardziński

🙂 Jakub Dardziński
jayant joshi (itsjayantjoshi@gmail.com)
2024-01-24 01:10:16

@jayant joshi has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-01-24 15:19:34

Feedback requested on this draft of the year-in-review issue of the newsletter: https://docs.google.com/document/d/1MJB9ughykq9O8roe2dlav6d8QbHZBV2A0bTkF4w0-jo/edit?usp=sharing. Did you give a talk that isn't in the talks section? Is there an important release that should be in the releases section but isn't? Other feedback? Please share.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-24 16:37:33

Feedback requested on a new page for displaying the ecosystem survey results: https://github.com/OpenLineage/docs/pull/275. The image was created for us by Amp. @Julien Le Dem @Harel Shein @tati

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-25 06:01:50

*Thread Reply:* Looks great!

Harel Shein (harel.shein@gmail.com)
2024-01-25 09:00:27

*Thread Reply:* Very cool indeed! I wonder if we should share the raw data as well?

Harel Shein (harel.shein@gmail.com)
2024-01-25 09:00:53

*Thread Reply:* Maybe if you could share it here first @Michael Robinson ?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-25 10:17:58

*Thread Reply:* Yes, planning to include a link to the raw data as well and will share here first

Michael Robinson (michael.robinson@astronomer.io)
2024-01-25 10:28:10

*Thread Reply:* @Harel Shein thanks for the suggestion. Lmk if there's a better way to do this, but here's a link to Google's visualizations: https://docs.google.com/forms/d/1j1SyJH0LoRNwNS1oJy0qfnDn_NPOrQw_fMb7qwouVfU/viewanalytics. And a .csv is attached. Would you use this link on the page or link to a spreadsheet instead?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-26 11:47:27

*Thread Reply:* Going with linking to Google's charts for the raw data for now. LMK if you'd prefer another format, e.g. Google sheet

Harel Shein (harel.shein@gmail.com)
2024-01-26 11:48:39

*Thread Reply:* was just looking at it, looks great @Michael Robinson!

:gratitude_thank_you: Michael Robinson
tati (tatiana.alchueyr@astronomer.io)
2024-01-29 17:27:28

*Thread Reply:* Excellent work, @Michael Robinson! 👏 👏 👏

♥️ Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-01-29 20:49:25

*Thread Reply:* Thank you!

Michael Robinson (michael.robinson@astronomer.io)
2024-01-24 16:38:11
😍 Julien Le Dem, Paweł Leszczyński, Maciej Obuchowski, Harel Shein, Ross Turk
Julien Le Dem (julien@apache.org)
2024-01-24 18:36:40

This looks nice!

:gratitude_thank_you: Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-01-26 15:37:06

Thanks for any feedback on the Mailchimp version of the newsletter special issue before it goes out on Monday:

🙌 Paweł Leszczyński, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-01-26 15:41:53

*Thread Reply:* @Jakub Dardziński @Maciej Obuchowski is the Airflow Provider stuff still current?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-26 15:43:44

*Thread Reply:* yeah, looks current

:gratitude_thank_you: Michael Robinson
Laurent Paris (laurent@datakin.com)
2024-01-27 08:55:40

@Laurent Paris has joined the channel

👋 Maciej Obuchowski, Harel Shein, Michael Robinson, Julien Le Dem, Paweł Leszczyński, Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-29 11:47:19

Hey, created issue around using Spark metrics to increase operational ease of using Spark integration, feel free to comment: https://github.com/OpenLineage/OpenLineage/issues/2402

💯 Harel Shein, tati, Paweł Leszczyński
Lohit VijayaRenu (lohit@stripe.com)
2024-01-29 16:34:08

@Lohit VijayaRenu has joined the channel

👋 Julien Le Dem, Maciej Obuchowski, Michael Robinson, Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2024-01-30 12:54:19

Flyte is offering a 20-25-minute speaking slot at their community meeting on March 6th at 9 am PT. They'd like it to be a general introduction to OpenLineage

Harel Shein (harel.shein@gmail.com)
2024-01-30 12:57:07

*Thread Reply:* I can take it if no one else is interested. I’ll be doing a lot of intro to OL presentations in the next few weeks, so I’ll be very practiced by then :)

🔥 Maciej Obuchowski, Paweł Leszczyński, Jakub Dardziński, Julien Le Dem, Michael Robinson, Willy Lulciuc
👍 Julien Le Dem, Michael Robinson
Emili Parreno (emili@stripe.com)
2024-01-31 08:01:36

@Emili Parreno has joined the channel

👋 Maciej Obuchowski, Harel Shein
Harel Shein (harel.shein@gmail.com)
2024-01-31 11:54:02

@Paweł Leszczyński / @tati I'm expecting at least 5 pictures from the meetup today! 😄

😅 tati
Julien Le Dem (julien@apache.org)
2024-01-31 12:27:17

*Thread Reply:* Each!

‼️ Harel Shein
😅 tati
Michael Robinson (michael.robinson@astronomer.io)
2024-01-31 13:29:36

*Thread Reply:* could we also get a signup sheet and headcount please? 😬

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 10:44:28

Hi, is there any reason not to perform a release today as scheduled? I know we released 1.8 only one week ago, but it's the first of the month and @Damien Hawes's PR #2390 to add support for Scala 2.12 and 2.13 in Spark, along with fixes in the Spark and Flink integrations, are unreleased. Would it make more sense to wait for Damien's PR #2395?

Damien Hawes (damien.hawes@booking.com)
2024-02-01 10:45:00

*Thread Reply:* This isn't ready.

👍 Michael Robinson, Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-02-01 10:45:17

*Thread Reply:* I'm working on enabling integration tests for the Scala 2.13 variants.

Damien Hawes (damien.hawes@booking.com)
2024-02-01 10:45:52

*Thread Reply:* It will take some time, probably Monday / Tuesday next week is my ETA.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 10:46:29

*Thread Reply:* Thanks @Damien Hawes, no pressure. But if early next week is your estimate I think it makes sense to wait. So this is GTK

Damien Hawes (damien.hawes@booking.com)
2024-02-01 10:44:31

@Damien Hawes has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 12:28:15

Decathlon showed part of one of their graphs last night

❤️ Willy Lulciuc, Harel Shein, Maciej Obuchowski, Julien Le Dem
Willy Lulciuc (willy@datakin.com)
2024-02-01 12:28:56

*Thread Reply:* marquez in the wild! 💯💯🚀. thanks for sharing!

🚀 Harel Shein, Maciej Obuchowski, Paweł Leszczyński
Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 12:29:13

*Thread Reply:* some metrics too

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 12:29:35

*Thread Reply:* they're doing anomaly detection successfully

Willy Lulciuc (willy@datakin.com)
2024-02-01 12:30:07

*Thread Reply:* all using Marquez 👀

Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 12:30:08

*Thread Reply:* eg, deprecated tables still being used or tables written in multiple locations

❤️ Willy Lulciuc, Harel Shein
Willy Lulciuc (willy@datakin.com)
2024-02-01 12:31:12

*Thread Reply:* this needs to become a blog post!

➕ Michael Robinson, Harel Shein, Paweł Leszczyński
Harel Shein (harel.shein@gmail.com)
2024-02-01 13:24:04

*Thread Reply:* Would love to get the slides if they are willing to share!

➕ Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-02-01 17:54:08

*Thread Reply:* They've said yes to a blog post. This presentation gets us closer to starting in earnest. I've asked for the slides. Too bad the Confluent organizer wasn't supportive of recording. Maybe next time

❤️ Willy Lulciuc, Harel Shein
Harel Shein (harel.shein@gmail.com)
2024-02-02 09:32:33

Congratulations to our new committer on the team @Damien Hawes!!

🔥 Maciej Obuchowski, Paweł Leszczyński, Jakub Dardziński, Michael Robinson, Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-02 09:46:17

*Thread Reply:* hey @Damien Hawes, now you can push your branches into origin and the integration tests are automatically approved 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-02 09:46:41

*Thread Reply:* Thank you for the congratulations @Harel Shein. It is humbling to be nominated and accepted.

@Maciej Obuchowski - haha, thanks!

Michael Robinson (michael.robinson@astronomer.io)
2024-02-02 10:47:44

*Thread Reply:* Congratulations @Damien Hawes! Thank you for all your contributions

Willy Lulciuc (willy@datakin.com)
2024-02-03 01:00:49

*Thread Reply:* Congrats @Damien Hawes! 💯 💯 🙏

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-02 15:24:09

https://opensource.googleblog.com/2024/02/announcing-google-season-of-docs-2024.html maybe we should improve our docs? 🙂

Google Open Source Blog
Josh Fischer (josh@joshfischer.io)
2024-02-03 20:13:49

@Josh Fischer has joined the channel

Michael Robinson (michael.robinson@astronomer.io)
2024-02-05 17:00:55

Agenda items or discussion topics for Thursday's TSC? @Julien Le Dem @Harel Shein

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-05 17:58:33

*Thread Reply:* I can take 5 minutes explaining Spark job hierarchy

👍 Julien Le Dem, Michael Robinson, Harel Shein
Julien Le Dem (julien@apache.org)
2024-02-05 17:59:03

*Thread Reply:* Nothing specific on my end

👍 Michael Robinson
Harel Shein (harel.shein@gmail.com)
2024-02-06 11:02:30

*Thread Reply:* more updates on the Spark side of things? @Paweł Leszczyński / @Damien Hawes may want to talk about the recent additions? we could also discuss the proposals for circuit breakers / metrics?

Michael Robinson (michael.robinson@astronomer.io)
2024-02-06 13:20:42

Anyone have an opinion about creating an OpenLineage "company" rather than a group on LinkedIn? You can get metrics from LinkedIn's API if you have a company rather than a group.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-06 13:37:19

*Thread Reply:* Airflow does it: https://www.linkedin.com/company/apache-airflow/

linkedin.com
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-06 13:37:34

*Thread Reply:* Spark too https://www.linkedin.com/company/apachespark/

linkedin.com
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-06 13:37:39

*Thread Reply:* I think that's a good idea

Harel Shein (harel.shein@gmail.com)
2024-02-06 13:56:52

*Thread Reply:* ➕

Michael Robinson (michael.robinson@astronomer.io)
2024-02-06 14:47:39

*Thread Reply:* Cool, thanks

Michael Robinson (michael.robinson@astronomer.io)
2024-02-06 14:50:11

We have agreement from Astronomer to move ahead with the Orbit changes discussed today in the committers sync. So I'll start on the exports asap.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-06 16:51:20

Please follow our new OpenLineage company page on LinkedIn. Evidently, the only way to join the company is to add it to your experience history.

linkedin.com
❤️ Willy Lulciuc, Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2024-02-08 20:27:04

FYI: deadline Feb 25th https://2024.berlinbuzzwords.de/call-for-papers/

🙏 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2024-02-09 02:30:17

*Thread Reply:* @Peter Hicks want to submit a column-level lineage talk?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-09 07:19:14

*Thread Reply:* we'll probably submit something about OpenLineage/Streaming with @Paweł Leszczyński

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-09 07:19:28

*Thread Reply:* @Willy Lulciuc want to come to Berlin? 😄

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-09 07:20:47

*Thread Reply:* we were thinking with Maciej about some kind of Flink & Streaming around OpenLineage talk, as this can be interesting to the community. I'll prepare an abstract next week

Michael Robinson (michael.robinson@astronomer.io)
2024-02-09 09:03:58

*Thread Reply:* updated the talks project on github

👍 Maciej Obuchowski
Willy Lulciuc (willy@datakin.com)
2024-02-09 14:20:52

*Thread Reply:* @Maciej Obuchowski i wish, i’d need a sponsor 😅

Michael Robinson (michael.robinson@astronomer.io)
2024-02-09 09:09:01

This open-source community management tool looks interesting as a supplement to Orbit: https://ossinsight.io/analyze/OpenLineage/OpenLineage#overview

👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-02-09 09:18:47

*Thread Reply:* you can compare projects side-by-side, which is something Orbit doesn't offer

Michael Robinson (michael.robinson@astronomer.io)
2024-02-09 09:28:50
Harel Shein (harel.shein@gmail.com)
2024-02-13 10:47:27

Metaplane added an airflow provider to send data lineage data to. It's basically a new connection that extends BaseHook, and users need to proactively send callbacks, not sure why they took that approach. https://www.metaplane.dev/blog/airflow-integration

metaplane.dev
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-13 10:56:04

*Thread Reply:* If you're doing all that work, you could actually just use dag_policy to add it to all the dags automatically
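
A minimal sketch of the dag_policy idea mentioned above, assuming Airflow's cluster policy mechanism (a `dag_policy(dag)` function defined in `airflow_local_settings.py`). The callback name `notify_lineage_backend` is hypothetical and only stands in for whatever a vendor callback would do; this is not Metaplane's or OpenLineage's actual code.

```python
# airflow_local_settings.py (sketch)
# Assumption: Airflow picks up a function named `dag_policy` from this module
# and calls it for each DAG at load time.


def notify_lineage_backend(context):
    """Hypothetical callback; a real one would ship run metadata somewhere."""
    print(f"lineage callback for {context.get('dag_run')}")


def dag_policy(dag):
    """Attach the callback to every DAG that does not already define one."""
    if getattr(dag, "on_success_callback", None) is None:
        dag.on_success_callback = notify_lineage_backend
    if getattr(dag, "on_failure_callback", None) is None:
        dag.on_failure_callback = notify_lineage_backend
```

With something like this in place, DAG authors would not need to wire callbacks into each DAG by hand. DAGs that already set their own callbacks are left untouched.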

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-13 10:57:09

*Thread Reply:* https://docs.metaplane.dev/docs/airflow#dag-and-task-lineage I think that's fairly... unsophisticated approach?

Metaplane
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-13 10:58:30

*Thread Reply:* However, if I was redoing Airflow integration from scratch, I'd seriously rethink using connections instead of OPENLINEAGE_URL or configuring it the way we did

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-13 11:00:36

*Thread Reply:* The plugin could load up custom transport types and generate connection types based out of it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-13 11:02:13

*Thread Reply:* Just curious why they did not use listeners... it's not like it's new feature now, it has been there for like 5 minor releases

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-13 11:42:12

*Thread Reply:* we still could add support from Airflow connections with some sort of deprecation of OPENLINEAGE_URL and current way

Kacper Muda (kacper.muda@getindata.com)
2024-02-15 10:22:24

Hey, i was working on a PR updating the docs for python, java and airflow (probably spark is next), and it hit me that we still have those in two places: README.md inside the package and the openlineage.io site. Both contain much the same information, sometimes the site has more (e.g. airflow). Do you think it would be a good idea to just put a redirect to the site in the README.md files for the packages? Maybe add some brief description at the top and then redirect the user to the site? In the long term, maintaining both and keeping them in sync is not an optimal solution imho.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-15 10:36:10

*Thread Reply:* I agree -- in addition to the maintenance burden there's the risk posed by out-of-date/conflicting info for users

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-15 14:38:01

*Thread Reply:* I've wanted to remove README.md files as user-facing docs for some time.

However, it might be worth keeping those (maybe under a different name) for purely internal development docs - not related to external APIs, but like - use this incantation to compile the integration.

Kacper Muda (kacper.muda@getindata.com)
2024-02-19 06:57:09

*Thread Reply:* I added the redirect here https://github.com/OpenLineage/OpenLineage/pull/2448

There was not much information about the compilation and other internal stuff, so i think first they need to be created and then we can keep them inside the package files under some different name, as Maciej mentioned.

Kacper Muda (kacper.muda@getindata.com)
2024-02-22 09:43:46

*Thread Reply:* Bump on this one ^

Michael Robinson (michael.robinson@astronomer.io)
2024-02-16 11:31:58

2023 OpenLineage Survey Analysis/takeaways What surprised you or struck you as notable in the 2023 survey data? What would you like to see added, changed or removed in the 2024 version? I need your help to ensure we get the most useful and actionable insights we can from this exercise. I created a doc for sharing opinions/comments, but I'd be happy to discuss it in any forum. Here's the doc, including some initial takeaways, as a starting point: https://docs.google.com/document/d/1aiKtKjcFU0AjS46cow6cbx8EvzV0P2KnGQ4rFD4qKLM/edit?usp=sharing.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-20 06:48:35

@Damien Hawes can we share Gradle plugins that you've implemented in Spark buildSrc with Flink integration too? It would cut down on boilerplate, but not sure how can we do this (without copying code) 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-20 13:20:20

*Thread Reply:* You have to publish the plugin to the Gradle Plugin repository

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 05:33:33

*Thread Reply:* Seems like we could publish the plugin to the local dir first: https://docs.gradle.org/current/userguide/plugins.html#sec:custom_plugin_repositories

Willy Lulciuc (willy@datakin.com)
2024-02-20 14:51:53

@Maciej Obuchowski in our OL spec, we require _producer and _schemaURL, but in our airflow provider, we only send the _producer. Was this an intentional omission of _schemaURL?

Willy Lulciuc (willy@datakin.com)
2024-02-20 14:59:31

ahh i think I was confused given it’s marked as _base_skip_redact here, but looks like _schemaURL is being added… need to verify

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:27:31

*Thread Reply:* schemaURL is always sent, the thing is it’s incorrect in many cases AFAIR

Willy Lulciuc (willy@datakin.com)
2024-02-20 15:28:34

*Thread Reply:* yeah… we’re validating events and most (if not all) don’t have that field populated. does the airflow provider not set it?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:29:12

*Thread Reply:* airflow provider uses facets from openlineage-python package

Willy Lulciuc (willy@datakin.com)
2024-02-20 15:29:39

*Thread Reply:* > thing is it’s incorrect in many cases AFAIR more curious about this comment. why would this be the case?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:31:52

*Thread Reply:* tbh I’m not sure, URL for the base schema is correct only for https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json and facets in it (which is not too many). and for others it was just not done, or mistakenly repeated with the same pattern?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:32:11

*Thread Reply:* dunno, looks more like historical approach that wasn’t adjusted at some point

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:32:40

*Thread Reply:* and.. schemaURL is apparently irrelevant since no one validated it and reported issues

Willy Lulciuc (willy@datakin.com)
2024-02-20 15:36:48

*Thread Reply:* is there an open issue for this? I feel we should have some guidance here or not require it, but we should decide what we do here

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:38:49

*Thread Reply:* elephant in the room

😅 Willy Lulciuc, Maciej Obuchowski
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:39:11

*Thread Reply:* no, I don’t think there is one

Willy Lulciuc (willy@datakin.com)
2024-02-20 15:39:53

*Thread Reply:* ok, I’ll open one. we’re ingesting events into kafka and were incorrectly applying validation to events based on the spec

Willy Lulciuc (willy@datakin.com)
2024-02-20 15:40:27

*Thread Reply:* @Julien Le Dem ping to address the elephant above 😉 we’re talking about in this thread

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:40:43

*Thread Reply:* I see. btw generating python facets from json schema would be best solution, it was too complex so far ;_;

👍 Willy Lulciuc
Willy Lulciuc (willy@datakin.com)
2024-02-20 15:41:18

*Thread Reply:* ok, I’ll summarize our discussion in the issue

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-20 15:41:27

*Thread Reply:* thanks Willy!

🙏 Willy Lulciuc
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-22 15:19:30

*Thread Reply:* @Willy Lulciuc double-checking - given the example of SQLJobFacet should the schema URL be set to: https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json or https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json#/$defs/SQLJobFacet or https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json#/$defs/SQLJobFacet?

or in other words - should it be the same as it is in Java client? 🙂 which is the last one
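
A minimal sketch of the kind of event validation discussed in this thread, assuming events arrive as plain dicts and that only presence of `_producer` and `_schemaURL` on each facet is checked (not whether the URL is the correct one, which is the open question above). The function name, the sample event and its values are hypothetical; it only looks at `run.facets`, `job.facets` and the `facets` of each input and output dataset.

```python
def facets_missing_metadata(event):
    """Return (facet_path, missing_key) pairs for facets lacking required keys."""
    problems = []

    def check(facets, prefix):
        for name, facet in (facets or {}).items():
            for key in ("_producer", "_schemaURL"):
                if not facet.get(key):
                    problems.append((f"{prefix}.{name}", key))

    check(event.get("run", {}).get("facets"), "run.facets")
    check(event.get("job", {}).get("facets"), "job.facets")
    for side in ("inputs", "outputs"):
        for i, dataset in enumerate(event.get(side, [])):
            check(dataset.get("facets"), f"{side}[{i}].facets")
    return problems


# Hypothetical event: the sql facet carries _producer but no _schemaURL.
event = {
    "eventType": "COMPLETE",
    "run": {"runId": "hypothetical-run-id", "facets": {}},
    "job": {
        "namespace": "example",
        "name": "example_job",
        "facets": {
            "sql": {
                "_producer": "https://example.com/hypothetical-producer",
                "query": "SELECT 1",
            }
        },
    },
    "inputs": [],
    "outputs": [],
}

print(facets_missing_metadata(event))  # [('job.facets.sql', '_schemaURL')]
```

A check like this only reports missing fields; deciding which of the three URL forms above is the right value would still need the guidance requested in the thread.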

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 06:46:42

Abdallah (Decathlon) made a release request today. https://openlineage.slack.com/archives/C01CK9T7HKR/p1708514231690979

} Abdallah (https://openlineage.slack.com/team/U05HBLE7YPL)
Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:13:14

*Thread Reply:* I would ask that if we make a release today, we include the Scala 2.13 support for Spark (and merge the PR for the docs)

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:13:50

*Thread Reply:* I guess we have to decide: is splitting Iceberg off from the main code a reason to hold back the release?

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:14:12

*Thread Reply:* Thoughts @Maciej Obuchowski?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:19:42

*Thread Reply:* I think we can do a release today/tomorrow, but having functionality removed makes this much harder choice

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:24:34

*Thread Reply:* What functionality would be removed?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:27:12

*Thread Reply:* Iceberg support? Or do you propose something else?

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:29:02

*Thread Reply:* Oh, I was under the impression that Iceberg support wasn't going to be removed. Instead the direct dependencies on Iceberg in the core code were being removed, and bundled into their own module, but at the end of the day, the project would still contain classes capable of dealing with Iceberg.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:31:11

*Thread Reply:* oh, okay

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:31:26

*Thread Reply:* I understood that without https://github.com/OpenLineage/OpenLineage/pull/2437/files there will be no support for Iceberg for 2.13

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:31:44

*Thread Reply:* so I guess we can go and then follow up with next release soon?

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:32:16

*Thread Reply:* There should still be Iceberg support for 2.13

Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:32:58

*Thread Reply:* If there wasn't, that would hurt us, and by us, I mean the team I belong to @ Booking and my partner teams.

👍 Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-02-21 07:34:10

*Thread Reply:* Basically, if I understand Mattia's direction, we want to say:

OK, OpenLineage has been tested against these versions of Iceberg and found to be working.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 09:42:13

*Thread Reply:* Just to confirm, are we waiting for this one https://github.com/OpenLineage/OpenLineage/pull/2446?

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:05:29

*Thread Reply:* That one can be merged

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:05:56

*Thread Reply:* We're waiting for comments on this: https://openlineage.slack.com/archives/C01CK9T7HKR/p1708349868363669

} Maciej Obuchowski (https://openlineage.slack.com/team/U01RA9B5GG2)
Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:06:05

*Thread Reply:* No-one has left comments

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:06:35

*Thread Reply:* Which means, at least in my opinion, no-one else has anything to say.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 10:07:26

*Thread Reply:* +1. It's been over 48 hrs, so seems safe to go ahead

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:09:14

*Thread Reply:* @Maciej Obuchowski - I'm going to merge that PR, ye?

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:09:43

*Thread Reply:* (First it needs an approval)

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 10:09:51

*Thread Reply:* Pawel is OOO, I believe

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:10:04

*Thread Reply:* Aye, but I believe @Maciej Obuchowski can approve.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 10:19:30

*Thread Reply:* :gh_approved:

Damien Hawes (damien.hawes@booking.com)
2024-02-21 10:20:29

*Thread Reply:* :gh_merged:

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 10:41:41

*Thread Reply:* Working on the changelog now

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 10:42:43

*Thread Reply:* oops, forgot about the release vote. please +1

Damien Hawes (damien.hawes@booking.com)
2024-02-21 11:44:32

*Thread Reply:* 👍

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 07:19:16
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-05 15:28:11

*Thread Reply:* it got merged 👀

Harel Shein (harel.shein@gmail.com)
2024-03-05 15:31:33

*Thread Reply:* amazing feedback on a 10k line PR 😅

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-05 15:32:09

*Thread Reply:* maybe they have policy that feedback starts from 10k lines

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-05 15:32:15

*Thread Reply:* it wasn’t enough

Harel Shein (harel.shein@gmail.com)
2024-03-05 15:32:20

*Thread Reply:* 🙈

Harel Shein (harel.shein@gmail.com)
2024-03-05 15:32:32

*Thread Reply:* too big to review, LGTM

☝️ Jakub Dardziński
Damien Hawes (damien.hawes@booking.com)
2024-02-21 11:55:31

I just noticed this. shared should not have a dependency on spark. 👀

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 12:45:00

*Thread Reply:* sounds like an easy fix? we have time

Damien Hawes (damien.hawes@booking.com)
2024-02-21 12:48:21

*Thread Reply:* Easy fix? Yeah - not at all.

😟 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 13:58:34

*Thread Reply:* Alright, putting the release on hold, then

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 14:29:44

*Thread Reply:* yeah - we could end up with Spark 2 dependency when using it in Spark 3 context, and that's not good

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 14:33:01

*Thread Reply:* oh, unless you mean compile time dependency on Spark 2

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 14:35:00

*Thread Reply:* then no, we need to have it, everything in lifecycle package depends on it 🙂

The idea is that it contains code common to all Spark versions, that Spark itself mostly does not change - and have spark2/3/... directories for things that specifically diverge from baseline

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 14:36:59

*Thread Reply:* I assume Paweł meant it should not have a Spark dependency in the sense that it should not depend on a particular Spark version

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 16:39:58

*Thread Reply:* Correct me if I'm wrong, but it sounds safe to proceed. So here's the changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2452

Damien Hawes (damien.hawes@booking.com)
2024-02-22 04:01:10

*Thread Reply:* Yes

Damien Hawes (damien.hawes@booking.com)
2024-02-22 04:01:12

*Thread Reply:* It's safe

👍 Michael Robinson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:48:44

*Thread Reply:* @Michael Robinson let's wait with release till we solve the intermittent test failing issue https://openlineage.slack.com/archives/C065PQ4TL8K/p1708609977907359

} Maciej Obuchowski (https://openlineage.slack.com/team/U01RA9B5GG2)
👍 Michael Robinson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-21 16:06:05

Any idea/preference where to put very specific doc information? For example: if you're running Spark in this specific way, do this. Separate doc page seems like overkill, but I'm not sure where to put something like this where it would be discoverable

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 16:44:33

*Thread Reply:* Maybe the information could go on a new stub about "Special Cases" or something (not a great title but don't know the use case). That way the page isn't just about the exceptional case?

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 17:00:08

*Thread Reply:* Can we not trust search?

Willy Lulciuc (willy@datakin.com)
2024-02-21 17:05:03

@Michael Robinson do we know when the next release will be going out? I’d sneak in some feedback this week on: • https://github.com/OpenLineage/OpenLineage/pull/2371 • and some of the circuit breaker work if it’s not too late @Paweł Leszczyński @Maciej Obuchowski 😉

Michael Robinson (michael.robinson@astronomer.io)
2024-02-21 17:15:41

*Thread Reply:* it almost happened today, so tomorrow? 🤞

👍 Willy Lulciuc
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 05:15:51

*Thread Reply:* still provide feedback anyway @Willy Lulciuc!

❤️ Willy Lulciuc, Paweł Leszczyński
Willy Lulciuc (willy@datakin.com)
2024-02-22 20:42:36

*Thread Reply:* will do! I’ll get some feedback in tmr

👍 Maciej Obuchowski, Paweł Leszczyński
Kacper Muda (kacper.muda@getindata.com)
2024-02-22 08:47:05

Hey, i made a PR that updates the OL Airflow Provider documentation (removing outdated stuff, moving some from current OL docs, adding new info). It's not a small one, but i think it's worth the time. Any feedback is highly appreciated, let me know if something is missing, is not clear or simply wrong 🙂

🙌 Jakub Dardziński, Harel Shein
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 08:52:57

@Damien Hawes there have been some random test failures after merging the last 2.13 PR, for example https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9437/workflows/4eda8d67-3bd1-4527-84fa-88c19e6774bd/jobs/179622
```
> Task :app:copyIntegrationTestFixtures

Too long with no output (exceeded 10m0s): context deadline exceeded
```
hanging on `copyIntegrationTestFixtures`?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:09:24

*Thread Reply:* That's a weird one

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:14:15

*Thread Reply:* Ah, it happened also before latest PR https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/179357

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:15:38

*Thread Reply:* I wonder if the disk is full of that particular executor

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:15:42

*Thread Reply:* and that's why it fails?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:20:09

*Thread Reply:* It's always failing at the :app:copyIntegrationTestFixtures step

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:21:11

*Thread Reply:* Because I'm not able to replicate this on my local

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:22:45

*Thread Reply:* yeah I can't replicate that either

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:26:22

*Thread Reply:* I'm rerunning with SSH on CI, will take a look at disk space

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:26:46

*Thread Reply:* OK. I was literally about to edit the CI to run df -H

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:27:23

*Thread Reply:*
circleci@ip-10-0-52-168:~$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        146G   13G  133G   9% /
devtmpfs         7.7G     0  7.7G   0% /dev
tmpfs            7.7G     0  7.7G   0% /dev/shm
tmpfs            1.6G  836K  1.6G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs            7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p15  105M  6.1M   99M   6% /boot/efi

Damien Hawes (damien.hawes@booking.com)
2024-02-22 09:28:27

*Thread Reply:* And if you run df -H /home/circleci/openlineage/integration/spark

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:30:07

*Thread Reply:*
circleci@ip-10-0-52-168:~$ df -H /home/circleci/openlineage/integration/spark
Filesystem  Size  Used Avail Use% Mounted on
/dev/root   156G   15G  142G  10% /

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:30:18

*Thread Reply:* also 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:32:43

*Thread Reply:* I guess disk usage is increasing, but very slowly?
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem  1K-blocks      Used  Available Use% Mounted on
/dev/root   152243760  14273256  137954120  10% /
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem  1K-blocks      Used  Available Use% Mounted on
/dev/root   152243760  14273436  137953940  10% /
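
To put a number on "very slowly", here is a small sketch that compares the two `df` samples quoted above. It assumes plain `df` output with the default 1K-blocks column layout; the helper name `parse_df_used_kib` is hypothetical.

```python
def parse_df_used_kib(df_output):
    """Return the 'Used' column (in 1K-blocks) from the last line of `df` output."""
    data_line = df_output.strip().splitlines()[-1]
    # Columns: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
    return int(data_line.split()[2])


# The two samples from the message above.
first = """Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 152243760 14273256 137954120 10% /"""
second = """Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 152243760 14273436 137953940 10% /"""

growth_kib = parse_df_used_kib(second) - parse_df_used_kib(first)
print(f"disk usage grew by {growth_kib} KiB between samples")  # 180 KiB
```

A difference of 180 KiB between samples supports the reading that the copy step was progressing, just slowly, rather than filling the disk.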

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:34:28

*Thread Reply:* the max memory seems very small?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:37:15

*Thread Reply:* let's try it? https://github.com/OpenLineage/OpenLineage/pull/2454

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:41:33

*Thread Reply:* ⬆️ ⬆️ it died without filling the disk

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:07:31

*Thread Reply:* Seems to be running again

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:11:30

*Thread Reply:* I'm going to push a change to your branch

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:11:41

*Thread Reply:* To see if we can skip the copy

👍 Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:24:11

*Thread Reply:* Pushed

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:25:41

*Thread Reply:* even before, I reran it with SSH and it managed to copy the dependencies after all

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:25:50

*Thread Reply:* it took a lot of time tho

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:26:14

*Thread Reply:* Could you tell which dependencies it was copying?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:26:21

*Thread Reply:* Like the fixture dependency

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:26:28

*Thread Reply:* or the container dependencies?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:26:33

*Thread Reply:* not really, I just looked at df numbers

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:26:36

*Thread Reply:* circleci@ip-10-0-113-50:~$ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 156G 15G 141G 10% /
devtmpfs 8.3G 0 8.3G 0% /dev
tmpfs 8.3G 0 8.3G 0% /dev/shm
tmpfs 1.7G 857k 1.7G 1% /run
tmpfs 5.3M 0 5.3M 0% /run/lock
tmpfs 8.3G 0 8.3G 0% /sys/fs/cgroup
/dev/nvme0n1p15 110M 6.4M 104M 6% /boot/efi
circleci@ip-10-0-113-50:~$ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 156G 22G 135G 14% /
devtmpfs 8.3G 0 8.3G 0% /dev
tmpfs 8.3G 0 8.3G 0% /dev/shm
tmpfs 1.7G 1.3M 1.7G 1% /run
tmpfs 5.3M 0 5.3M 0% /run/lock
tmpfs 8.3G 0 8.3G 0% /sys/fs/cgroup
/dev/nvme0n1p15 110M 6.4M 104M 6% /boot/efi

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:30:56

*Thread Reply:* I wonder if the "copyIntegrationTestFixtures" was a red herring

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:31:03

*Thread Reply:* And it's actually the "copyDependencies" step

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:31:43

*Thread Reply:* Because the fixtures JAR is tiny

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:32:08

*Thread Reply:* It should take seconds, at most.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:35:11

*Thread Reply:* your PR failed on your favorite step, spotless 😂

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:35:20

*Thread Reply:* Yeah ...

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:35:34

*Thread Reply:* One of these days, I am going to make a pre-commit or something

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:35:41

*Thread Reply:* or pre-push

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:37:03

*Thread Reply:* we have pre-commit config but it's focused on Python parts https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2969c8f/.pre-commit-config.yaml#L1

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:37:24

*Thread Reply:* spotless is such a low hanging fruit tho...

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:37:38

*Thread Reply:* I did it for one of my local repos

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:37:50

*Thread Reply:* But I didn't get it quite right

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:38:01

*Thread Reply:* Perhaps I should run "spotlessCheck" on commit

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:38:33

*Thread Reply:* At least that way, I know my commit will not be committed if the check fails

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:38:46

*Thread Reply:* https://github.com/jguttman94/pre-commit-gradle

Stars
15
Language
Python
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 10:46:02

*Thread Reply:* 😞 https://github.com/pre-commit/pre-commit/issues/1110#issuecomment-518939116

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:47:21

*Thread Reply:* Or I should just use intellij to commit with the reformat code option

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:47:24

*Thread Reply:* instead of the CLI

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:57:06

*Thread Reply:* Interesting ...

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:57:19

*Thread Reply:* This is still getting stuck

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:59:02

*Thread Reply:* I wonder

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:59:04

*Thread Reply:* I wonder

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:59:14

*Thread Reply:* I wonder if it is the download of the archives

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:59:30

*Thread Reply:* Those archives are "big"

Damien Hawes (damien.hawes@booking.com)
2024-02-22 10:59:34

*Thread Reply:* like >300MB

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:03:55

*Thread Reply:* it moved

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:04:09

*Thread Reply:* looks like :app:copySparkBinariesSpark350Scala212 that's right after it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:05:49

*Thread Reply:* 300mb should not take that much anyway

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:06:14

*Thread Reply:* It's not the copying

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:06:21

*Thread Reply:* It's the downloading from the Apache archive

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:06:41

*Thread Reply:* https://archive.apache.org

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:06:47

*Thread Reply:* It can be really slow at times

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:07:59

*Thread Reply:* The main archive of all public software releases of the Apache Software Foundation. This is simply a copy of the main distribution directory with the only difference that nothing will be ever removed over here. If you are looking for current software releases, please visit one of our numerous mirrors. Do note that heavy use of this service will result in immediate throttling of your download speeds to either 12 or 6 mbps for the remainder of the day, depending on severity. Continuous abuse (to the tune of more than 40 GB downloaded per week) will cause an automatic ban, so please tune your services to this fact. 🤔
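A back-of-the-envelope sketch of what that throttling could mean for one archive. It assumes "mbps" in the notice means megabits per second and that the archive is exactly 300 MB (decimal); neither is spelled out above, so treat the numbers as rough:

```python
def download_seconds(size_mb: float, rate_mbps: float) -> float:
    """Seconds needed to transfer size_mb megabytes at rate_mbps megabits per second."""
    # 1 megabyte = 8 megabits (assumption: decimal units on both sides).
    return size_mb * 8 / rate_mbps


for rate in (12, 6):
    secs = download_seconds(300, rate)
    print(f"{rate} mbps: {secs:.0f} s (~{secs / 60:.1f} min)")
```

Under those assumptions a single archive takes a few minutes at the throttled rates, and the build downloads more than one of them.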

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:08:31

*Thread Reply:* lmao

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:09:01

*Thread Reply:* Another reason why we need a container registry

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:09:18

*Thread Reply:* then we'll get rate limited by it 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:09:42

*Thread Reply:* The problem is the mirrors don't contain all of the versions of Spark

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:09:47

*Thread Reply:* I think they go back to 3.3.4

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:12:22

*Thread Reply:* I think we could start using quay.io. For a task:

  1. check if docker image with particular tag exists
  2. if not, build it using Apache archive and push it to quay
  3. run the tests
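The steps above could be sketched roughly as below. This is only an illustration, not what the CI does: `ensure_image`, `run_command` and the injectable `run` argument are hypothetical names, and it assumes `docker manifest inspect` is available and exits non-zero when the tag is missing.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def ensure_image(image: str, build_dir: str, run: Runner = run_command) -> List[str]:
    """Build and push `image` only when the registry does not already have it.

    Returns the list of steps taken so the flow is easy to inspect.
    """
    steps = ["check"]
    # Step 1: check whether the tag exists (assumed: non-zero exit means missing).
    if run(["docker", "manifest", "inspect", image]) != 0:
        # Step 2: build from the local context (which would fetch the archive) and push.
        if run(["docker", "build", "-t", image, build_dir]) != 0:
            raise RuntimeError(f"build failed for {image}")
        steps.append("build")
        if run(["docker", "push", image]) != 0:
            raise RuntimeError(f"push failed for {image}")
        steps.append("push")
    # Step 3 (running the tests) is left to the caller.
    return steps
```

Passing a fake `run` makes it possible to exercise the flow without Docker installed.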
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:13:01

*Thread Reply:* quay.io does not restrict anonymous pulls against its repositories (either public or private) and only rate limits in the most severe circumstances to maintain service levels (e.g. tens of requests per second from the same IP address).

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:14:38

*Thread Reply:* Aye, but who do we speak to in order to provision that?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:17:39

*Thread Reply:* Maybe, just maybe, we can use circle ci's cache mechanism >.>

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:20:37

*Thread Reply:* I wonder, @Maciej Obuchowski - the GCP project that exists, can we not use the container registry in that one?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:21:48

*Thread Reply:* quay.io is free, GCR is paid and not that cheap https://cloud.google.com/artifact-registry/pricing

Google Cloud
Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:22:04

*Thread Reply:* Is quay.io free?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:22:11

*Thread Reply:* I see this: https://quay.io/plans/

quay.io
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:22:53

*Thread Reply:* Public repositories are always free. 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:23:01

*Thread Reply:* oooooh

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:25:42

*Thread Reply:* I'm setting up an OpenLineage organization that we could push the images to

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:25:48

*Thread Reply:* LMAO

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:25:55

*Thread Reply:* I was doing that myself

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:27:42

*Thread Reply:* Though, I don't know an "organization email" to use

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:29:59

*Thread Reply:* well, I was faster, I got the openlineage name 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:30:18

*Thread Reply:* Added one of mine, it's possible to change it later

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:32:35

*Thread Reply:* OK.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:32:50

*Thread Reply:* it should be possible now to log in using docker login -u="${QUAY_ACCOUNT_ID}" -p="${QUAY_ACCOUNT_TOKEN}" quay.io from CI task which is marked integration-tests

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:33:22

*Thread Reply:* OK. But I'll need to build those images first, and push them.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:33:26

*Thread Reply:* and push to https://quay.io/repository/openlineage/spark

quay.io
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:33:49

*Thread Reply:* as in, the user QUAY_ACCOUNT_ID has permission to write there

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:34:19

*Thread Reply:* Can you add me to the org, so I can push the images?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:34:41

*Thread Reply:* (I still have the binaries downloaded on my local)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:35:37

*Thread Reply:* using email damien.hawes@booking.com?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:36:06

*Thread Reply:* check the mail 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:36:09

*Thread Reply:* ye

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:50:51

*Thread Reply:* OK

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:50:53

*Thread Reply:* Pushing the images

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:52:46

*Thread Reply:* I can see spark-3.2.4-scala-2.13

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:52:53

*Thread Reply:* yup

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:53:00

*Thread Reply:* Next one is almost there

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:54:08

*Thread Reply:* These are some chunky images.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:54:20

*Thread Reply:* Earlier I was thinking of pushing it on CI: checking if the Spark tag exists, then creating the image and pushing it if it does not

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:54:39

*Thread Reply:* but we need to do this only once

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:54:43

*Thread Reply:* Exactly

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:55:04

*Thread Reply:* and if we want to add support for another Spark/Scala version, we still need to do some work manually

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:55:10

*Thread Reply:* so I guess this does not matter?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:55:27

*Thread Reply:* The process is fairly trivial.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 11:55:39

*Thread Reply:* but it would still be good to have documentation on how to create the image, so you're not bothered by questions next time 🙂

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:55:40

*Thread Reply:* And we can make an openlineage-spark-docker directory

👍 Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:55:56

*Thread Reply:* Place the gradle in there

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:56:03

*Thread Reply:* put in a README.md

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:56:07

*Thread Reply:* etc

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:56:23

*Thread Reply:* It should be a project that changes very rarely

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:58:03

*Thread Reply:* Good find on the quay.io btw

Damien Hawes (damien.hawes@booking.com)
2024-02-22 11:58:12

*Thread Reply:* Really nice one

Damien Hawes (damien.hawes@booking.com)
2024-02-22 12:02:17

*Thread Reply:* OK. All images have been pushed.

👍 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:09:08

*Thread Reply:* yep, I can see all of them

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:09:35

*Thread Reply:* People still love to use 2.4.8 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:14:49

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2455 should be enough to check if it works?

Labels
integration/spark
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:15:23

*Thread Reply:* > It should be a project that changes very rarely
we should be building those images on minor Spark releases too

Damien Hawes (damien.hawes@booking.com)
2024-02-22 12:16:10

*Thread Reply:* Yes - that's what the spline folks did

Damien Hawes (damien.hawes@booking.com)
2024-02-22 12:16:44

*Thread Reply:* but they never supported 2.13

Damien Hawes (damien.hawes@booking.com)
2024-02-22 12:40:27

*Thread Reply:* LMAO

Damien Hawes (damien.hawes@booking.com)
2024-02-22 12:40:41

*Thread Reply:* WARN tc.quay.io/openlineage/spark:spark-3.3.4-scala-2.12 - The architecture 'arm64' for image 'quay.io/openlineage/spark:spark-3.3.4-scala-2.12' (ID sha256:dbdc0c8a3e1b182004c3c850c2ecb767b76cc14e55e3e994a34356630e689e86) does not match the Docker server architecture 'amd64'. This will cause the container to execute much more slowly due to emulation and may lead to timeout failures.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:48:33

*Thread Reply:* 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 12:48:39

*Thread Reply:* you have only arm containers locally?

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:03:19

*Thread Reply:* Yeah

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:03:21

*Thread Reply:* It's an M1

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:03:36

*Thread Reply:* I'm pushing amd64 containers at the moment

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 13:04:15

*Thread Reply:* yeah it adds another dimension to the problem

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:06:39

*Thread Reply:* First pushed

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:06:47

*Thread Reply:* They'll probably take like 15 minutes or so

🙏 Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:18:32

*Thread Reply:* OK

Damien Hawes (damien.hawes@booking.com)
2024-02-22 13:18:33

*Thread Reply:* done

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 13:42:43

*Thread Reply:* not sure it did exactly what we want but probably okay for now

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 13:43:09

*Thread Reply:* it would be best if this was multi-arch build I think

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 13:43:22

*Thread Reply:* anyway it's not a job for now, rerunning the tests and I'm finishing for today

Damien Hawes (damien.hawes@booking.com)
2024-02-22 17:29:43

*Thread Reply:* OK - I have extracted the logic for building the images. It now also performs a multi-platform build, targeting linux/amd64 and linux/arm64. This should be enough for the CI/CD pipeline, folks developing with Linux and folks developing with Apple ARM chips.

The PR is here: https://github.com/OpenLineage/OpenLineage/pull/2456

The images with multiple manifests are here:

https://quay.io/repository/openlineage/spark?tab=tags
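For readers unfamiliar with multi-platform builds, here is a rough sketch of the kind of command involved. It is not the Gradle task from the PR; `buildx_command` is a hypothetical helper that only assembles a `docker buildx build` invocation for the two platforms named above.

```python
from typing import List, Sequence


def buildx_command(
    image: str,
    context: str,
    platforms: Sequence[str] = ("linux/amd64", "linux/arm64"),
) -> List[str]:
    """Assemble a multi-platform build-and-push command (not executed here)."""
    return [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),  # one manifest entry per platform
        "--tag", image,
        "--push",  # push straight to the registry after building
        context,
    ]


print(" ".join(buildx_command("quay.io/openlineage/spark:spark-3.3.4-scala-2.12", ".")))
```

Building for both platforms under one tag is what lets an amd64 CI runner and an Apple ARM laptop pull the same image name without emulation.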

Labels
documentation, integration/spark
Assignees
@d-m-h
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-23 05:34:35

*Thread Reply:* To explain the situation for people not following the issue.

We had a problem with CI where the download of a 300MB archive from archive.apache.org took over 10 minutes, probably because we were rate limited. That failed our integration tests and blocked the release.

We used those archives to create the Docker images used for integration tests: a compiled Spark of a particular version built with a particular version of Scala.

The solution was to manually prebuild the images and push them to a free quay.io repository. This is not a problem, since bumping the version of Spark that we test on also requires manual action, and because @Damien Hawes provided a Gradle task to complete the work.

I've created an openlineage organization on quay.io where we can push the images, and any other images we might want, for example Jupyter already configured with the Spark integration, to make it easier for people to experiment with OpenLineage.

If no one has any philosophical problems with that solution, I would like to see a few committer volunteers added as admins to the quay.io organization, to increase the bus factor. @Julien Le Dem @Paweł Leszczyński @Damien Hawes - do you want to be added there?

Damien Hawes (damien.hawes@booking.com)
2024-02-23 06:14:29

*Thread Reply:* @Maciej Obuchowski - sure.

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2024-02-23 19:59:19

*Thread Reply:* yes! Thank you Maciej. As long as there’s clear doc and an easy one liner to update those, this sounds good to me. I think we need to pay extra attention on limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo). Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 02:41:11

*Thread Reply:* Could we add image building and storing in quay.io as part of our CI when the needed image is not present there?

I would love to have some info in the docs on what has to be done to support Spark 3.6 once it gets released. Especially, how can one publish a 3.6 image to quay.io?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 05:11:55

*Thread Reply:* > Could we add image building and storing in quay.io as part of our CI when needed image is not present there?
We discussed this and decided it's already required to do manual work to support an additional Spark version, so automating this won't give us much
> I would love to have some info in the docs what has to be done to support Spark 3.6 once it gets released. Especially, how can one publish 3.6 image to quay.io?
There is a readme here: https://github.com/OpenLineage/OpenLineage/pull/2456/files#diff-44ca475a04d6a92886f82dd27b47d30c8e57f518aa3dbc467feef43ec1c57638

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-02-26 05:27:26

*Thread Reply:* thanks Maciej

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 05:39:58

*Thread Reply:* > I think we need to pay extra attention on limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo).
@Julien Le Dem Yes, only authorized users (committers) can upload images. CI won’t write images, just read them; they would be pushed by committers before execution
> Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
Docker verifies the SHA of downloaded images. Do you mean some additional mechanism to avoid potential problems with a compromised committer?

Julien Le Dem (julien@apache.org)
2024-02-26 16:34:29

*Thread Reply:* The sha could be saved in the repo and compared so that it can not be changed independently by someone who would have gained access to the credentials.
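A minimal sketch of that idea, with loud assumptions: `PINNED_DIGESTS` stands in for a hypothetical file committed to the repo, the all-zero digest is a placeholder rather than the digest of any real image, and how the actual digest is obtained after a pull is left outside the sketch.

```python
from typing import Mapping

# Hypothetical pin list that would live in the repo. The digest is a
# placeholder, not the digest of any real image.
PINNED_DIGESTS = {
    "quay.io/openlineage/spark:spark-3.3.4-scala-2.12": "sha256:" + "0" * 64,
}


def check_pinned(
    image: str,
    actual_digest: str,
    pinned: Mapping[str, str] = PINNED_DIGESTS,
) -> None:
    """Raise if the pulled image's digest differs from the one pinned in the repo."""
    expected = pinned.get(image)
    if expected is None:
        raise KeyError(f"no pinned digest for {image}")
    if actual_digest != expected:
        raise ValueError(
            f"digest mismatch for {image}: expected {expected}, got {actual_digest}"
        )
```

An alternative with the same effect is to reference images by digest (`repository@sha256:...`) instead of by tag, so that Docker itself refuses anything else.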

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-27 08:05:21

*Thread Reply:* @Julien Le Dem that would make a lot of sense if the same commit that changes the images could not change the SHA 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-27 08:05:51

*Thread Reply:* unless we've made those SHAs part of something external, for example CircleCI config

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-27 08:06:28

*Thread Reply:* but TBH I think it's low risk, CircleCI would limit us fast if somebody, for example, put a crypto miner there

Julien Le Dem (julien@apache.org)
2024-02-28 20:28:49

*Thread Reply:* to me the risk is more to introduce vulnerabilities/backdoors in the OpenLineage released artifact through pushing a cached image that modifies the result of the build.

Julien Le Dem (julien@apache.org)
2024-02-28 20:30:29

*Thread Reply:* The idea of saving the image signature in the repo is that you cannot use a new image in the build without creating a new commit, which gives traceability.

Michael Robinson (michael.robinson@astronomer.io)
2024-02-22 09:28:00

CFP closes on April 30 https://events.linuxfoundation.org/open-source-summit-europe/

LF Events
👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-22 09:49:42

*Thread Reply:* Only a few days after Airflow Summit September 10-12, 2024 🤔

Michael Robinson (michael.robinson@astronomer.io)
2024-02-22 15:48:04

New communication channel: https://medium.com/@openlineageproject

Medium
👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-02-22 15:48:19

*Thread Reply:* More to come...

Willy Lulciuc (willy@datakin.com)
2024-02-23 03:19:10

I might be the only one running into this issue (for now). I plan on opening a PR for a fix this weekend, but if someone wants to pick it up you're more than welcome to: https://github.com/OpenLineage/OpenLineage/issues/2458

Labels
proposal
👍 Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-02-23 13:08:55

This is a GCP-specific question, but does anyone know the answer? https://openlineage.slack.com/archives/C01CK9T7HKR/p1708010167626709?thread_ts=1707920807.530409&cid=C01CK9T7HKR

ldacey (https://openlineage.slack.com/team/U05NMJ0NBUK)
Damien Hawes (damien.hawes@booking.com)
2024-02-25 07:52:08

FYI: The release of 1.9.0 did not go through

Issue: https://github.com/OpenLineage/OpenLineage/issues/2467 PR: https://github.com/OpenLineage/OpenLineage/pull/2468

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 05:47:35

*Thread Reply:* The biggest problem with the release, as always, is that you can't test it in any other way than running it 😞

Damien Hawes (damien.hawes@booking.com)
2024-02-25 08:13:45

It also seems like, despite the fact that the spark step completed, there was a silent failure or something. I don't see the artefacts on central.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 05:47:08

*Thread Reply:* @Michael Robinson needs to manually promote them

Michael Robinson (michael.robinson@astronomer.io)
2024-02-26 08:57:04

*Thread Reply:* @Damien Hawes looking into this today

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-02-26 08:58:14

*Thread Reply:* @Michael Robinson you can rerun release now

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-02-26 13:54:06

*Thread Reply:* @Damien Hawes the release is out. comms to follow shortly

Michael Robinson (michael.robinson@astronomer.io)
2024-02-26 15:45:18

Feedback and input requested for this month's newsletter. I've added sections for the Flink and Spark integrations. Please lmk what you think about the "highlights" I've chosen for these and for Airflow if you have a moment between now and EOD Wednesday. Thanks. https://docs.google.com/document/d/15caPR4q7dOPs6co2x0q5hYSX65ZhHHdhKrFP7dVSRPI/edit?usp=sharing

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-26 15:54:33

*Thread Reply:* Would it be okay to highlight 2 new committers? 🙂

Michael Robinson (michael.robinson@astronomer.io)
2024-02-26 15:55:14

*Thread Reply:* 👆why I ask for input! thank you @Jakub Dardziński

👍 Jakub Dardziński
Harel Shein (harel.shein@gmail.com)
2024-02-26 16:24:51

Snowflake taking a play from the Databricks playbook? https://www.snowflake.com/en/data-cloud/horizon/

snowflake.com
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-02-27 11:16:28

gotta skip today's meeting. I hope to see you all next week!

Julien Le Dem (julien@apache.org)
2024-02-27 12:11:00

The meetup I mentioned about OpenLineage/OpenTelemetry: https://x.com/J_/status/1565162740246671360 I speak in English but the other two speakers speak in Hebrew

X (formerly Twitter)
🙏 Willy Lulciuc
Harel Shein (harel.shein@gmail.com)
2024-02-28 10:20:23

*Thread Reply:* thanks for sharing that, that otel to ol comparison is going to be very useful for me today :)

Michael Robinson (michael.robinson@astronomer.io)
2024-02-28 13:18:03

Could use another pair of eyes on this month's newsletter draft if anyone has time today

🙌 Paweł Leszczyński, Maciej Obuchowski
Kacper Muda (kacper.muda@getindata.com)
2024-02-28 15:00:46

*Thread Reply:* LGTM 🙂

:gratitude_thank_you: Michael Robinson
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-05 14:07:46

Hey, I created new Airflow AIP. It proposes instrumenting Airflow Hooks and Object Storage to collect dataset updates automatically, to allow gathering lineage from PythonOperator and custom operators. Feel free to comment on Confluence https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-62+Getting+Lineage+from+Hook+Instrumentation or on Airflow mailing list: https://lists.apache.org/thread/5chxcp0zjcx66d3vs4qlrm8kl6l4s3m2

🙌 Kacper Muda, Harel Shein, Paweł Leszczyński
Kacper Muda (kacper.muda@getindata.com)
2024-03-06 05:25:42

Hey, does anyone want to add anything here (PR that adds AWS MSK IAM transport)? It looks like it's ready to be merged.

:gh_approved: Maciej Obuchowski
:gh_merged: Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-03-06 10:14:34

did we miss a step in publishing 1.9.1? going here (https://search.maven.org/remote_content?g=io.openlineage&a=openlineage-spark&v=LATEST) gives me the 1.8 release

Harel Shein (harel.shein@gmail.com)
2024-03-06 10:17:30

*Thread Reply:* oh, this might be related to having 2 scala versions now, because I can see the 1.9.1 artifacts

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-06 10:17:35

*Thread Reply:* yes

Harel Shein (harel.shein@gmail.com)
2024-03-06 10:17:48
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-06 10:18:22

*Thread Reply:* another place 🙂

Harel Shein (harel.shein@gmail.com)
2024-03-06 10:19:01

*Thread Reply:* yup

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-06 10:19:44

*Thread Reply:* https://github.com/OpenLineage/docs/pull/299

Harel Shein (harel.shein@gmail.com)
2024-03-06 10:22:25

*Thread Reply:* thx :gh_merged:

Michael Robinson (michael.robinson@astronomer.io)
2024-03-06 13:33:45

Hi, here's a tentative agenda for next week's TSC (on Wednesday at 9:30 PT):

  1. Announcements including @Peter Huang's election, Kafka Summit talk, Data Council panel, Boston meetup
  2. Recent release 1.9.1 highlights
  3. Expanded Scala support in Spark overview @Damien Hawes
  4. Circuit breaker in Spark & Flink, built-in lineage in Spark @Paweł Leszczyński
  5. Discussion items
  6. Open discussion

Am I forgetting anything? Have a discussion item or want to do a demo? 🙂 Let me know. I'll also make a slide deck whether or not I can join next week and share it here. Reminders will go out today, and I believe links, meeting info and invites are all up to date. Please let me know if you spot incorrect meeting info anywhere.
Harel Shein (harel.shein@gmail.com)
2024-03-06 13:46:49

*Thread Reply:* I thought @Paweł Leszczyński wanted to present?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-06 13:51:06

*Thread Reply:* What was the topic? Protobuf or built-in lineage maybe? Or the many docs improvements lately?

Harel Shein (harel.shein@gmail.com)
2024-03-06 13:53:31
Michael Robinson (michael.robinson@astronomer.io)
2024-03-06 13:55:44

*Thread Reply:* Imagine there are lots of folks who would be interested in a presentation on that

Harel Shein (harel.shein@gmail.com)
2024-03-06 13:58:15

*Thread Reply:* I think so too 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-07 02:22:25

*Thread Reply:* There are two things worth presenting: circuit breaker and/or built-in lineage (once it gets merged).

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 09:08:15

*Thread Reply:* updating the agenda

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:07:06

is there a reason why facet objects have _schemaURL property but BaseEvent has schemaURL?

Willy Lulciuc (willy@datakin.com)
2024-03-06 16:07:34

*Thread Reply:* yeah, we use _ to avoid naming conflicts in a facet

👍 Julien Le Dem
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:07:34

*Thread Reply:* same goes for producer

Julien Le Dem (julien@apache.org)
2024-03-06 16:08:18

*Thread Reply:* Facets have user defined fields. So all base fields are prefixed

Julien Le Dem (julien@apache.org)
2024-03-06 16:08:27

*Thread Reply:* Base events do not

Willy Lulciuc (willy@datakin.com)
2024-03-06 16:08:30

*Thread Reply:* it should be made clearer… recently ran into the issue when validating OL events

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:09:38

*Thread Reply:* it might be another missing point but we set _producer in BaseFacet:
def __attrs_post_init__(self) -> None:
    self._producer = PRODUCER
but we don’t do that for producer in BaseEvent

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:09:52

*Thread Reply:* is this supposed to be like that?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:09:57

*Thread Reply:* I’m kinda lost 🙂

Julien Le Dem (julien@apache.org)
2024-03-06 16:10:43

*Thread Reply:* We should set producer in baseevent as well

☝️ Jakub Dardziński
Julien Le Dem (julien@apache.org)
2024-03-06 16:11:35

*Thread Reply:* The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:13:02

*Thread Reply:* > The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
right, it doesn’t require adding _, it just helps to tell the two apart

and also for this reason:
> Facets have user defined fields. So all base fields are prefixed
> Base events do not

Julien Le Dem (julien@apache.org)
2024-03-06 16:13:34

*Thread Reply:* Since users can create custom facets with whatever fields, we just tell them that “_**” is reserved.

Julien Le Dem (julien@apache.org)
2024-03-06 16:13:55

*Thread Reply:* So the underscore prefix is a mechanism specific to facets
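To make the reserved-prefix rule concrete, here is a small sketch of a check a custom facet author could run. `reserved_field_names` is a hypothetical helper, not part of the OpenLineage client, and it only encodes the convention described above (user-defined facet fields should not start with an underscore).

```python
from typing import Iterable, List


def reserved_field_names(custom_fields: Iterable[str]) -> List[str]:
    """Return the custom facet field names that collide with the reserved '_' prefix."""
    return [name for name in custom_fields if name.startswith("_")]


# A custom facet that accidentally defines its own "_producer" field.
print(reserved_field_names(["rowCount", "_producer", "sourceTable"]))
```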

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:14:04

*Thread Reply:* 👍

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:15:19

*Thread Reply:* last question: we don’t want to block users from setting their own _producer field? it seems the only way now is to use openlineage.client.facet.set_producer method to override default, you can’t just do RunEvent(…, _producer='my_own')

Julien Le Dem (julien@apache.org)
2024-03-06 16:17:11

*Thread Reply:* The idea is the producer identifies the code that generates the metadata. So you set it once and all the facets you generate have the same
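
The "set it once" pattern can be sketched roughly like this. It is a stand-in built on dataclasses, not the client's actual code: the placeholder URLs and the `RowCountFacet` class are assumptions, and only the names `PRODUCER`, `set_producer` and `BaseFacet` echo what was mentioned in this thread.

```python
from dataclasses import dataclass, field

# Stand-in for the module-level default; the URL is a placeholder.
PRODUCER = "https://example.com/default-producer"


def set_producer(producer: str) -> None:
    """Override the default producer for every facet created afterwards."""
    global PRODUCER
    PRODUCER = producer


@dataclass
class BaseFacet:
    # Excluded from __init__, so callers cannot pass _producer directly.
    _producer: str = field(init=False)

    def __post_init__(self) -> None:
        # Every facet picks up whatever the module-level value is right now.
        self._producer = PRODUCER


@dataclass
class RowCountFacet(BaseFacet):  # hypothetical custom facet
    rows: int = 0
```

Calling `set_producer(...)` once at import time of an integration module is then enough for all facets that module builds to share the same producer.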

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:17:54

*Thread Reply:* mhm, probably you don’t need to use several producers (at least) per Python module

👍 Julien Le Dem
Julien Le Dem (julien@apache.org)
2024-03-06 16:18:09

*Thread Reply:* Yep

Julien Le Dem (julien@apache.org)
2024-03-06 16:18:39

*Thread Reply:* In airflow each provider should have its own for the facets they produce

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:18:43

*Thread Reply:* just searched for set_producer in current docs - no results 😨

😅 Julien Le Dem
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:19:55

*Thread Reply:* a number of things will get to the right track after I’m done with generating code 🙂

Julien Le Dem (julien@apache.org)
2024-03-06 16:20:54

*Thread Reply:* Thanks for looking into that. If you can fix the doc by adding a paragraph about that, that would be helpful

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:21:38

*Thread Reply:* I can create an issue at least 😂

👍 Julien Le Dem
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-06 16:23:44

*Thread Reply:* there you go: https://github.com/OpenLineage/docs/issues/300 if I missed something please comment

Assignees
<a href="https://github.com/JDarDagran">@JDarDagran</a>
Harel Shein (harel.shein@gmail.com)
2024-03-06 17:24:05

I feel like our getting started with openlineage page is mostly a getting started with Marquez page. but I'm also not sure what should be there otherwise.

openlineage.io
Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 09:00:37

*Thread Reply:* https://openlineage.io/docs/guides/spark ?

openlineage.io
Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 09:03:54

*Thread Reply:* Unfortunately it's probably not that "quick" given the setup required..

Michael Robinson (michael.robinson@astronomer.io)
2024-03-07 09:04:30

*Thread Reply:* Maybe better? https://openlineage.io/docs/integrations/spark/quickstart/quickstart_local

openlineage.io
Harel Shein (harel.shein@gmail.com)
2024-03-07 12:21:18

*Thread Reply:* yeah, that's where I was struggling as well. should our quickstart be platform specific? that also feels strange.

Damien Hawes (damien.hawes@booking.com)
2024-03-07 10:35:46

Quick question, for the spark.openlineage.facets.disabled property, why do we need to include [;] in the value? Why can't we use , to act as the delimiter? Why do we need [ and ] to enclose the string?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 13:22:42

*Thread Reply:* There was some concrete reason AFAIK right @Paweł Leszczyński?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-08 02:23:02

*Thread Reply:* We do have a logic that converts Spark conf entries to OpenLineageYaml without a need to understand its content. I think [] was added for this reason to know that Spark conf entry has to be translated into an array.

Initially disabled facets were just separated by ; . Why not a comma? I don't remember if there was any problem with this.

https://github.com/OpenLineage/OpenLineage/pull/1271/files -> this PR introduced it

https://github.com/OpenLineage/OpenLineage/blob/1.9.1/integration/spark/app/src/main/java/io/openlineage/spark/agent/ArgumentParser.java#L152 -> this code checks whether a Spark conf value is of array type
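
The idea of the [] marker can be sketched as below. This is an illustration in Python of the rule as described in this message, not a port of ArgumentParser.java; `parse_conf_value` is a hypothetical helper, and the facet names in the usage note are only example values.

```python
from typing import List, Union


def parse_conf_value(raw: str) -> Union[str, List[str]]:
    """Treat values wrapped in [] as ;-separated arrays; return anything else unchanged."""
    value = raw.strip()
    if value.startswith("[") and value.endswith("]"):
        inner = value[1:-1]
        # Empty items (e.g. from "[]" or a trailing ";") are dropped.
        return [item.strip() for item in inner.split(";") if item.strip()]
    return value
```

So a value such as `[spark_unknown;spark.logicalPlan]` becomes a two-element list without the converter having to know which config key it belongs to, while a plain string such as a URL passes through untouched.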

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-07 15:27:02

Hi team, do we have any proposal or previous discussion of Trino OpenLineage integration?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-07 15:30:33

*Thread Reply:* There is old third-party integration: https://github.com/takezoe/trino-openlineage

It has right idea to use EventListener, but I can't vouch if it works

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-07 15:34:18

*Thread Reply:* Thanks. We are investigating the integration in our org. It will be a good starting point 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-07 15:38:13

*Thread Reply:* I think the ideal solution would be to use EventListener. So far we only have very basic integration in Airflow's TrinoOperator

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-07 15:39:24

*Thread Reply:* The only thing I haven't really checked out is what the real possibilities are for EventListener in terms of catalog details discovery, e.g. what's the database connection for the catalog.

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-07 17:00:41

*Thread Reply:* Thanks for calling this out. We will evaluate and post some observations in the thread.

Alok (a_prusty@apple.com)
2024-03-07 18:54:22

*Thread Reply:* Thanks Peter. Hey Maciej/Jakub, could you please share the process to follow for contributing a Trino OpenLineage integration? (Design doc and issue?)

There was an issue for trino integration but it was closed recently. https://github.com/OpenLineage/OpenLineage/issues/164

Labels
integration/trino
Comments
1
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 04:40:51

*Thread Reply:* It would be great to see design doc and maybe some POC if possible. I've reopened the issue for you.

If you get agreement around the design I don't think there are more formal steps needed, but maybe @Julien Le Dem has other idea

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 06:36:25

*Thread Reply:* Trino has their plugins directory btw: https://github.com/trinodb/trino/tree/master/plugin including event listeners like: https://github.com/trinodb/trino/tree/master/plugin/trino-mysql-event-listener

Alok (a_prusty@apple.com)
2024-03-08 13:40:01

*Thread Reply:* Thanks Maciej and Jakub. Yes, the integration will be done with Trino’s event listener framework, which has details around the query, source and destination datasets, etc.

> It would be great to see design doc and maybe some POC if possible. I’ve reopened the issue for you.

Thanks for re-opening the issue. We will add the design doc and POC to the issue.

Julien Le Dem (julien@apache.org)
2024-03-12 17:57:55

*Thread Reply:* I agree with @Maciej Obuchowski, a quick design doc followed by a POC would be great. The integration could either live in OpenLineage or Trino but that can be discussed after the POC.

👍 Alok
Julien Le Dem (julien@apache.org)
2024-03-12 17:58:24

*Thread Reply:* (obviously, adding it to the trino repo would require approval from the trino community)

Mariusz Górski (gorskimariusz13@gmail.com)
2024-03-22 09:45:34

*Thread Reply:* Gentlemen, we are also actively looking into this topic with the same repo from @takezoe as our base. I have submitted a PR to revive this project - it does work, the POC is there in the form of a docker-compose.yaml deployment 🙂 some obvious things are missing for now (like kafka output instead of api) but I think it's a good starting point and it's compatible with latest trino and OL

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-26 15:54:02

*Thread Reply:* Thanks for putting the foundation in place for the implementation. Based on it, I feel @Alok would still participate and contribute to it. How about creating a design doc and listing all of the possible TBDs, as @Julien Le Dem suggested?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-30 09:29:24

*Thread Reply:* Adding @takezoe to this thread. Thanks for your work on a Trino integration and welcome!

❤️ Mariusz Górski
Harel Shein (harel.shein@gmail.com)
2024-04-02 09:11:57

*Thread Reply:* throwing the CFP for the Trino conference here in case any one of the contributors want to present there https://sessionize.com/trino-fest-2024

sessionize.com
Harel Shein (harel.shein@gmail.com)
2024-04-02 09:12:43

*Thread Reply:* I'm also very happy to help with an idea for an abstract

Alok (a_prusty@apple.com)
2024-04-02 12:26:51

*Thread Reply:* Hey Harel Just FYI we are already engaged with Trino community to have a talk around Trino open lineage integration and have submitted an Abstract for review.

🎉 Harel Shein
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-02 12:33:55

*Thread Reply:* once you release the integration, please add a reference about it to OpenLineage docs! https://github.com/OpenLineage/docs

Website
<https://openlineage.io>
Stars
9
👍 Alok, Mariusz Górski
Mariusz Górski (gorskimariusz13@gmail.com)
2024-04-02 12:51:54

*Thread Reply:* I think it's ready for review https://github.com/trinodb/trino/pull/21265 just with API sink integration, additional features can be added at @Alok's convenience as next PRs

Labels
docs
Comments
12
🎉 Michael Robinson
👍 Alok
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 11:12:43

Hey, there’s discrepancy between in Airflow. Docs say it completely blocks emitting OL events on operator class level. The actual behaviour is that it only blocks metadata extraction (so for instance it doesn’t call Snowflake DB for SnowflakeOperator). My question is what should be desired behaviour. Thoughts so far:

  1. current name indicates it should block emission (similar to disabled option)
  2. imo it doesn’t make sense to emit empty events with basic Airflow info only - from OL perspective it’s way more informative to attach inputs/outputs information Thanks for any opinion!
👍 Maciej Obuchowski
🤔 Maciej Obuchowski
tati (tatiana.alchueyr@astronomer.io)
2024-03-08 11:14:07

*Thread Reply:* I believe we should not extract or emit any open lineage events if this option is used

➕ Kacper Muda
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 11:14:07

*Thread Reply:* I'm for option 2, don't send any event from task

tati (tatiana.alchueyr@astronomer.io)
2024-03-08 11:44:54

*Thread Reply:* @Jakub Dardziński do you see any use case for not extracting metadata but still emitting events?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 11:53:31

*Thread Reply:* The use case AFAIK was old SnowflakeOperator bug, we wanted to disable the collection there, since it zombified the task. The events being emitted still gave information about status of the task as well as non-dataset related metadata

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-08 11:53:38

*Thread Reply:* but I think it's less relevant now

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 11:58:09

*Thread Reply:* ^ this and you might want to have information about task execution because OL is a backend for some task-tracking system

tati (tatiana.alchueyr@astronomer.io)
2024-03-08 12:05:01

*Thread Reply:* Hm, I believe users don't expect us to spend time processing/extracting OL events if this configuration is used. It's the documented behaviour

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:09:39

*Thread Reply:* the question is if we should change docs or behaviour

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-08 12:09:45

*Thread Reply:* I believe the latter
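
To make the two behaviours concrete, a rough sketch follows. It is not the Airflow integration's code: `handle_task`, the `disabled` set and the `block_emission` flag are hypothetical, and the extractor is passed in as a plain callable.

```python
from typing import Callable, Dict, List, Optional, Set


def handle_task(
    operator: str,
    disabled: Set[str],
    extract: Callable[[str], Dict[str, List[str]]],
    block_emission: bool,
) -> Optional[Dict[str, object]]:
    """Return the event that would be emitted, or None if nothing is emitted."""
    if operator in disabled:
        if block_emission:
            # Documented behaviour: no extraction and no event at all.
            return None
        # Behaviour observed today: extraction is skipped, a bare event still goes out.
        return {"operator": operator, "inputs": [], "outputs": []}
    # Operator is not disabled: extract metadata and attach it to the event.
    return {"operator": operator, **extract(operator)}
```

Changing the behaviour rather than the docs corresponds to always taking the `block_emission=True` branch for disabled operators.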

tati (tatiana.alchueyr@astronomer.io)
2024-03-08 12:46:39

*Thread Reply:* +1 behaviour

Harel Shein (harel.shein@gmail.com)
2024-03-08 13:39:26

*Thread Reply:* +1

Michael Robinson (michael.robinson@astronomer.io)
2024-03-11 21:08:22

Hi, here's the in progress for Wednesday

Harel Shein (harel.shein@gmail.com)
2024-03-11 22:06:19

*Thread Reply:* Looks like a great agenda! Left a couple of comments

Harel Shein (harel.shein@gmail.com)
2024-03-11 22:06:43

*Thread Reply:* @Michael Robinson will you be able to facilitate or do you need help?

Kacper Muda (kacper.muda@getindata.com)
2024-03-12 05:57:39

*Thread Reply:* I'm also missing from the committer list, but can't comment on slides 🙂

😱 Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:22:16

*Thread Reply:* Sorry about that @Kacper Muda. Gave you access just now

Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:22:37

*Thread Reply:* We probably need to add you to lists posted elsewhere... I'll check

Kacper Muda (kacper.muda@getindata.com)
2024-03-12 11:22:52

*Thread Reply:* No worries, thanks 🙂 !

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 05:54:38

https://github.com/open-metadata/OpenMetadata/pull/15317 👀

🔥 Jakub Dardziński, Harel Shein, Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:23:15

*Thread Reply:* this is awesome

Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:26:48

*Thread Reply:* it looks like they use temporary deployments to test...

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-12 11:43:35

*Thread Reply:* yeah the GitHub history is wild

Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:33:04

Hi, I'm at the conference hotel and my earbuds won't pair with my new mac for some reason. Does the agenda look good? Want to send out the reminders soon. I'll add the OpenMetadata news!

Harel Shein (harel.shein@gmail.com)
2024-03-12 11:42:00

*Thread Reply:* I think we can also add the Datahub PR?

Harel Shein (harel.shein@gmail.com)
2024-03-12 11:47:51

*Thread Reply:* @Paweł Leszczyński prefers to present only the circuit breakers

👍 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:47:55
Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 11:48:09

*Thread Reply:* This one?

Harel Shein (harel.shein@gmail.com)
2024-03-12 11:48:15

*Thread Reply:* yes!

🔥 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-12 13:09:47

It's been a while since we've updated the twitter profile. Current description: "A standard api for collecting Data lineage and Metadata at runtime." What would you think of using our website's tagline: "An open framework for data lineage collection and analysis." Other ideas?

👍 Maciej Obuchowski, Harel Shein, Julien Le Dem, Kacper Muda
✅ Michael Robinson
Harel Shein (harel.shein@gmail.com)
2024-03-13 12:34:32

can someone grant me write access to our forked sqlparser-rs repo?

Harel Shein (harel.shein@gmail.com)
2024-03-13 12:34:41

*Thread Reply:* @Julien Le Dem maybe?

Julien Le Dem (julien@apache.org)
2024-03-13 12:38:24

*Thread Reply:* I should probably add the committer group to it

➕ Harel Shein, Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2024-03-13 12:42:44

*Thread Reply:* I have made the committer group maintainer on this repo

🙏 Harel Shein, Maciej Obuchowski
❤️ Peter Huang
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-13 17:19:20

https://github.com/OpenLineage/OpenLineage/pull/2514 small but mighty 😉

Labels
ci, common
🥳 Kacper Muda, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 11:52:15

Regarding the approved release, based on the additions it seems to me like we should make it a minor release (so 1.10.0). Any objections? Changes are here: https://github.com/OpenLineage/OpenLineage/compare/1.9.1...HEAD

➕ Harel Shein, Paweł Leszczyński
Kacper Muda (kacper.muda@getindata.com)
2024-03-14 14:16:09

We encountered a case of a START event, exceeding 2MB in Airflow. This was traced back to an operator with unusually long arguments and attributes. Further investigation revealed that our Airflow events contain redundant data across different facets, leading to unnecessary bloating of event sizes (those long attributes and args were attached three times to a single event). I proposed to remove some redundant facets and to refine the operator's attributes inclusion logic within AirflowRunFacet. I am not sure how breaking is this change, but some systems might depend on the current setup. Suggesting an immediate removal might not be the best approach, and i'd like to know your thoughts. (A similar problem exists within the Airflow provider.) CC @Maciej Obuchowski @Willy Lulciuc @Jakub Dardziński

https://github.com/OpenLineage/OpenLineage/pull/2509

Labels
integration/airflow, extractor
Comments
1
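
One rough way to see which facets dominate an event's size, assuming the run facets are available as a plain dict, is sketched below. `facet_sizes` is a hypothetical helper, not something that exists in the integration.

```python
import json
from typing import Any, Dict, List, Tuple


def facet_sizes(facets: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Return (facet name, serialized size in bytes) pairs, largest first."""
    sizes = [
        (name, len(json.dumps(body).encode("utf-8")))
        for name, body in facets.items()
    ]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)
```

Running this over a captured START event should make it obvious when the same long operator arguments are serialized under several facets.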
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 15:06:14

As mentioned during yesterday's TSC, we can't get insight into DataHub's integration from the PR description in their repo. And it's a very big PR. Does anyone have any intel? PR is here: https://github.com/datahub-project/datahub/pull/9870

Labels
ingestion, product, devops
Comments
1
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 15:07:51

Changelog PR for 1.10 is RFR: https://github.com/OpenLineage/OpenLineage/pull/2516

Labels
documentation
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 16:20:59

@Julien Le Dem @Paweł Leszczyński Release is failing in the Java client job due to (I think) the version of spotless:
```
Could not resolve com.diffplug.spotless:spotless-plugin-gradle:6.21.0.
Required by:
    project : > com.diffplug.spotless:com.diffplug.spotless.gradle.plugin:6.21.0

No matching variant of com.diffplug.spotless:spotless-plugin-gradle:6.21.0 was found. The consumer was configured to find a library for use during runtime, compatible with Java 8, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '8.4'
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-14 16:53:58

*Thread Reply:* @Michael Robinson https://github.com/OpenLineage/OpenLineage/pull/2517

Labels
client/java
✅ Michael Robinson
🙌 Paweł Leszczyński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 18:17:25

fix to broken main: https://github.com/OpenLineage/OpenLineage/pull/2518

Labels
integration/dagster
Comments
1
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 18:47:34

*Thread Reply:* Thanks, just tried again

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 18:48:38

*Thread Reply:* ? it needs approve and merge 😛

Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 18:50:52

*Thread Reply:* Oh oops disregard

Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 18:50:57

*Thread Reply:* different PR

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 18:51:22

*Thread Reply:* 👍

Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 19:01:47

There's an issue with the Flink job on CI:

* What went wrong:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find io.**********************:**********************_sql_java:1.10.1.
     Searched in the following locations:
       - <https://repo.maven.apache.org/maven2/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom>
       - <https://packages.confluent.io/maven/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom>
       - file:/home/circleci/.m2/repository/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
     Required by:
         project : > project :shared
         project : > project :flink115
         project : > project :flink117
         project : > project :flink118

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-14 19:33:58

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2521

Labels
ci
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-14 19:34:11

*Thread Reply:* @Jakub Dardziński still awake? 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 19:35:42

*Thread Reply:* it’s just approval bot

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-14 19:38:11

*Thread Reply:* created issue on how to avoid those in the future https://github.com/OpenLineage/OpenLineage/issues/2522

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-14 19:39:05

*Thread Reply:* https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/188526 I lack emojis on this server to fully express my emotions

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 19:39:49

*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1710454645059659 you might have missed that

} Jakub Dardziński (https://openlineage.slack.com/team/U02S6F54MAB)
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 19:40:22

*Thread Reply:* merge -> rebase -> problem gone

Michael Robinson (michael.robinson@astronomer.io)
2024-03-15 09:56:16

*Thread Reply:* PR to update the changelog is RFR @Jakub Dardziński @Maciej Obuchowski: https://github.com/OpenLineage/OpenLineage/pull/2526

Labels
documentation
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-14 19:24:46

https://github.com/OpenLineage/OpenLineage/pull/2520 It’s a long-awaited PR - feel free to comment!

Labels
client/python
🎉 Kacper Muda, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-03-14 20:48:21

OpenLineage is trending upward on OSSRank. Please vote!

oss-rank
✅ Jakub Dardziński, Peter Huang
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-15 17:28:36

https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ParentRunFacet.json#L20 here the format is uuid however if you follow logic for parent id in current dbt integration you might discover that parent run facet has assigned value of DAG’s run_id (which is not uuid)

@Julien Le Dem, what has higher priority? I think lots of people are using dbt-ol wrapper with current lineage_parent_id macro

Julien Le Dem (julien@apache.org)
2024-03-19 10:31:30

*Thread Reply:* It is a uuid because it should be the id of an OL run
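
A quick way to check whether a given parent run id would satisfy that format is sketched below, using only the standard library. `is_canonical_uuid` is a hypothetical helper, and the Airflow-style run_id in the usage note is an assumed example value.

```python
import uuid


def is_canonical_uuid(value: str) -> bool:
    """True only for the dashed, 36-character textual form of a UUID."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces and undashed hex; require the canonical form.
    return str(parsed) == value.lower()
```

A freshly generated `str(uuid.uuid4())` passes, while a DAG run_id shaped like `scheduled__2024-03-15T00:00:00+00:00` does not.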

👍 Jakub Dardziński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-18 12:21:37

where can I find who has write access to OL repo?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-18 12:29:00

*Thread Reply:* Settings > Collaborators and teams

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-18 12:33:41

*Thread Reply:* thanks Michael, seems like I don’t have enough permissions to see that

Julien Le Dem (julien@apache.org)
2024-03-19 10:31:57

Sorry, I have a dr appointment today and won’t join the meeting

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-19 10:32:24

*Thread Reply:* I gotta skip too. Maciej and Pawel are at the Kafka Summit

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-19 10:32:36

*Thread Reply:* I hope you’re fine!

Julien Le Dem (julien@apache.org)
2024-03-19 12:24:18

*Thread Reply:* I am fine thank you 🙂

Julien Le Dem (julien@apache.org)
2024-03-19 12:24:20

*Thread Reply:* just a visit

Harel Shein (harel.shein@gmail.com)
2024-03-19 10:35:57

Should we cancel the sync today?

👍 Michael Robinson, Kacper Muda, Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-03-20 10:02:53

looking at XTable today, any thoughts on how we can collaborate with them?

xtable.apache.org
Harel Shein (harel.shein@gmail.com)
2024-03-20 10:09:30

*Thread Reply:* @Julien Le Dem @Willy Lulciuc this reminds me of some ideas we had a few years ago.. :)

Harel Shein (harel.shein@gmail.com)
2024-03-20 10:16:38

*Thread Reply:* hmm.. ok. maybe not that relevant for us, at first I thought this was an abstraction for read/write on top of Iceberg/Hudi/Delta.. but I think this is more of a data sync appliance. would still be relevant for linking together synced datasets (but I don't think it's that important now)

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-20 13:21:26

*Thread Reply:* From the introduction https://www.confluent.io/blog/introducing-tableflow/, it looks like they are using Flink for both data ingestion and compaction. It means we should at least consider supporting hudi source and sink for flink lineage 🙂

Confluent
Michael Robinson (michael.robinson@astronomer.io)
2024-03-21 13:21:29

A key growth metric trending in the right direction:

🚀 Kacper Muda, Harel Shein, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-03-21 14:45:21

Eyes on this PR to add OpenMetadata to the Ecosystem page would be appreciated: https://github.com/OpenLineage/docs/pull/303. TIA! @Mariusz Górski

Comments
1
🚀 Jakub Dardziński, Harel Shein
Harel Shein (harel.shein@gmail.com)
2024-03-21 15:21:55

I really want to improve this page in the docs, anyone wants to work with me on that?

openlineage.io
Harel Shein (harel.shein@gmail.com)
2024-03-21 15:22:40

*Thread Reply:* perhaps also make this part of the PR process, so when we add support for something, we remember to update the docs

➕ Willy Lulciuc, Paweł Leszczyński
Willy Lulciuc (willy@datakin.com)
2024-03-21 15:22:55

*Thread Reply:* I free up next week and would love to chat… obviously, time permitting but the page needs some love ❤️

❤️ Harel Shein, Paweł Leszczyński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-21 15:24:40

*Thread Reply:* I can verify the information once you have some PR 🙂

🙏 Harel Shein
Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 12:53:38

RFR: a PR to add DataHub to the Ecosystem page https://github.com/OpenLineage/docs/pull/304

Comments
1
Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 12:55:17

*Thread Reply:* The description comes from the very brief README in DataHub's GH repo and a glance at the code. No other documentation or resources appear to be available.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 12:58:43

*Thread Reply:* @Tamás Németh

Michael Robinson (michael.robinson@astronomer.io)
2024-03-22 15:57:42

Dagster is launching column-lineage support for dbt using the sqlglot parser https://github.com/dagster-io/dagster/pull/20407

Comments
4
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-22 17:17:03

*Thread Reply:* I kinda like their approach to use post-hooks in order to enable column-level lineage so that custom macro collects information about columns, logs it and they parse the log after the execution

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-22 17:17:32

*Thread Reply:* it doesn’t force dbt docs generate step that some might not want to use

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-22 17:17:57

*Thread Reply:* but at the same time reuses DBT adapter to make additional calls to retrieve missing metadata

Willy Lulciuc (willy@datakin.com)
2024-03-23 14:32:29

@Paweł Leszczyński interesting project I came across over the weekend: https://github.com/HamaWhiteGG/flink-sql-lineage

Stars
323
Language
Java
👍 Julien Le Dem
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-25 03:16:45

*Thread Reply:* Wow, this is something we would love to have (flink SQL support). It's great to know that people around the globe are working on the same thing and heading in the same direction. Great find @Willy Lulciuc. Thanks for sharing!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-25 06:57:01

*Thread Reply:* At Kafka Summit I talked with Timo Walther from the Flink SQL team and he proposed an alternative approach.

Flink SQL has stable (across releases) CompiledPlan JSON text representation that could be parsed, and has all the necessary info - as this is used for serializing actual execution plan both ways.

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-29 19:43:38

*Thread Reply:* As Flink SQL is converted to transformations before execution, technically speaking our existing solution is already able to create lineage info for Flink SQL apps (not including column lineage and table schemas, which can be inferred within the flink table environment). I will create a Flink SQL job for e2e testing purposes.

👍 Maciej Obuchowski
Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-29 19:45:49

*Thread Reply:* I am also working on Flink side for table lineage. Hopefully, new lineage features can be released in flink 1.20.

Michael Robinson (michael.robinson@astronomer.io)
2024-03-25 09:58:57

Sessions for this year's Data+AI Summit have been published. A search didn't turn up anything related to lineage, but did you know Julien and Willy's talk at last year's summit has received 4k+ views? 👀

databricks.com
YouTube
} Databricks (https://www.youtube.com/@Databricks)
Harel Shein (harel.shein@gmail.com)
2024-03-25 10:16:40

*Thread Reply:* seems like our talk was not accepted, but I can see 9 sessions on unity catalog 😕

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-26 05:59:45
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-26 05:59:55

finally merged 🙂

🎉 Harel Shein, Michael Robinson
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-26 06:00:18

pawel-big-lebowski commented on Nov 21, 2023 whoa

Harel Shein (harel.shein@gmail.com)
2024-03-26 07:27:22

I’ll miss the sync today (on the way to data council)

🔥 Paweł Leszczyński, Maciej Obuchowski, Michael Robinson
Julien Le Dem (julien@apache.org)
2024-03-26 12:06:44

*Thread Reply:* Same

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-26 13:59:03

*Thread Reply:* have fun at the conference!

❤️ Harel Shein
Damien Hawes (damien.hawes@booking.com)
2024-03-26 13:23:08

OK @Maciej Obuchowski - 1 job has many stages; 1 stage has many tasks. Transitively, this means that 1 job has many tasks.
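
As a data-structure sketch of that hierarchy (hypothetical dataclasses, not Spark's own classes):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    task_id: int


@dataclass
class Stage:
    stage_id: int
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    job_id: int
    stages: List[Stage] = field(default_factory=list)

    def all_tasks(self) -> List[Task]:
        """Flatten job -> stages -> tasks: the transitive 'job has many tasks' relation."""
        return [task for stage in self.stages for task in stage.tasks]
```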

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-26 13:58:44

*Thread Reply:* batch or streaming one? 🙂

Damien Hawes (damien.hawes@booking.com)
2024-03-26 14:01:36

*Thread Reply:* Doesn't matter. It's the same concept.

Damien Hawes (damien.hawes@booking.com)
2024-03-26 13:27:58

Also @Paweł Leszczyński, it seems Spark metrics has this:

local-1711474020860.driver.LiveListenerBus.listenerProcessingTime.io.openlineage.spark.agent.OpenLineageSparkListener
count = 12
mean rate = 1.19 calls/second
1-minute rate = 1.03 calls/second
5-minute rate = 1.01 calls/second
15-minute rate = 1.00 calls/second
min = 0.00 milliseconds
max = 1985.48 milliseconds
mean = 226.81 milliseconds
stddev = 549.12 milliseconds
median = 4.93 milliseconds
75% <= 53.64 milliseconds
95% <= 1985.48 milliseconds
98% <= 1985.48 milliseconds
99% <= 1985.48 milliseconds
99.9% <= 1985.48 milliseconds
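
If someone wants to compare these numbers across runs, a timer dump like this could be parsed with something along these lines. The regex and `parse_timer` are assumptions about this one-metric-per-line text layout, not a Spark API, and `SAMPLE` repeats only a few of the values from the dump.

```python
import re
from typing import Dict

# Matches "<name> = <number>" and "<percentile>% <= <number>" lines; units are ignored.
METRIC_LINE = re.compile(
    r"^\s*(?P<name>[\w .%-]+?)\s*(?:<=|=)\s*(?P<value>\d+(?:\.\d+)?)"
)


def parse_timer(dump: str) -> Dict[str, float]:
    """Turn a one-metric-per-line timer dump into a name -> value mapping."""
    metrics: Dict[str, float] = {}
    for line in dump.splitlines():
        match = METRIC_LINE.match(line)
        if match:  # lines without "=" (such as the metric name header) are skipped
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


SAMPLE = """\
count = 12
mean = 226.81 milliseconds
median = 4.93 milliseconds
max = 1985.48 milliseconds
95% <= 1985.48 milliseconds
"""
```

In this dump the mean (226.81 ms) sits far above the median (4.93 ms), with a max of 1985.48 ms over only 12 calls.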

Michael Robinson (michael.robinson@astronomer.io)
2024-03-27 09:23:49

Do you think Bipan's team could potentially benefit significantly from upgrading to the latest version of openlineage-spark? https://openlineage.slack.com/archives/C01CK9T7HKR/p1711483070147019

} Bipan Sihra (https://openlineage.slack.com/team/U06RFHBSTHR)
Michael Robinson (michael.robinson@astronomer.io)
2024-03-27 09:55:04

*Thread Reply:* @Paweł Leszczyński wdyt?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-03-27 10:00:01

*Thread Reply:* I think the issue here is that marquez is not able to properly visualize parent run events that Maciej has added recently for a Spark application

Michael Robinson (michael.robinson@astronomer.io)
2024-03-27 10:03:22

*Thread Reply:* So if they downgraded would they have a graph closer to what they want?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-27 11:23:31

*Thread Reply:* I don't see parent run events there?

Michael Robinson (michael.robinson@astronomer.io)
2024-03-27 09:54:54

I'm exploring ways to improve the demo gif in the Marquez README. An improved and up-to-date demo gif could also be used elsewhere -- in the Marquez landing pages, for example, and the OL docs. Along with other improvements to the landing pages, I created a new gif that's up to date and higher-resolution, but it's large (~20 MB).
• We could put it on YouTube and link to it, but that would downgrade the user experience in other ways.
• We could host it somewhere else, but that would mean adding another tool to the stack and, depending on file size limits, could cost money. (I can't imagine it would cost but I haven't really looked into this option yet. Regardless of cost, it seems to have the same drawbacks as YT from a UX perspective.)
• We could have GitHub host it in another repo (for free) in the Marquez or OL orgs.
  ◦ It could go in the OL Docs because it's likely we'll want to use it in the docs anyway, but even if we never serve it, wouldn't this create issues for local development at a minimum? I opened a PR to do this, which a PR with other improvements is waiting on, but I'm not sure about this approach.
  ◦ It could go in the unused Marquez website repo, but there's a good chance we'll forget it's there and remove or archive the repo without moving it first.
  ◦ In another repo, or even a new one for stuff like this?
Anyone have an opinion or know of a better option?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-28 10:52:31

*Thread Reply:* maybe make it a HTML5 video?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-28 10:55:32

*Thread Reply:* https://wp-rocket.me/blog/replacing-animated-gifs-with-html5-video-for-faster-page-speed/

WP Rocket
Written by
Raelene Morey
Est. reading time
9 minutes
Michael Robinson (michael.robinson@astronomer.io)
2024-03-29 10:51:18

*Thread Reply:* 👀

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-28 10:52:04

@Julien Le Dem @Harel Shein how did Data Council panel and talk go?

Harel Shein (harel.shein@gmail.com)
2024-03-28 10:53:30

*Thread Reply:* Was just composing the message below :)

Harel Shein (harel.shein@gmail.com)
2024-03-28 10:53:05

Some great discussions here at data council, the panel was really great and we can definitely feel energy around OpenLineage continuing to build up! 🚀 Thanks @Julien Le Dem for organizing and shoutout to @Ernie Ostic @Sheeri Cabral (Collibra) @Eric Veleker for taking the time and coming down here and keeping pushing more and building the community! ❤️

🏄‍♂️ Michael Robinson, Maciej Obuchowski
👍 Ernie Ostic
🎉 tati
Michael Robinson (michael.robinson@astronomer.io)
2024-03-29 11:08:13

*Thread Reply:* @Harel Shein did anyone take pictures?

Harel Shein (harel.shein@gmail.com)
2024-03-29 11:10:54

*Thread Reply:* there should be plenty of pictures from the conference organizers, we'll ask for some

🙌 Michael Robinson
Michael Robinson (michael.robinson@astronomer.io)
2024-03-29 11:15:03

*Thread Reply:* Did a search and didn't see anything

Harel Shein (harel.shein@gmail.com)
2024-03-29 11:16:59

*Thread Reply:* here's one

Julien Le Dem (julien@apache.org)
2024-03-29 11:17:07

*Thread Reply:* Speaker dinner the night before: https://www.linkedin.com/posts/datacouncil-aidatacouncil-ugcPost-7178852429705224193-De46?utmsource=share&utmmedium=memberios

linkedin.com
Julien Le Dem (julien@apache.org)
2024-03-29 11:17:19

*Thread Reply:* Ahah. Same picture

Harel Shein (harel.shein@gmail.com)
2024-03-29 11:17:51

*Thread Reply:* haha. Julien and Ernie look great while I'm explaining how to land an airplane 🛬

😊 Ernie Ostic
Michael Robinson (michael.robinson@astronomer.io)
2024-03-29 11:44:29

*Thread Reply:* Great pic!

Julien Le Dem (julien@apache.org)
2024-03-30 20:40:06

*Thread Reply:* The photo gallery is there

Pixieset
Julien Le Dem (julien@apache.org)
2024-03-30 20:47:50
Michael Robinson (michael.robinson@astronomer.io)
2024-04-01 09:01:34

*Thread Reply:* awesome! just in time for the newsletter 🙂

Eric Veleker (eric@atlan.com)
2024-04-05 22:29:56

*Thread Reply:* Thank you for thinking of us. Onwards and upwards.

Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-28 15:53:09

I just found that the naming conventions for hive/iceberg/hudi are not listed in the doc https://openlineage.io/docs/spec/naming/. Shall we further standardize them? Any suggestions?

openlineage.io
👍 Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-03-28 16:22:22

*Thread Reply:* Yes. This also came up in a conversation with one of the maintainers of dbt-core, we can also pick up on a proposal to extend the naming conventions markdown to something a bit more scalable.

Harel Shein (harel.shein@gmail.com)
2024-03-28 16:23:29

*Thread Reply:* What do you think about this proposal? https://github.com/OpenLineage/OpenLineage/pull/1702

Labels
documentation, proposal
Comments
3
Peter Huang (huangzhenqiu0825@gmail.com)
2024-03-28 16:52:33

*Thread Reply:* Thanks for sharing the info. Will take a deeper look later today.

👍 Harel Shein
Mariusz Górski (gorskimariusz13@gmail.com)
2024-03-29 02:14:19

*Thread Reply:* I think this is a similar topic to resource naming in ODD, might be worth taking a look for inspiration: https://github.com/opendatadiscovery/oddrn-generator

Stars
4
Language
Python
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 05:42:19

*Thread Reply:* the thing is we need to have language-agnostic way of defining those naming conventions and be able to generate code for them, similar to facets spec

👍 Mariusz Górski
Mariusz Górski (gorskimariusz13@gmail.com)
2024-03-29 08:10:22

*Thread Reply:* could be also an idea to have micro rest api embedded in each client, so managing naming convention would be stored there and each client (python/java) could run it as a subprocess 🤔

Harel Shein (harel.shein@gmail.com)
2024-04-01 12:44:15

*Thread Reply:* we can also just write it in Rust, @Maciej Obuchowski 😁

👍 Mariusz Górski
😅 Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-04-01 13:11:01

*Thread Reply:* no real changes/additions, but starting to organize the doc for now: https://github.com/OpenLineage/OpenLineage/pull/2554

Harel Shein (harel.shein@gmail.com)
2024-03-29 11:14:25

@Maciej Obuchowski we also heard some good things about the sqlglot parser. have you looked at it recently?

Website
<https://sqlglot.com/>
Stars
5285
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-29 12:59:29

*Thread Reply:* I love the fact that our parser is in type safe language :)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 14:14:46

*Thread Reply:* does it matter after all when it comes to parsing SQL? it might be worth to run some comparisons but it may turn out that sqlglot misses most of Snowflake dialect that we currently support

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-03-29 16:56:04

*Thread Reply:* We'd miss on Java side parsing as well

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-03-29 16:57:37

*Thread Reply:* very importantly this ^

Harel Shein (harel.shein@gmail.com)
2024-03-29 17:23:36

*Thread Reply:* That’s important. Yes

Michael Robinson (michael.robinson@astronomer.io)
2024-04-01 10:05:24

OpenLineage 1.11.0 release vote is now open: https://openlineage.slack.com/archives/C01CK9T7HKR/p1711980285409389

Julien Le Dem (julien@apache.org)
2024-04-02 11:29:23

Sorry, I’ll be late to the sync

👍 Maciej Obuchowski
Harel Shein (harel.shein@gmail.com)
2024-04-02 12:56:31

forgot to mention, but we have the TSC meeting coming up next week. we should start sourcing topics

👍 Michael Robinson, Maciej Obuchowski
Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 15:58:23

*Thread Reply:*
• 1.10 and 1.11 releases
• Data Council, Kafka Summit, & Boston meetup shout outs and quick recaps
• Datadog poc update or demo?

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 16:04:39

*Thread Reply:* Discussion item about Trino integration next steps?

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 16:06:16

*Thread Reply:* Accenture+Confluent roundtable reminder for sure

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 16:24:17

*Thread Reply:* job to job dependencies discussion item? https://openlineage.slack.com/archives/C065PQ4TL8K/p1712153842519719

➕ Harel Shein
Harel Shein (harel.shein@gmail.com)
2024-04-03 16:43:56

*Thread Reply:* I think it's too early for Datadog update tbh, but I like the job to job discussion. We can also bring up the naming library discussion that we talked about yesterday

Harel Shein (harel.shein@gmail.com)
2024-04-02 21:19:03

one more thing, if we want we could also apply for a free Datadog account for OpenLineage and Marquez: https://www.datadoghq.com/partner/open-source/

Datadog
👀 Maciej Obuchowski
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 05:27:02

*Thread Reply:* would be nice for tests

Julian LaNeve (lanevejulian@gmail.com)
2024-04-03 10:17:22

is there any notion of process dependencies in openlineage? i.e. if I have two airflow tasks that depend on each other, with no dataset in between, can I express that in the openlineage spec?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-03 11:39:25

*Thread Reply:* AFAIK no, it doesn't aim to reflect that cc @Julien Le Dem

Julien Le Dem (julien@apache.org)
2024-04-03 11:42:55

*Thread Reply:* It is not in the core spec but this could be represented as a job facet. It is probably in the airflow facet right now but we could add a more generic job dependency facet

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:52:36

*Thread Reply:* we do represent hierarchy though - with ParentRunFacet

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:57:16

*Thread Reply:* if we were to add some dependency facet, what would we want to model?

  1. we want to note the dependency between jobs, not between particular runs, so
     a. we are in job X and want to note that job Y will run after it ends
     b. we are in job Y and want to note that it ran because it depended on successful run of job X
  2. we want also to note the dependency between particular runs:
     a. we are in run x of job X, and want to note that run y of job Y will happen after it ends
     b. we are in run y of job Y, and want to note that it depended (as in - ran because the preceding job(s) finished successfully) on run x of job X
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 11:59:47

*Thread Reply:* do we also want to model something like Airflow's trigger rules? https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules

Harel Shein (harel.shein@gmail.com)
2024-04-03 12:46:01

*Thread Reply:* I don't think this is about hierarchy though, right? If I understand @Julian LaNeve correctly, I think it's more #2

Julian LaNeve (lanevejulian@gmail.com)
2024-04-03 12:48:25

*Thread Reply:* yeah it's less about hierarchy - definitely more about #2.

assume we have a DAG that looks like this: Task A -> Task B -> Task C
today, OL can capture the full set of dependencies if we do: A -> (dataset 1) -> B -> (ds 2) -> C
but it's not always the case that you have datasets between everything. my question was moreso around "how can I use OL to capture the relationship between jobs if there are no datasets in between"

Julien Le Dem (julien@apache.org)
2024-04-03 12:52:32

*Thread Reply:* I had opened an issue to track this a while ago but we did not get too far in the discussion: https://github.com/OpenLineage/OpenLineage/issues/552

Labels
enhancement
Comments
2
Julian LaNeve (lanevejulian@gmail.com)
2024-04-03 12:53:19

*Thread Reply:* oh nice - unsurprisingly you were 2 years ahead of me 😆

😅 Julien Le Dem
Julien Le Dem (julien@apache.org)
2024-04-03 12:53:57

*Thread Reply:* You can track the dependency both at the job level and at the run level. At the job level you would do something along the lines of: job: { facets: { job_dependencies: { predecessors: [ { namespace: , name: }, ... ], successors: [ { namespace: , name: }, ... ] } }}

Julien Le Dem (julien@apache.org)
2024-04-03 12:56:57

*Thread Reply:* At the run level you could track the actual task run dependencies: run: { facets: { run_dependencies: { predecessor: [ "{run uuid}", ...], successors: [...], } }}

Julien Le Dem (julien@apache.org)
2024-04-03 13:00:42

*Thread Reply:* I think the current airflow run facet contains that information in an airflow specific representation: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/plugins/facets.py

👍 Maciej Obuchowski
Julien Le Dem (julien@apache.org)
2024-04-03 13:02:16

*Thread Reply:* I think we should have the discussion in the ticket so that it does not get lost in the slack history

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-03 13:21:13

*Thread Reply:* run: { facets: { run_dependencies: { predecessor: [ "{run uuid}", ...], successors: [...], } }} I like this format, but would have full run/job identifier as ParentRunFacet
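For readers following the facet discussion, here is a minimal sketch of what a run-level dependency facet could look like if each entry carried the full run/job identifier, as suggested above. It assumes the entry shape mirrors ParentRunFacet (`run.runId` plus `job.namespace` and `job.name`). The `run_dependencies` facet, the class `RunDependenciesSketch`, and the sample identifiers are all hypothetical; nothing here is part of the OpenLineage spec or clients.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: "run_dependencies" is NOT an existing OpenLineage facet.
class RunDependenciesSketch {

  // One entry: a run plus the job it belongs to, shaped like ParentRunFacet.
  static Map<String, Object> runRef(String runId, String jobNamespace, String jobName) {
    Map<String, Object> run = new LinkedHashMap<>();
    run.put("runId", runId);
    Map<String, Object> job = new LinkedHashMap<>();
    job.put("namespace", jobNamespace);
    job.put("name", jobName);
    Map<String, Object> ref = new LinkedHashMap<>();
    ref.put("run", run);
    ref.put("job", job);
    return ref;
  }

  // The facet body: runs this run waited on, and runs expected to follow it.
  static Map<String, Object> facet(
      List<Map<String, Object>> predecessors, List<Map<String, Object>> successors) {
    Map<String, Object> facet = new LinkedHashMap<>();
    facet.put("predecessors", new ArrayList<>(predecessors));
    facet.put("successors", new ArrayList<>(successors));
    return facet;
  }

  // Minimal JSON writer for nested maps, lists and strings (enough for this sketch).
  static String toJson(Object value) {
    if (value instanceof Map) {
      StringBuilder sb = new StringBuilder("{");
      String sep = "";
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        sb.append(sep).append(toJson(String.valueOf(e.getKey())))
            .append(":").append(toJson(e.getValue()));
        sep = ",";
      }
      return sb.append("}").toString();
    }
    if (value instanceof List) {
      StringBuilder sb = new StringBuilder("[");
      String sep = "";
      for (Object item : (List<?>) value) {
        sb.append(sep).append(toJson(item));
        sep = ",";
      }
      return sb.append("]").toString();
    }
    return "\"" + String.valueOf(value).replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    // Made-up identifiers: run y of task B depended on run x of task A.
    Map<String, Object> facet = facet(
        Collections.singletonList(runRef("run-x", "airflow", "my_dag.task_a")),
        Collections.emptyList());
    System.out.println(toJson(facet));
  }
}
```

Carrying the job identifier next to the run id means a consumer can draw the job-to-job edge without first having seen the predecessor's own events, which is the same reason ParentRunFacet carries both.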

Julien Le Dem (julien@apache.org)
2024-04-03 13:23:05

*Thread Reply:* For the trigger rules I wonder if this is too specific to airflow.

Julien Le Dem (julien@apache.org)
2024-04-03 13:23:27

*Thread Reply:* But if there’s a generic way to capture this, it makes sense

Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 11:13:23

Don't forget to register for this! https://events.confluent.io/roundtable-data-lineage/Accenture

events.confluent.io
👀 Maciej Obuchowski
👍 Harel Shein, Peter Huang
Michael Robinson (michael.robinson@astronomer.io)
2024-04-03 17:24:31

This attempt at a SQLAlchemy integration was basically working, if not perfectly, the last time I played with it: https://github.com/OpenLineage/OpenLineage/pull/2088. What more do I need to do to get it to the point where it can be merged as an "experimental"/"we warned you" integration? I mean, other than make sure it's still working and clean it up? 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 06:46:06

https://docs.getdbt.com/docs/collaborate/column-level-lineage#sql-parsing

docs.getdbt.com
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 06:48:15

*Thread Reply:* seems like it’s only for dbt cloud

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:20:56

*Thread Reply:* > Column-level lineage relies on SQL parsing.
Was thinking about doing the same thing at some point

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:21:12

*Thread Reply:* Basically with dbt we know schemas, so we also can resolve wildcards as well

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:22:01

*Thread Reply:* but that requires adding capability for providing known schema into sqlparser

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:23:15

*Thread Reply:* that's not very hard to add afaik 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:23:27

*Thread Reply:* not exactly into sqlparser too

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:23:32

*Thread Reply:* just our parser

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:23:46

*Thread Reply:* yeah, our parser

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:23:55

*Thread Reply:* still someone has to add it :D

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:24:04

*Thread Reply:* some rust enthusiast probably

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:24:14

*Thread Reply:* 👀

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:27:56

*Thread Reply:* but also: dbt provides schema info only if you generate catalog.json with generate docs command

Harel Shein (harel.shein@gmail.com)
2024-04-04 07:36:13

*Thread Reply:* Right now we have the dbt-ol wrapper anyway, so we can make another dbt docs command on behalf of the user too

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:39:17

*Thread Reply:* not sure if running commands on behalf of user is good idea, but denoting in docs that running it increases accuracy of column-level lineage is probably a good idea

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:39:22

*Thread Reply:* once we build it

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 07:39:24

*Thread Reply:* of course

Harel Shein (harel.shein@gmail.com)
2024-04-04 07:42:41

*Thread Reply:* That depends, what are the side effects of running dbt docs?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:49:19

*Thread Reply:* the other option is similar to dagster's approach - run post-hook macro that prints schema to logs and read the logs with dbt-ol wrapper

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 07:49:40

*Thread Reply:* which again won't work in dbt cloud - there catalog.json seems like the only option

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-04 08:06:26

*Thread Reply:* > That depends, what are the side effects of running dbt docs?
refreshing someone's documentation? 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 09:51:44

*Thread Reply:* it would be configurable imho, if someone doesn’t want column level lineage at the price of an additional step, it’s their choice

Harel Shein (harel.shein@gmail.com)
2024-04-04 11:57:44

*Thread Reply:* yup, agreed. I'm sure we can also run dbt docs to a temp directory that we'll delete right after

Michael Robinson (michael.robinson@astronomer.io)
2024-04-04 14:29:06

Some encouraging stats from Sonatype: these are Spark integration downloads (unique IPs) over the last 12 months

Michael Robinson (michael.robinson@astronomer.io)
2024-04-04 14:30:45

*Thread Reply:* That's an increase of 17560.5%

🎉 Harel Shein, Jakub Dardziński
Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-04 18:43:27

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/releases/tag/1.11.3 that’s a lot of notes 😮

Michael Robinson (michael.robinson@astronomer.io)
2024-04-04 15:15:42

Marquez committers: there's a committer vote open 👀

Harel Shein (harel.shein@gmail.com)
2024-04-04 15:22:19

did anyone submit a CFP here? https://sessionize.com/open-source-summit-europe-2024/ it's a linux foundation conference too

sessionize.com
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 10:57:08

*Thread Reply:* looks like a nice conference

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:29:43

*Thread Reply:* too far for me, but might be a train ride for you?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:37:19

*Thread Reply:* yeah, I might submit something 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:37:45

*Thread Reply:* and I think there are actually direct trains to Vienna from Warsaw

Damien Hawes (damien.hawes@booking.com)
2024-04-05 04:53:56

Hmm @Maciej Obuchowski @Paweł Leszczyński - I see we released 1.11.3, but I don't see the artifacts in central. Are the artifacts blocked?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-05 04:54:42

*Thread Reply:* after last release, it took me some 24h to see openlineage-flink artifact published

Damien Hawes (damien.hawes@booking.com)
2024-04-05 04:56:02

*Thread Reply:* I recall something about the artifacts had to be manually published from the staging area.

Damien Hawes (damien.hawes@booking.com)
2024-04-05 05:11:40

*Thread Reply:* @Maciej Obuchowski - can you check if the release is stuck in staging?

Damien Hawes (damien.hawes@booking.com)
2024-04-05 05:11:53

*Thread Reply:* I recall last time it failed because there wasn't a javadoc associated with it

Damien Hawes (damien.hawes@booking.com)
2024-04-05 05:23:25

*Thread Reply:* Nevermind @Paweł Leszczyński @Maciej Obuchowski - it seems like the search indexes haven't been updated.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 06:01:51

*Thread Reply:* @Michael Robinson has to manually promote them but it's not instantaneous I believe

👍 Michael Robinson
Harel Shein (harel.shein@gmail.com)
2024-04-05 09:23:51

I'm seeing some really strange behavior with OL Spark, I'm going to give some data to help out, but these are still breadcrumbs unfortunately. 🧵

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:25:06

*Thread Reply:* the driver for this job is running for more than 5 hours, but the job actually finished after 20 minutes

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:25:08

*Thread Reply:*

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:25:56

*Thread Reply:* most of the cpu time in those 5 hours is spent in openlineage methods

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:25:59

*Thread Reply:*

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:26:36

*Thread Reply:* it's also not reproducible 😕

Harel Shein (harel.shein@gmail.com)
2024-04-05 09:26:46

*Thread Reply:* but happens "sometimes"

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 09:55:37

*Thread Reply:* DatasetIdentifier.equals?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 09:55:58

*Thread Reply:* can you check what calls it?

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:10:46

*Thread Reply:* unfortunately, some of the stack frames are truncated by JVM

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:14:08

*Thread Reply:*

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:17:03

*Thread Reply:* top methods by time:

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:18:27

*Thread Reply:* maybe this has something to do with SymLink and the lombok implementation of .equals() ?

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:19:31

*Thread Reply:* and then some sort of circular dependency

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:19:46

*Thread Reply:* but is this a JDBC job?

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:20:01

*Thread Reply:* let me see

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:20:08

*Thread Reply:* I don't think so

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:23:34

*Thread Reply:* it's not

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:23:52

*Thread Reply:* ok, we don't use lang3 Pair a lot - it has to be in ColumnLevelLineageBuilder 🙂

Harel Shein (harel.shein@gmail.com)
2024-04-05 11:30:08

*Thread Reply:* yes.. I'm staring at that class for a while now

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:33:43

*Thread Reply:* what's the rough size of the logical plan of the job?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:33:59

*Thread Reply:* I'm trying to understand whether we're looking at some infinite loop

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:34:06

*Thread Reply:* or just something done very inefficiently

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:35:07

*Thread Reply:* like every input being added in this manner:
```
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  inputs.computeIfAbsent(exprId, k -> new LinkedList<>());

  Pair<DatasetIdentifier, String> input = Pair.of(datasetIdentifier, attributeName);

  if (!inputs.get(exprId).contains(input)) {
    inputs.get(exprId).add(input);
  }
}
```
it's a candidate: it has to traverse the list returned from `inputs` for every CLL dependency field added

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:35:59

*Thread Reply:* it looks like we're building size N list in N^2 time: inputs.stream() .filter(i -> i instanceof InputDatasetFieldWithIdentifier) .map(i -> (InputDatasetFieldWithIdentifier) i) .forEach( i -> context .getBuilder() .addInput( ExprId.apply(i.exprId().exprId()), new DatasetIdentifier( i.datasetIdentifier().getNamespace(), i.datasetIdentifier().getName()), i.field())); 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:39:42

*Thread Reply:* ah, this isn't even used now since it's for new extension-based spark collection

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:39:58

*Thread Reply:* @Paweł Leszczyński this is most likely a future bug ⬆️

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:42:13

*Thread Reply:* I think we're still doing it now anyway:
```
private static void extractInternalInputs(
    LogicalPlan node,
    ColumnLevelLineageBuilder builder,
    List<DatasetIdentifier> datasetIdentifiers) {

  datasetIdentifiers.stream()
      .forEach(
          di -> {
            ScalaConversionUtils.fromSeq(node.output()).stream()
                .filter(attr -> attr instanceof AttributeReference)
                .map(attr -> (AttributeReference) attr)
                .collect(Collectors.toList())
                .forEach(attr -> builder.addInput(attr.exprId(), di, attr.name()));
          });
}
```

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 11:53:44

*Thread Reply:* and that's linked list - must be pretty slow jumping all those pointers
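To illustrate the cost being discussed: the `contains` check on a `LinkedList` scans the whole list on every insert, so adding N distinct inputs for one key takes on the order of N^2 pair comparisons. A hash-based set that preserves insertion order avoids that scan. The sketch below is only an illustration under stated assumptions, not the fix that was applied in OpenLineage: `DedupedInputs` is a made-up class, a `long` stands in for `ExprId`, and a `Map.Entry<String, String>` stands in for `Pair<DatasetIdentifier, String>`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the inputs bookkeeping quoted above.
class DedupedInputs {

  // LinkedHashSet keeps insertion order like a list, but add() deduplicates
  // via hashCode/equals instead of scanning every existing element.
  private final Map<Long, Set<Map.Entry<String, String>>> inputs = new HashMap<>();

  // Returns true if the (dataset, attribute) pair was new for this expression id.
  boolean addInput(long exprId, String datasetName, String attributeName) {
    return inputs
        .computeIfAbsent(exprId, k -> new LinkedHashSet<>())
        .add(new SimpleImmutableEntry<>(datasetName, attributeName));
  }

  // Number of distinct inputs recorded for an expression id.
  int size(long exprId) {
    return inputs.getOrDefault(exprId, Collections.emptySet()).size();
  }

  public static void main(String[] args) {
    DedupedInputs builder = new DedupedInputs();
    builder.addInput(1L, "db.table", "col_a");
    builder.addInput(1L, "db.table", "col_a"); // duplicate, ignored
    builder.addInput(1L, "db.table", "col_b");
    System.out.println(builder.size(1L)); // prints 2
  }
}
```

This only helps if the element type has a cheap, consistent `hashCode` and `equals`; if equality itself is expensive, the set reduces the number of comparisons but not the cost of each one.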

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:01:48
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:12:22

*Thread Reply:* There are some more funny places in CLL code, like we're iterating over list of schema fields and calling some function with name of that field : schema.getFields().stream() .map(field -> Pair.of(field, getInputsUsedFor(field.getName()))) then immediately iterate over it second time to get the field back from its name: List<Pair<DatasetIdentifier, String>> getInputsUsedFor(String outputName) { Optional<OpenLineage.SchemaDatasetFacetFields> outputField = schema.getFields().stream() .filter(field -> field.getName().equalsIgnoreCase(outputName)) .findAny();
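One way to avoid rescanning the field list for every lookup is to index the fields by name once. The sketch below shows the idea with plain strings in place of `OpenLineage.SchemaDatasetFacetFields`; `FieldIndex` is a hypothetical helper, and lower-casing with `Locale.ROOT` is only an approximation of `equalsIgnoreCase` (the two can differ for a few unusual Unicode characters).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: builds a case-insensitive name index in one pass.
class FieldIndex {

  private final Map<String, String> byLowerName = new HashMap<>();

  FieldIndex(List<String> fieldNames) {
    for (String name : fieldNames) {
      // putIfAbsent keeps the first field when two names differ only by case.
      byLowerName.putIfAbsent(name.toLowerCase(Locale.ROOT), name);
    }
  }

  // Constant-time replacement for stream().filter(equalsIgnoreCase).findAny().
  Optional<String> find(String outputName) {
    return Optional.ofNullable(byLowerName.get(outputName.toLowerCase(Locale.ROOT)));
  }

  public static void main(String[] args) {
    FieldIndex index = new FieldIndex(Arrays.asList("Id", "customer_name"));
    System.out.println(index.find("ID").isPresent());      // prints true
    System.out.println(index.find("missing").isPresent()); // prints false
  }
}
```

In the quoted code the caller already holds the field object, so passing it directly instead of its name would remove the lookup entirely; the index is useful only where a name really is all that is available.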

Harel Shein (harel.shein@gmail.com)
2024-04-05 12:51:15

*Thread Reply:* I think the time spent by the driver (5 hours) just on these methods smells like an infinite loop?

Harel Shein (harel.shein@gmail.com)
2024-04-05 12:51:59

*Thread Reply:* like, as inefficient as it may be, this is a lot of time

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:52:20

*Thread Reply:* did it finish eventually?

Harel Shein (harel.shein@gmail.com)
2024-04-05 12:52:51

*Thread Reply:* yes... but.. I wonder if something killed it somewhere?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:52:58

*Thread Reply:* I mean, it can be something like 10000^3 loop 🙂

Harel Shein (harel.shein@gmail.com)
2024-04-05 12:53:01

*Thread Reply:* I couldn't find anything in the logs to indicate

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:53:10

*Thread Reply:* and it has to do those pair comparisons

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:54:12

*Thread Reply:* would be easier if we could see the general size of a plan of this job - if it's something really small then I'm probably wrong

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:54:37

*Thread Reply:* but if there are 1000s of columns... anything can happen 🙂

Harel Shein (harel.shein@gmail.com)
2024-04-05 12:55:09

*Thread Reply:* yeah.. trying to find out. I don't have that facet enabled there, and I can't find the ol events in the logs (it's writing to console, and I think they got dropped)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 12:58:32

*Thread Reply:* DevNullTransport 🙂

😅 Harel Shein, Jakub Dardziński
Harel Shein (harel.shein@gmail.com)
2024-04-05 13:08:09

*Thread Reply:* generally speaking, we have a similar problem here like we had with Airflow integration

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:08:35

*Thread Reply:* we are not holding up the job per se, but... we are holding up the spark application

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:09:07

*Thread Reply:* do we have a way to be defensive about that somehow, shutdown hook from spark to our thread or something

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:10:26

*Thread Reply:* there's no magic

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:10:44

*Thread Reply:* circuit breaker with timeout does not work?

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:12:03

*Thread Reply:* it would, but we don't turn that on by default

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:12:18

*Thread Reply:* also, if we do, what should be our default values?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:14:08

*Thread Reply:* what would not hurt you if you enabled it, 30 seconds?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:14:23

*Thread Reply:* I guess we should aim much lower with the runtime

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:21:41

*Thread Reply:* yeah, and make sure we emit metrics / logs when that happens

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:27:31

*Thread Reply:* wait, our circuit breaker right now only supports cpu & memory

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:27:39

*Thread Reply:* we would need to add a timeout one, right?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:38:27

*Thread Reply:* ah, yes
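A timeout-style guard of the kind discussed here can be sketched with a plain `ExecutorService`: run the lineage work on a separate daemon thread and give up waiting after a deadline. This is a generic illustration, not the OpenLineage circuit breaker API; the class `TimeoutGuard`, its method names, and the thread name are made up for the example.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a timeout guard around slow work.
class TimeoutGuard {

  // Daemon threads: a task that ignores interruption cannot keep the JVM alive.
  private final ExecutorService executor =
      Executors.newCachedThreadPool(
          runnable -> {
            Thread thread = new Thread(runnable, "timeout-guard");
            thread.setDaemon(true);
            return thread;
          });

  // Returns the task's result, or empty if it failed or exceeded the timeout.
  <T> Optional<T> call(Callable<T> task, long timeout, TimeUnit unit) {
    Future<T> future = executor.submit(task);
    try {
      return Optional.ofNullable(future.get(timeout, unit));
    } catch (TimeoutException e) {
      // Interrupts the worker; CPU-bound code stops only if it checks interrupts.
      future.cancel(true);
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    TimeoutGuard guard = new TimeoutGuard();
    Optional<String> fast = guard.call(() -> "ok", 1, TimeUnit.SECONDS);
    Optional<String> slow =
        guard.call(() -> { Thread.sleep(2000); return "late"; }, 50, TimeUnit.MILLISECONDS);
    System.out.println(fast.isPresent()); // prints true
    System.out.println(slow.isPresent()); // prints false
  }
}
```

A real implementation would also log or emit a metric on the timeout branch, as suggested above. Note that `cancel(true)` cannot stop a tight loop that never checks the interrupt flag, so the daemon flag is what keeps such a thread from holding the application open.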

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:38:41
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 13:41:32

*Thread Reply:* and BTW, no abnormal CPU or memory usage?

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:44:13

*Thread Reply:* nope, not at all

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:51:02

*Thread Reply:* green line is when spark job actually finishes, but the graph is the whole runtime of the driver

Harel Shein (harel.shein@gmail.com)
2024-04-05 13:51:33

*Thread Reply:*

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-05 14:16:19

*Thread Reply:* I mean, it's using 100% of one core 🙂

🙃 Harel Shein
Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-04-08 02:05:06

*Thread Reply:* it's similar to what aniruth experienced. there's something that for some type of logical plans causes recursion alike behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.

Harel Shein (harel.shein@gmail.com)
2024-04-08 10:09:17

*Thread Reply:* I'll try to get that for us

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 13:23:34

*Thread Reply:* > If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
if the event would not take 1GB 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 13:24:16

*Thread Reply:* > it's similar to what aniruth experienced. there's something that for some type of logical plans causes recursion alike behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
what about my thesis that something is just extremely slow in column-level lineage code?

Michael Robinson (michael.robinson@astronomer.io)
2024-04-05 16:25:37

Some adoption metrics from Sonatype and PyPI, visualized using Preset. In Preset, you can see the number for each month (but we're out of seats on the free tier there). The big number is the downloads for the last month (February in most cases).

🔥 Paweł Leszczyński, Maciej Obuchowski
Damien Hawes (damien.hawes@booking.com)
2024-04-08 09:51:27

Good news. @Paweł Leszczyński - the memory leak fixes worked. Our streaming pipelines have run through the weekend without a single OOM crash.

🎉 Harel Shein, Peter Huang, Jakub Dardziński, Maciej Obuchowski
Peter Huang (huangzhenqiu0825@gmail.com)
2024-04-08 10:23:11

*Thread Reply:* @Damien Hawes Would you please point me to the PR that fixes the issue?

Damien Hawes (damien.hawes@booking.com)
2024-04-08 10:25:14
Damien Hawes (damien.hawes@booking.com)
2024-04-08 10:31:15

*Thread Reply:* @Peter Huang ^

:gratitude_thank_you: Peter Huang
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 12:41:47

*Thread Reply:* @Damien Hawes any other feedback for OL with streaming pipelines you have so far?

Damien Hawes (damien.hawes@booking.com)
2024-04-08 12:42:20

*Thread Reply:* It generates a TON of data

Damien Hawes (damien.hawes@booking.com)
2024-04-08 12:44:19

*Thread Reply:* There are some optimisations that could be made:

  1. A lot of the facets can be cached, and don't need to be recreated every time. The connector (obviously) doesn't care about the size of the data that is being processed, rather it cares about how frequent the spark events are. Spark's micro-batching thing means that the job start -> stage submitted -> task started -> task ended -> stage complete -> job end cycle fires more frequently.
Damien Hawes (damien.hawes@booking.com)
2024-04-08 12:46:10

*Thread Reply:* This has an impact on any backend using it, as the run id keeps changing. This means the parent suddenly has thousands of jobs as children.

Damien Hawes (damien.hawes@booking.com)
2024-04-08 12:46:29

*Thread Reply:* Our biggest pipeline generates a new event cycle every 2 minutes.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 12:57:23

*Thread Reply:* "Too much data" is exactly what I thought 🙂 The obvious potential issue with caching is the same issue we just fixed... potential memory leaks, and cache invalidation
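One common way to get facet caching without reintroducing an unbounded-growth leak is a size-bounded LRU cache. The sketch below uses the JDK's `LinkedHashMap` in access order; `BoundedFacetCache` and its methods are hypothetical and not part of the Spark integration. It does not address invalidation: it assumes a cached value stays valid for as long as its key is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache that never holds more than maxEntries values.
class BoundedFacetCache<K, V> {

  private final Map<K, V> cache;

  BoundedFacetCache(final int maxEntries) {
    // accessOrder=true moves an entry to the end on every get(), so the
    // eldest entry is always the least recently used one.
    this.cache =
        new LinkedHashMap<K, V>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
          }
        };
  }

  // Returns the cached value, building it with the loader on a miss.
  synchronized V get(K key, Function<K, V> loader) {
    V value = cache.get(key);
    if (value == null) {
      value = loader.apply(key);
      cache.put(key, value);
    }
    return value;
  }

  synchronized int size() {
    return cache.size();
  }

  public static void main(String[] args) {
    BoundedFacetCache<String, String> facets = new BoundedFacetCache<>(2);
    facets.get("a", String::toUpperCase);
    facets.get("b", String::toUpperCase);
    facets.get("c", String::toUpperCase); // evicts "a", the least recently used
    System.out.println(facets.size()); // prints 2
  }
}
```

The bound trades some recomputation for a fixed memory ceiling, which matters for a streaming job that fires a new event cycle every couple of minutes for days.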

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 12:58:35

*Thread Reply:* > the run id keeps changing
In this case, that's a bug. We'd still need some wrapping event for whole streaming job though, probably other than application start

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 13:19:19

*Thread Reply:* on the other topic, did those problems stop? https://github.com/OpenLineage/OpenLineage/issues/2513 with https://github.com/OpenLineage/OpenLineage/pull/2535/files

Harel Shein (harel.shein@gmail.com)
2024-04-08 11:05:44

when talking about the naming scheme for datasets, would everyone here agree that we generally use: {scheme}://{authority}/{unique_name} ? where generally authority == namespace

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-08 11:08:20

*Thread Reply:* I think so, and if we don’t then we should

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-04-08 11:10:31

*Thread Reply:* ~which brings me to the question why construct dataset name as such~ nvm

Harel Shein (harel.shein@gmail.com)
2024-04-08 11:10:36

*Thread Reply:* please feel free to chime in here too https://github.com/dbt-labs/dbt-core/issues/8725

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-04-08 12:42:18

*Thread Reply:* > where generally authority == namespace
{scheme}://{authority} is namespace
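The split described here, where `{scheme}://{authority}` is the namespace and the remainder is the dataset name, can be sketched with `java.net.URI`. The class `DatasetNaming`, the sample URI, and the choice to strip the leading slash from the path are assumptions for illustration; individual data sources may define their names differently in the naming docs.

```java
import java.net.URI;

// Hypothetical helper splitting a dataset URI into OpenLineage-style parts.
class DatasetNaming {

  // Namespace: scheme plus authority, e.g. "postgres://db.example.com:5432".
  static String namespace(URI uri) {
    return uri.getScheme() + "://" + uri.getAuthority();
  }

  // Name: the path without its leading slash (an assumption of this sketch).
  static String name(URI uri) {
    String path = uri.getPath();
    return path.startsWith("/") ? path.substring(1) : path;
  }

  public static void main(String[] args) {
    URI uri = URI.create("postgres://db.example.com:5432/analytics.public.orders");
    System.out.println(namespace(uri)); // prints postgres://db.example.com:5432
    System.out.println(name(uri));      // prints analytics.public.orders
  }
}
```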

👍 Jakub Dardziński, Harel Shein
Harel Shein (harel.shein@gmail.com)
2024-04-08 14:13:01

*Thread Reply:* agreed