@Maciej Obuchowski has joined the channel
@Paweł Leszczyński has joined the channel
@Jakub Dardziński has joined the channel
@Michael Robinson has joined the channel
https://github.com/OpenLineage/OpenLineage/pull/2260 fun PR incoming
*Thread Reply:* hey look, more fun https://github.com/OpenLineage/OpenLineage/pull/2263
*Thread Reply:* nice to have fun with you Jakub
*Thread Reply:* Can't wait to see it on the 1st January.
*Thread Reply:* Ain’t no party like a dev ex improvement party
*Thread Reply:* Gentoo installation party is in similar category of fun
@Paweł Leszczyński approved PR #2661 with minor comments, I think the enum defined in the db layer is one comment we’ll need to address before merging; otherwise solid work dude 👌
_Minor_: We can consider defining a _run_state column and eventually dropping the event_type. That is, we can consider columns prefixed with _ to be "remappings" of OL properties to Marquez.
-> didn't get this one. Is it for now or some future plans?
*Thread Reply:* I will then replace enum with string
also, what about this PR? https://github.com/MarquezProject/marquez/pull/2654
*Thread Reply:* this is the next to go
*Thread Reply:* and i consider it ready
*Thread Reply:* Then we have a draft one with streaming support https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of lineage endpoint working for streaming jobs
*Thread Reply:* I still need to work on #2682 but you can review #2654. once you get some sleep, of course 😉
Got the doc + poc for hook-level coverage: https://docs.google.com/document/d/1q0shiUxopASO8glgMqjDn89xigJnGrQuBMbcRdolUdk/edit?usp=sharing
*Thread Reply:* did you check if `LineageCollector` is instantiated once per process?
*Thread Reply:* Using it only via `get_hook_lineage_collector`
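A process-wide singleton is usually just a lazily created module-level instance behind an accessor. A minimal sketch of that pattern (hypothetical implementation mirroring the `LineageCollector` / `get_hook_lineage_collector` names from the thread, not the actual Airflow code):

```python
class LineageCollector:
    """Collects hook-level lineage for the lifetime of the process."""

    def __init__(self):
        self.collected = []

    def add(self, item):
        self.collected.append(item)


_collector = None


def get_hook_lineage_collector():
    # Lazily create a single instance per process; every caller shares it.
    global _collector
    if _collector is None:
        _collector = LineageCollector()
    return _collector
```

Going through the accessor everywhere is what guarantees the once-per-process property being asked about.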
Anyone have thoughts about how to address the question about “pain points” here? https://openlineage.slack.com/archives/C01CK9T7HKR/p1700064564825909. (Listing pros is easy — it’s the cons we don’t have boilerplate for)
*Thread Reply:* Maybe something like “OL has many desirable integrations, including a best-in-class Spark integration, but it’s like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time.”
*Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"
Apparently an admin can view a Slack archive at any time at this URL: https://openlineage.slack.com/services/export. Only public channels are available, though.
*Thread Reply:* you are now admin
have we discussed adding column level lineage support to Airflow? https://marquezproject.slack.com/archives/C01E8MQGJP7/p1700087438599279?thread_ts=1700084629.245949&cid=C01E8MQGJP7
*Thread Reply:* we have it in SQL operators
*Thread Reply:* OOh any docs / code? or if you’d like to respond in the MQZ slack 🙏
*Thread Reply:* I’ll reply there
Any opinions about a free task management alternative to the free version of Notion (10-person limit)? Looking at Trello for keeping track of talks.
*Thread Reply:* What about GitHub projects?
*Thread Reply:* Projects is the way to go, thanks
*Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that’s missing that we could use is a built-in date field for alerting about upcoming deadlines…
worlds are colliding: 6point6 has been acquired by Accenture
*Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6
*Thread Reply:* We should sell OL to governments
*Thread Reply:* we may have to rebrand to ClosedLineage
*Thread Reply:* not in this way; just emit any event second time to secret NSA endpoint
*Thread Reply:* we would need to improve our stock photo game
CFP for Berlin Buzzwords went up: https://2024.berlinbuzzwords.de/call-for-papers/ Still over 3 months to submit 🙂
*Thread Reply:* thanks, updated the talks board
*Thread Reply:* https://github.com/orgs/OpenLineage/projects/4/views/1
*Thread Reply:* I'm in, will think what to talk about and appreciate any advice 🙂
just searching for OpenLineage in the Datahub code base. They have an “interesting” approach? https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd[…]odules/airflow-plugin/src/datahub_airflow_plugin/_extractors.py
*Thread Reply:* It looks like the datahub airflow plugin uses OL, but turns it off
https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
`disable_openlineage_plugin` (true): Disable the OpenLineage plugin to avoid duplicative processing.
They reuse the extractors but then “patch” the behavior.
*Thread Reply:* Of course this approach will need changing again with AF 2.7
*Thread Reply:* It looks like we can possibly learn from their approach in SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction
*Thread Reply:* what's that approach? I only know they have been claiming best SQL parsing capabilities
*Thread Reply:* I haven’t looked in the details but I’m assuming it is in this repo. (my comment is entirely based on the claim here)
*Thread Reply:* <https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql>
-> The interesting difference is that in order to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.
My understanding by example is: if you do `create table x as select * from y`, you need to resolve `*` to know column-level lineage. Our approach is to do that on the client side, probably with an extra call to the database. Their approach is to do it based on the data catalog information.
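A hedged sketch of the catalog-based variant: once a schema lookup is available, expanding `select *` into column-level lineage needs no extra database round-trip. All names here (`column_lineage_for_ctas`, the `catalog` dict) are illustrative, not any project's actual API:

```python
def column_lineage_for_ctas(target, source, catalog):
    """Map each target column to its source column for a
    'create table <target> as select * from <source>' statement."""
    # The catalog lookup replaces the client-side call to the database.
    source_columns = catalog[source]
    return {col: (source, col) for col in source_columns}


# Toy catalog: the schema of table y, as a data catalog would know it.
catalog = {"y": ["id", "name", "amount"]}
lineage = column_lineage_for_ctas("x", "y", catalog)
# Each column of x traces back to the same-named column of y.
```

The trade-off discussed above is just where `catalog` lives: next to the parser (client side) or inside the data catalog service.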
I’m off on vacation. See you in a week
Maybe move today's meeting earlier, since no one from west coast is joining? @Harel Shein
*Thread Reply:* Ah! That would have been a good idea, but I can’t :(
*Thread Reply:* Do you prefer an earlier meeting tomorrow?
*Thread Reply:* maybe let's keep today's meeting then
The full project history is now available at https://openlineage.github.io/slack-archives/. Check it out!
*Thread Reply:* tfw you thought the scrollback was gone 😳
*Thread Reply:* slack has a good activation story, I wonder how much longer they can keep this up for
*Thread Reply:* always nice to be reminded that there are no actual incremental costs on their end
*Thread Reply:* I guess it’s the difference between storing your data in memory vs. on a glacier 🧊
*Thread Reply:* ah yes surely there is some tiering going on
*Thread Reply:* i might get to marquez slack/PRs today, but most likely tmr morning
*Thread Reply:* If you’re looking for priorities, it would be really great if you could give feedback on one of @Paweł Leszczyński’s streaming support PRs today
*Thread Reply:* ok, I’ll get to the streaming PR first
*Thread Reply:* FYI, the namespace filtering is a good idea, just needs some feedback on impl / naming
Jens would like to know if there’s anything we want included in the welcome portion of the slide deck. Suggestions? (Aside from the usual links)
@Paweł Leszczyński I reviewed your PR today (mainly the logic on versioning for streaming jobs); here is the main versioning limitation for jobs: a new `JobVersion` is created only when a job run completes or fails (or is in the done state); that is, we don’t know if we have received all the input/output datasets, so we hold off on creating a new job version until we do.
For streaming, we’ll need to create a job version on start. Do we assume we have all input/output datasets associated with the streaming job? Does OpenLineage guarantee this to be the case for streaming jobs? Having versioning logic for batch vs streaming is a reasonable solution, just want to clarify
*Thread Reply:* yes, the logic adds distinction on how to create job version per processing type. For streaming, I think it makes more sense to create it at the beginning. Then, within other events of the same run, we need to check if the version has changed, and create new version in that case
*Thread Reply:* would we want to use the same versioning func `Utils.newJobVersionFor()` for streaming? That is, should we assume the input/output datasets contained within the OL event to be the “current” set for the streaming job?
*Thread Reply:* that is,
2 input streams, 1 output stream (version 1)
then, 1 input stream, 2 output streams (version 2)
...
*Thread Reply:* but what about the case when the in/out streams are not present:
1 input stream, 2 output streams (version 2)
then, 1 input stream, 0 output streams (version 3)
...
*Thread Reply:* The meaning of the events should be slightly different for streaming.
For batch, input and output datasets are cumulative across all the events. If we have an event with output datasets A + B, then another event with output datasets B + C, we assume the job has output datasets A + B + C.
For streaming, we may have a job that was reading data from topics A + B for a week, and then from B + C the next week. I think this should be reflected in different job versions. Making it cumulative for jobs that run for several weeks does not make that much sense to me. The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat that as a new version? If not, why not?
This part is missing in the PR, and our Flink integration always sends all the inputs & outputs. I can add extra logic that prevents creating a new job version if an event has no input or output datasets. However, I can't see any clean and generic solution to this.
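The two interpretations under discussion could be sketched like this (illustrative Python only, not Marquez's actual versioning code; event shape is assumed):

```python
def batch_io(events):
    """Batch semantics: inputs are cumulative across all events of a run."""
    inputs = set()
    for e in events:
        inputs |= set(e.get("inputs", []))
    return inputs


def streaming_versions(events):
    """Streaming semantics: a changed input set yields a new job version;
    events carrying no datasets at all do not create a new version."""
    versions = []
    for e in events:
        io = frozenset(e.get("inputs", []))
        if not io:
            continue  # e.g. a bytes-read-only event: no version change
        if not versions or versions[-1] != io:
            versions.append(io)
    return versions
```

Under batch semantics A+B then B+C merges to A+B+C; under streaming semantics the same sequence produces two distinct versions, and empty events are skipped.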
*Thread Reply:* > The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat that as a new version? If not, why not?
We can view the bytes read as additional metadata about the job’s inputs/outputs that wouldn’t trigger a new version (for the job or dataset). I would associate the bytes with the current dataset version and sum them up (I’ve read `X` bytes from dataset version `D`); you can also view tags in a similar way. In our current versioning logic for datasets, we create a new dataset version when a job completes; I think we’ll want to do something similar for streaming jobs, that is, when `X` bytes are written to a given dataset, that would trigger a new version
*Thread Reply:* > I can add extra logic that will prevent creating new job version if event has no input nor output datasets
Yes, if no in/out datasets are present, then I wouldn’t create a new job version. @Julien Le Dem opened an issue a while back about this https://github.com/MarquezProject/marquez/issues/1513. That is, there’s a difference between an empty set `[ ]` and `null`
*Thread Reply:* > This part is missing in PR and our Flink integration always sends all the input & datasets This is very important to note in the code and/or API docs
*Thread Reply:* Sure we should. Just wanted to make sure if this the way we want to go.
*Thread Reply:* @Willy Lulciuc did you have a chance to look at this as well: https://github.com/MarquezProject/marquez/pull/2654? This should be merged before streaming support, I believe.
*Thread Reply:* ahh sorry, I hadn’t realized they were related / dependent on one another. sure I’ll give the PR a pass
*Thread Reply:* I looked into your comments and found them, as always, really useful. I introduced changes based on most of them. Please take a look at my responses within the `Job` model class. I think there is one issue we still need to discuss.
What to do with the existing `type` field? I would opt for deprecating it, as within the introduced job facet the notion of `jobType` stands for QUERY|COMMAND|DAG|TASK|JOB|MODEL, while `processingType` determines if a job is batch or streaming.
One solution I see is deprecating `type` and introducing a `JobLabels` class as a property within `Job`, with fields like `jobType`, `processingType`, `integration`.
Another would be to send `processingType` within the existing `type` field. This would mimic the existing API, but require further work. The disadvantage is that we’d still have a mismatch between the job type in Marquez and the OpenLineage spec.
I would opt for (2), but (1) works for me as well.
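For reference, the facet shape being discussed might look like this (an illustrative sketch of the proposed fields, not the final spec):

```python
# Hypothetical job-type facet payload distinguishing the two notions above:
# `processingType` (batch vs streaming) vs `jobType` (kind of job).
job_facet = {
    "jobType": {
        "processingType": "STREAMING",  # BATCH or STREAMING
        "integration": "FLINK",         # producing integration
        "jobType": "JOB",               # QUERY|COMMAND|DAG|TASK|JOB|MODEL
    }
}
```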
I’m working on a redesign of the Ecosystem page for a more modern, user-friendly layout. It’s a work in progress, and feedback is welcome: https://github.com/OpenLineage/docs/pull/258.
Can someone count the folks in the room please? Can’t see anyone other than the speaker
@Michael Robinson can you hear the questions?
*Thread Reply:* I could hear all but one of the questions after the first talk
*Thread Reply:* Oh then it's better than I thought
I just had a lovely conversation at reinvent with the CTO of dbt, Connor, and didn’t even know it was him until the end 🤯
Congrats on a great event!
*Thread Reply:* Yeah it was pretty nice 🙂 A lot of good discussions with Google people. Also Jarek Potiuk was there
*Thread Reply:* I think it won't be the last Warsaw OpenLineage meetup
https://openlineage.slack.com/archives/C01CK9T7HKR/p1701288000527449 putting it here. I don’t feel like I’m the best person to answer, but I feel like the operational lineage we’re trying to provide is the thing
created a project for Blog post ideas: https://github.com/orgs/OpenLineage/projects/5/views/1
Release update: we’re due for an OpenLineage release and overdue for a Marquez release. As tomorrow, the first, is a Friday, we should wait until Monday at the earliest. I’m planning to open a vote for an OL release then, but Marquez is red so I’m holding off on a Marquez release for the time being.
*Thread Reply:* I can address the red CI status; it’s bc we’re seeing issues publishing our snapshots
*Thread Reply:* I think we should release Marquez on Mon. as well
*Thread Reply:* I want to get this https://github.com/OpenLineage/OpenLineage/pull/2284 into OL release
would be interesting if we can use this comparison as a learning (improve docs, etc): https://blog.singleorigin.tech/race-to-the-finish-line-age/
or rather, use the format for comparing OL with other tools 😉
*Thread Reply:* It would be nice to have something like this (I would want it to be a little more even-handed, though). It will be interesting to see if they will ever update this now that there’s automated lineage from Airflow supported by OL
Review needed of the newsletter section on Airflow Provider progress @Jakub Dardziński @Maciej Obuchowski when you have a moment. It will ship by 5 PM ET today, fyi. Already shared it with you. Thanks!
*Thread Reply:* Thanks @Jakub Dardziński
They finally uploaded the OpenLineage Airflow Summit videos to the Airflow channel on YT: https://www.youtube.com/@ApacheAirflow/videos
On Monday I’m meeting with someone at Confluent about organizing a meetup in London in January. I’m thinking I’ll suggest Jan. 24 or 31 as mid-week days work better and folks need time to come back from the vacation. If you have thoughts on this, would you please let me know by 10:00 am ET on Monday? Also, standup will be happening before the meeting — perhaps we can discuss it then. @Harel Shein
*Thread Reply:* Confluent says January 31st will work for them for a London meetup, and they’ll be providing a speaker as well. Is it safe to firm this up with them?
*Thread Reply:* I'd say yes; worst case, if Maciej doesn't get a new passport by then, I can speak
*Thread Reply:* I already got the photos 😂
*Thread Reply:* you gotta share them
*Thread Reply:* Also apparently it's possible to get temporary passport at airport in 15 minutes
*Thread Reply:* How civilized...
*Thread Reply:* you can get it at the Warsaw airport just like a last-minute passport; it costs almost nothing (30 PLN, which is ~7-8 USD)
*Thread Reply:* yeah, many people are surprised how developed our public service may be
*Thread Reply:* tbh it's always random, can be good can be shit 🙂
*Thread Reply:* lately it's definitely been better than 10 years ago tho
Feedback requested on the newsletter:
https://blog.datahubproject.io/extracting-column-level-lineage-from-sql-779b8ce17567 https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html
*Thread Reply:* [6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information. 🙂
*Thread Reply:* it’s open source, should we consider testing it out?
*Thread Reply:* I’m not sure about the methodology, but these numbers are pretty significant
*Thread Reply:* We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶
*Thread Reply:* More doctors smoke Camels than any other cigarette 😉 If you test on BigQuery, you will not get comparable results for Snowflake, for example.
Wondering if we can do anything about this. We could write a blog post on lineage extraction from Snowflake SQL queries. This is something we spent time on, and possibly we support dialect-specific queries that others don't.
*Thread Reply:* it all comes to the question whether we should start publishing comparisons
*Thread Reply:* We can also accept schema information in our SQL lineage parser. Actually, this would have been a good idea, I believe.
*Thread Reply:* for the `select *` use-case?
Release vote is here when you get a moment: https://openlineage.slack.com/archives/C01CK9T7HKR/p1701722066253149
Should we disable `openlineage-airflow` on Airflow 2.8 to force people to use the provider?
*Thread Reply:* it sounds like maybe something about this should be included in the 2.8 docs. The dev rel team is talking about the marketing around 2.8 right now…
*Thread Reply:* also, the release will evidently be out next Thursday
*Thread Reply:* I mean, `openlineage-airflow` is not part of Airflow
*Thread Reply:* We'd have provider for 2.8
*Thread Reply:* so maybe the airflow newsletter would be better
*Thread Reply:* is there anything about the provider that should be in the 2.8 marketing?
*Thread Reply:* I don't think so
*Thread Reply:* Kenten wants to mention that it will be turned off in the 2.8 docs, so please lmk if anything about this changes
https://github.com/OpenLineage/docs/pull/263
Changelog PR for 1.6.0: https://github.com/OpenLineage/OpenLineage/pull/2298
*Thread Reply:* that’s weird that ruff-lint found issues, especially since it has the ruff version pinned
*Thread Reply:* CHANGELOG.md:10: acccording ==> according
this change is accurate though 🙂
*Thread Reply:* I tried to sneak in a fix in dev but the linter didn’t like it so I changed it back. All set now
*Thread Reply:* The release is in progress
*Thread Reply:* ah, gotcha
dev/get_changes.py:49:17: E722 Do not use bare `except`
dev/get_changes.py:49:17: S112 `try`-`except`-`continue` detected, consider logging the exception
for next time just add `except Exception:` instead of `except:` 🙂
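The reason `except Exception:` is the better default: a bare `except:` also swallows `SystemExit` and `KeyboardInterrupt`, which usually should propagate. A small sketch of the suggested fix (illustrative code, not the actual `get_changes.py`):

```python
def parse_all(items):
    """Parse what we can, skipping bad entries instead of crashing."""
    results = []
    for item in items:
        try:
            results.append(int(item))
        except Exception:  # narrow (E722); Ctrl-C still propagates
            continue  # consider logging the exception here (S112)
    return results
```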
*Thread Reply:* GTK, thank you
The `release-integration-flink` job failed with this error message:
Execution failed for task ':examples:stateful:compileJava'.
> Could not resolve all files for configuration ':examples:stateful:compileClasspath'.
> Could not find io.**********************:**********************_java:1.6.0-SNAPSHOT.
Required by:
project :examples:stateful
*Thread Reply:* No cache is found for key: v1-release-client-java--rOhZzScpK7x+jzwfqkQVwOVgqXO91M7VEEtzYHNvSmY=
Found a cache from build 155811 at v1-release-client-java-
is this standard behaviour?
*Thread Reply:* well, same happened for 1.5.0 and it worked
*Thread Reply:* we gotta wait for Maciej/Pawel :<
*Thread Reply:* Looks like Gradle version got bumped and gives some problems
*Thread Reply:* Think we can release by midday tomorrow?
*Thread Reply:* oh forgot about this totally
Feedback sought on a redesign of the ecosystem page that (hopefully) freshens and modernizes the page: https://github.com/OpenLineage/docs/pull/258
Changelog PR for 1.6.1: https://github.com/OpenLineage/OpenLineage/pull/2301
*Thread Reply:* @Maciej Obuchowski the flink job failed again
*Thread Reply:* well, at least it's a different error
*Thread Reply:* one more try? https://github.com/OpenLineage/OpenLineage/pull/2302 @Michael Robinson
*Thread Reply:* 1.6.2 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2304
*Thread Reply:* @Maciej Obuchowski 👆
*Thread Reply:* going out for a few hours, so next try would be tomorrow if it fails again...
*Thread Reply:* Thanks, Maciej. That worked, and 1.6.2 is out.
Starting a thread for collaboration on the community meeting next week
*Thread Reply:* Releases: 1.6.2
*Thread Reply:* 2023 recap/“best-of”?
*Thread Reply:* @Harel Shein any thoughts? Also, does anyone know if Julien will be back from vacation?
*Thread Reply:* We should probably try to do something with the Google proposal
*Thread Reply:* Not sure if it needs additional discussion, maybe just implementation?
*Thread Reply:* I can ask him, but it would probably be good if you could facilitate next week @Michael Robinson?
*Thread Reply:* I agree that we need to address those Google proposals, we should ask Jens if he’s up for presenting and discussing them first?
*Thread Reply:* maybe Pawel wants to present progress with https://github.com/OpenLineage/OpenLineage/issues/2162?
*Thread Reply:* Still waiting on a response from Jens
*Thread Reply:* I think Jens does not have a lot of time now
*Thread Reply:* Emailed him in case he didn’t see the message
*Thread Reply:* Jens confirmed
*Thread Reply:* He will have to join about 15 minutes late
*Thread Reply:* would love to come but I'm at friend's birthday at that time 😐
*Thread Reply:* I’d love to as well, but have diner plans 😕
*Thread Reply:* count me in if not too late
@Paweł Leszczyński mind giving this PR a quick look? https://github.com/MarquezProject/marquez/pull/2700 … it’s a dep on https://github.com/MarquezProject/marquez/pull/2698
*Thread Reply:* thanks @Paweł Leszczyński for the +1 ❤️
@Jakub Dardziński: In Marquez, metrics are exposed via the `/metrics` endpoint using Prometheus (most of the custom metrics defined are here). Oddly enough, the Prometheus roadmap states that they have yet to adopt OpenMetrics! But you can backfill the metrics into Prometheus. So, knowing this, I would move to using Metrics Core by Dropwizard and use an exporter to export metrics to Datadog using metrics-datadog. The one major benefit here is that we can define a framework around defining custom metrics internally within Marquez using core Dropwizard libraries, and then enable the reporter via configuration to emit metrics in `marquez.yml`. For example:
```yaml
metrics:
  frequency: 1 minute  # Default is 1 second.
  reporters:
    - type: datadog
      # ...
```
*Thread Reply:* I tested this actually, and it works. The only thing is traces; I found it very poor to just have metrics around the function name
*Thread Reply:* I totally agree, although I feel metrics and tracing are two separate things here
*Thread Reply:* I really appreciate your help and advice! 🙂
*Thread Reply:* Of course, happy to chime in here
*Thread Reply:* I’m just happy this is getting some much needed love 😄
*Thread Reply:* Also, it seems like datadog uses OpenTelemetry: > Datadog Distributed Tracing allows you easily ingest traces via the Datadog libraries and agent or via OpenTelemetry And looks like OpenTelemetry has support for Dropwizard
*Thread Reply:* yep, that's why I liked otel idea
*Thread Reply:* Also, here are the docs for DD + OpenTelemetry … so enabling OpenTelemetry in Marquez would be doable
*Thread Reply:* and we can make all of this configurable via `marquez.yml`
*Thread Reply:* hit me up with any questions! (just know, there will be a delay)
*Thread Reply:* > and we can make all of this configurable via `marquez.yml`
it ain’t that easy - we would need to build an extended jar with the OTEL agent, which I think is way too much work compared to the benefits. you can still configure via env vars or system properties
I’ve been looking into partitioning for psql, think there’s potential here for huge perf gains. Anyone have experience?
*Thread Reply:* partition ranges will give a boost by default
*Thread Reply:* Which tables do you want to partition? Event ones?
*Thread Reply:* • runs
• job_versions
• dataset_versions
• lineage_events
• and all the facets tables
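As a rough illustration of what range partitioning could look like for one of those tables (hypothetical monthly DDL for `lineage_events`; an actual Marquez migration would differ):

```python
def monthly_partition_ddl(table, year, month):
    """Generate DDL for one monthly range partition of a Postgres table
    that was declared PARTITION BY RANGE on an event-time column."""
    nxt_y, nxt_m = (year + 1, 1) if month == 12 else (year, month + 1)
    return (
        f"CREATE TABLE {table}_{year}_{month:02d} PARTITION OF {table} "
        f"FOR VALUES FROM ('{year}-{month:02d}-01') "
        f"TO ('{nxt_y}-{nxt_m:02d}-01');"
    )
```

The perf win comes from partition pruning: queries constrained to a time range only scan the matching partitions, and old partitions can be dropped cheaply.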
@Paweł Leszczyński • PR 2682 approved with minor comments on stream versioning logic / suggestions ✅ • PR 2654 approved with minor comment (we’ll want to do a follow up analysis on the query perf improvements) ✅
*Thread Reply:* Thanks @Willy Lulciuc. I applied all the recent comments and merged 2654.
There is one discussion left in 2682, which I would like to resolve before merging. I added an extra comment on the implemented approach and am open to hearing whether this is the approach we can go with.
@Julien Le Dem @Maciej Obuchowski discussion is about when to create a new job version for a streaming job. No deep dive in the code is required to take part in it. https://github.com/MarquezProject/marquez/pull/2682#discussion_r1425108745
*Thread Reply:* awesome, left my final thoughts 👍
Maybe we should clarify the documentation on adding custom facets at the integration level? Wdyt? https://openlineage.slack.com/archives/C01CK9T7HKR/p1702446541936589
Hey, I think it would help some people using the Airflow integration (with Airflow 2.6) if we release a patch version of the OL package with this PR included: #2305. I am not sure what the release cycle is here, but maybe there is already an ETA on the next patch release? If so, please let me know 🙂 Thanks!
*Thread Reply:* you gotta ask for the release in #general, 3 votes from committers approve immediate release 🙂
*Thread Reply:* @Michael Robinson 3 votes are in 🙂
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1702474416084989
*Thread Reply:* Thanks for the ping. I replied in #general and will initiate the release as soon as possible.
seems that we don’t output the correct namespace as in the naming doc for Kafka. we output the kafka server/broker URL as namespace (in the Flink integration specifically) https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#kafka
*Thread Reply:* @Paweł Leszczyński, would you be able to add the Kafka: prefix to the Kafka visitors in the flink integration tomorrow?
*Thread Reply:* I am happy to do this. Just to make sure: the docs are correct, and the flink implementation is missing the `kafka://` prefix, right?
*Thread Reply:* Thanks @Paweł Leszczyński. made a couple of suggestions, but we can def merge without
*Thread Reply:* would love to discuss this first. If a user stores an Iceberg table in S3, should it conform to S3 naming or Iceberg naming?
It's the S3 location which defines a dataset; Iceberg is a format for accessing data, not an identifier as such.
*Thread Reply:* No rush, just something we noticed and that some people in the community are implementing their own patch for it.
my next year goal is to have a programmatic way of using the naming convention
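Such a programmatic helper could be as small as this (a hypothetical sketch following the `kafka://{broker}` namespace convention from spec/Naming.md; the helper name is made up):

```python
def kafka_dataset_name(bootstrap_servers, topic):
    """Build the (namespace, name) pair for a Kafka dataset per the
    naming spec: namespace is kafka://{broker}, name is the topic."""
    namespace = f"kafka://{bootstrap_servers}"
    return namespace, topic


ns, name = kafka_dataset_name("broker:9092", "orders")
```

Centralizing this in one helper is exactly what would have prevented the missing-prefix bug in the Flink visitors.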
*Thread Reply:* nope but would be worth reaching out to them to see how we could collaborate? they’re part of the LFAI (sandbox): https://github.com/bitol-io/open-data-contract-standard
*Thread Reply:* background https://medium.com/profitoptics/data-contract-101-568a9adbf9a9
*Thread Reply:* We should still have a conversation:)
If you want to join the conversation on Ray.io integration: https://join.slack.com/t/ray-distributed/shared_invite/zt-2635sz8uo-VW076XU6bKMEiFPCJWr65Q
*Thread Reply:* is there any specific channel/conversation?
*Thread Reply:* Yeah, but it’s private. Added you. For everyone else, Ping me on slack when you join and I’ll add you.
A vote to release Marquez 0.43.0 is open. We need one more: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1702657403267769
*Thread Reply:* the changelog PR is RFR
AWS is making moves! https://github.com/aws-samples/aws-mwaa-openlineage
*Thread Reply:* the repo itself is pretty old, last updated 2mo ago, and it used the OL package, not the provider (1.4.1)
*Thread Reply:* still it's nice they're doing this :)
*Thread Reply:* since they’re using MWAA they won’t be affected by the turn-off coming with Airflow 2.8 for a while. Otherwise that would be a good excuse to get in touch with them
*Thread Reply:* I think this repo was related to the blog which was authored a while back - https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/ No other moves from our end so far, at least MWAA team : )
*Thread Reply:* Hi All, I am one of the owners of this repo and working to update this to work with MWAA 2.8.1, with apache-airflow-providers-openlineage==1.4.0. I am facing an issue with my set-up. I am using Redshift SQL as a sample use-case for this and getting an error relating to the Default Extractor. Haven't really looked at this in much detail yet but wondering if you have thoughts? I just updated the env variables to use: `AIRFLOW__OPENLINEAGE__TRANSPORT` and `AIRFLOW__OPENLINEAGE__NAMESPACE` and changed the operator from PostgresOperator to SQLExecuteQueryOperator.
[2024-03-07 03:52:55,496] Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.base.DefaultExtractor object at 0x7fc4aa1e3950> - section/key [openlineage/disabled_for_operators] not found in config task_type=SQLExecuteQueryOperator airflow_dag_id=rs_source_to_staging task_id=task_insert_event_data airflow_run_id=manual__2024-03-07T03:52:11.634313+00:00
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,499] Executing:
insert into event
SELECT eventid, venueid, catid, dateid, eventname, starttime::TIMESTAMP
FROM s3_datalake.event;
*Thread Reply:* @Paul Wilson Villena It looks like a small mistake in OL that I'll fix in the next version - we missed adding a fallback there, and getting the Airflow configuration raises an error when `disabled_for_operators` is not defined in the airflow.cfg file / the env variable. For now it should help to simply add the `[openlineage]` section (see https://airflow.apache.org/docs/apache-airflow-providers-openlineage/1.4.0/configurations-ref.html#id1) to `airflow.cfg` and set `disabled_for_operators=""`, or just export `AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""`
*Thread Reply:* Will be released in the next provider version: https://github.com/apache/airflow/pull/37994
*Thread Reply:* Hi @Kacper Muda, it seems I need to also set this, otherwise the error `section/key [openlineage/config_path] not found in config` persists:
`os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""`
*Thread Reply:* Yes, sorry for missing that. I fixed it in the code and forgot to mention it. If you were to not use `AIRFLOW__OPENLINEAGE__TRANSPORT`, you'd have to set it to an empty string as well, as it's missing the same fallback 🙂
The release is finished. Slack post, etc., coming soon
have we thought of making the SQL parser pluggable?
*Thread Reply:* what do you mean by that?
*Thread Reply:* (this is coming from apple) like what if a user wanted to provide their own parser for SQL in place of the one shipped with our integrations
*Thread Reply:* for example, if/when we integrate with DataHub, can they use their parser instead of the one provided
*Thread Reply:* that would be difficult, we would need strong motivation for that 🫥
This question fits with what we said we would try to document more, can someone help them out with it this week? https://openlineage.slack.com/archives/C063PLL312R/p1702683569726449
Airflow 2.8 has been released. Are we still “turning off” the external Airflow integration with this one? What do Airflow users need to know to avoid unpleasant surprises? Kenten is open to including a note in the 2.8 blog post.
*Thread Reply:* As a newcomer here, I believe it would be wise to avoid supporting Airflow 2.8+ in the openlineage-airflow package. This approach would encourage users to transition to the provider package. It's important to clearly communicate that ongoing development and enhancements will be focused on the apache-airflow-providers-openlineage package, while openlineage-airflow will primarily be updated for bug fixes. I'll look into whether this strategy is already noted in the documentation. If not, I will propose a documentation update.
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2330 Please let me know if some changes are required, I was not sure how to properly implement it.
This looks cool, might be useful for us? https://github.com/aklivity/zilla
*Thread Reply:* the license is a bit weird, but should be ok for us. it’s apache, unless you directly compete with the company that built it.
*Thread Reply:* tbh not sure how
*Thread Reply:* I think we should be focused on 1) being compatible with the most popular solutions (kafka...) 2) being easy to integrate with (custom transports),
rather than forcing our opinionated way of how OpenLineage events should flow in a customer's architecture
Apologies for having to miss today’s committer sync — I’ll be picking up my daughter from school
WDYT about starting to add integration specific channels and adding a little welcome bot for people when they join?
*Thread Reply:* the -dev and -users split seems like overkill, but I also understand that we may want to separate user questions from development
*Thread Reply:* maybe just shorten to spark-integration, flink-integration, etc. Or integrations-spark etc
*Thread Reply:* we probably should consider a development and a welcome channel
*Thread Reply:* yeah.. makes sense to me. let’s leave this thread open for a few days so more people can chime in and then I’ll make a proposal based on that.
*Thread Reply:* makes sense to me
*Thread Reply:* I think there is not enough discussion for it to make sense
*Thread Reply:* and empty channels do not invite to discussion
*Thread Reply:* Maybe worth it for spark questions alone? And then for equal coverage we need the others. It’s getting easy to overlook questions in general due to the increased volume and long code snippets, IMO.
*Thread Reply:* yeah I think the volume is still quite low
*Thread Reply:* something like Airflow's #troubleshooting channel easily has order of magnitude more messages
*Thread Reply:* and even then, I'd split between something like #troubleshooting and #development rather than between integrations
*Thread Reply:* not only because it's too granular, but also there's a development that isn't strictly integration related or touches multiple ones
Link to the vote to release the hot fix in Marquez: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1703101476368589
For the newsletter this time around, I’m thinking that a year-end review issue might be nice in mid-January when folks are back from vacation. And then a “double issue” at the end of January with the usual updates. We’ve still got a rather, um, “select” readership, so the stakes are low. If you have an opinion, please lmk.
*Thread Reply:* I’m for mid-January option
1.7.0 changelog PR needs a review: https://github.com/OpenLineage/OpenLineage/pull/2331
Notice for the release notes (WDYT?):
COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0.
Please use the OpenLineage Airflow Provider instead.
It includes a link to here: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html
https://eu.communityovercode.org/ is that a proper conference to talk about OL at?
*Thread Reply:* Yes! At least to a broader audience
Happy new year all!
Am I the only one that sees Free trial in progress here in OL slack?
*Thread Reply:* same for me, I think Slack initiated that
*Thread Reply:* we’re on the Slack Pro trial (we were on the free plan before)
*Thread Reply:* I think Slack initiated it
https://github.com/OpenLineage/OpenLineage/issues/2349 - this issue is really interesting. I am hoping to see follow-up from David.
So who wants to speak at our meetup with Confluent in London on Jan. 31st?
*Thread Reply:* do we have sponsorship to fly over?
*Thread Reply:* Not currently but I can look into it
*Thread Reply:* do we have other active community members based in the UK?
*Thread Reply:* I’ll ask around
*Thread Reply:* Abdallah Terrab at Decathlon has volunteered
*Thread Reply:* does it mean we're still looking for someone?
*Thread Reply:* I already said last month that I'd go, but not sure if it's still needed
*Thread Reply:* and does Confluent have a talk there?
*Thread Reply:* Sorry about that, Maciej. I’ll ask Viraj if Astronomer would cover your ticket. There will be a Confluent speaker.
*Thread Reply:* if we need to choose between kafka summit and a meetup - I think we should go for kafka summit 🙂
*Thread Reply:* I think so too
*Thread Reply:* Viraj has requested approval for summit, and we can expect to hear from finance soon
*Thread Reply:* Also, question from him about the talk: what does “streaming” refer to in the title — Kafka only?
*Thread Reply:* Kafka, Spark, Flink
*Thread Reply:* If someone wants to do a joint talk let me know 😉
*Thread Reply:* @Willy Lulciuc will you be in UK then?
*Thread Reply:* I can be, if astronomer approves? but also realizing it’s Jan 31st, so a bit tight
*Thread Reply:* yeah, that sounds… unlikely 🙂
*Thread Reply:* also thinking about all the traveling ive been doing recently and things I need to work on. would be great to have some focus time
did anyone submit a talk to https://www.databricks.com/dataaisummit/call-for-presentations/?
*Thread Reply:* tagging @Julien Le Dem on this one. since the deadline is tomorrow.
*Thread Reply:* I don’t have my computer with me. ⛷️
*Thread Reply:* Does @Willy Lulciuc want to submit? Happy to be co-speaker (if you want. But not necessary)
*Thread Reply:* Willy is also on vacation, I’m happy to submit for the both of us. I’ll try to get something out today
*Thread Reply:* would be great to have something on the Databricks conference 🙂
*Thread Reply:* what I'm currently thinking: learnings from OpenLineage adoption in Airflow and Flink, and what can be learned / applied to Spark lineage.
This month’s TSC meeting is next Thursday. Anyone have any items to add to the agenda?
*Thread Reply:* @Kacper Muda would you want to talk about doc changes in Airflow provider maybe?
*Thread Reply:* no pressure if it's too late for you 🙂
*Thread Reply:* It's fine, I could probably mention something about it - the problem is that I have a standing commitment every Thursday from 5:30 to 7:30 PM (Polish time, GMT+1) which means I'm unable to participate. 😞
*Thread Reply:* @Kacper Muda would you be open to recording something? We could play it during the meeting. Something to consider if you’d like to participate but the time doesn’t work.
*Thread Reply:* Let me see how much time I'll have during the weekend and get back to you 🙂
*Thread Reply:* Sorry, I got sick and won't be able to do it. Maybe I'll try to make it personally to the next meeting; by then the docs should already be released 🙂
*Thread Reply:* @Julien Le Dem are there updates on open proposals that you would like to cover?
*Thread Reply:* @Paweł Leszczyński as you authored about half of the changes in 1.7.0, would you be willing to talk about the Flink fixes? No slides necessary
*Thread Reply:* @Michael Robinson no updates from me
*Thread Reply:* sry @Michael Robinson, I won't be able to join this time. My changes in 1.7.0 are rather small fixes. Perhaps someone else can present them shortly.
I saw some weird behavior with openlineage-airflow where it will not respect the transport config for the client, even when setting OPENLINEAGE_CONFIG to point to a config file.
the workaround is that if you set the OPENLINEAGE_URL env it will read that and use the config.
this bug doesn't seem to exist in the airflow provider since the loading method is completely different.
*Thread Reply:* would you mind creating an issue?
*Thread Reply:* will do. let me repro on the latest version of openlineage-airflow and see if I can repro on the provider
*Thread Reply:* config is definitely more complex than it needs to be...
*Thread Reply:* hmm.. so in 1.7.0, if you define OPENLINEAGE_URL then it completely ignores whatever is in the OPENLINEAGE_CONFIG yaml
*Thread Reply:* if you don't define OPENLINEAGE_URL, and you do define OPENLINEAGE_CONFIG: then openlineage is disabled 😂
*Thread Reply:* are you sure OPENLINEAGE_CONFIG points to something valid?
*Thread Reply:* ~look at the code flow for create:~ ~the default factory doesn’t supply config, so it tried to set http config from env vars. if that doesn’t work, it just returns the console transport~
*Thread Reply:* oh, no nvm. it’s a different flow.
*Thread Reply:* but yes, OPENLINEAGE_CONFIG works for sure
*Thread Reply:* on 0.21.1, it was working when OPENLINEAGE_URL was supplied
*Thread Reply:* ```transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)```
this self.config actually looks at:
```@property
def config(self) -> dict[str, Any]:
    if self._config is None:
        self._config = load_config()
    return self._config```
which then uses this:
```def load_config() -> dict[str, Any]:
    file = _find_yaml()
    if file:
        try:
            with open(file) as f:
                config: dict[str, Any] = yaml.safe_load(f)
                return config
        except Exception:  # noqa: BLE001, S110
            # Just move to read env vars
            pass
    return defaultdict(dict)


def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system

    # Check current working directory:
    try:
        cwd = os.getcwd()
        if "openlineage.yml" in os.listdir(cwd):
            return os.path.join(cwd, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        pass  # We can get different errors depending on system

    # Check $HOME/.openlineage dir
    try:
        path = os.path.expanduser("~/.openlineage")
        if "openlineage.yml" in os.listdir(path):
            return os.path.join(path, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        # We can get different errors depending on system
        pass

    return None```
*Thread Reply:* oh I think I see
*Thread Reply:* so this isn't passed if you have a config but there is no transport field in this config:
```transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)```
*Thread Reply:* here's the config I'm using:
```transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    security.protocol: "SSL"
    # CA certificate file for verifying the broker's certificate.
    ssl.ca.location=ca-cert
    # Client's certificate
    ssl.certificate.location=client_?????_client.pem
    # Client's key
    ssl.key.location=client_?????_client.key
    # Key password, if any.
    ssl.key.password=abcdefgh
  topic: "SOF0002248-afaas-lineage-DEV-airflow-lineage"
  flush: True```
*Thread Reply:* it should load, but fail when actually trying to emit to kafka
*Thread Reply:* but it should still init the transport
*Thread Reply:* I'm testing on this image:
```FROM quay.io/astronomer/astro-runtime:6.4.0

COPY openlineage.yml /usr/local/airflow/
ENV OPENLINEAGE_CONFIG=/usr/local/airflow/openlineage.yml
ENV AIRFLOW__CORE__LOGGING_LEVEL=DEBUG```
*Thread Reply:* are you sure there are no permission errors?
*Thread Reply:* ```def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system```
it checks stuff like os.access(path, os.R_OK)
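To rule out the permission path quickly, the guard above can be exercised in isolation (a minimal sketch; `usable` is a helper written here, not part of the client):

```python
import os
import tempfile

# Mirrors the guard in _find_yaml(): a config path is used only if the
# file exists AND is readable by the process running Airflow; otherwise
# the config file is silently skipped.
def usable(path: str) -> bool:
    return bool(path) and os.path.isfile(path) and os.access(path, os.R_OK)

with tempfile.NamedTemporaryFile(suffix=".yml") as f:
    print(usable(f.name))                  # -> True (exists and readable)
print(usable("/no/such/openlineage.yml"))  # -> False
```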
*Thread Reply:* I'm sure, because if I uncomment ENV OPENLINEAGE_URL=<http://foo.bar/> on 0.21.1, it works
*Thread Reply:* ah, I can add a permissive chmod to the dockerfile to see if it helps
*Thread Reply:* but I’m also not seeing any log/exception in the task logs
*Thread Reply:* will look at this later if you won't find solution 🙂
*Thread Reply:* one more thing, can you try with just
```transport:
  type: console```
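One possible culprit worth checking (an editor's hedged guess, not confirmed in the thread): the kafka config pasted earlier mixes YAML `key: value` entries with properties-style `key=value` lines. PyYAML rejects a bare `key=value` line inside a block mapping, and load_config() swallows the exception and returns an empty config, which would match the "openlineage is disabled" symptom:

```python
import yaml  # PyYAML, already a dependency of the OL client

# A trimmed version of the pasted config: the "=" line is Java-properties
# syntax, not YAML, so safe_load raises and load_config() falls into its
# except branch, ignoring the whole file.
mixed = """
transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    ssl.ca.location=ca-cert
"""
try:
    yaml.safe_load(mixed)
    parsed = True
except yaml.YAMLError:
    parsed = False
print(parsed)  # -> False
```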
*Thread Reply:* but yeah, there’s something not great about separation of concerns between client config and adapter config
*Thread Reply:* the adapter should not care, unless you're using MARQUEZ_URL... which is backwards compatibility from when it was still the marquez airflow integration
I’m starting to put together the year-in-review issue of the newsletter and wonder if anyone has thoughts on the “big stories” of 2023 in OpenLineage. So far I’ve got: • Launched the Airflow Provider • Added static AKA design lineage • Welcomed new ecosystem partners (Google, Metaphor, Grai, Datahub) • Started meeting up and held events with Metaphor, Google, Collibra, etc. • Graduated from the LFAI What am I missing? Wondering in particular about features. Is iceberg support in Flink a “big” enough story? Custom transport types? SQL parser improvements?
Blog post content about contributing to OpenLineage-Spark and its code internals. The content comes from the November meetup at Google, and I split it into two posts: https://docs.google.com/document/d/1Hu6clFckse1J_M1w2MMaTTJS0wUihtFsxbDQchtTVtA/edit?usp=sharing
Does anyone remember why execution_date was chosen as part of the run_id for an Airflow task, instead of, for example, start_date?
Due to this decision, we can encounter duplicate run_ids if we delete the DagRun from the database, because the execution_date remains the same. If I run a backfill job for yesterday, then delete it and run it again, I get the same ids. I'm trying to understand the rationale behind this choice so we can determine whether it's a bug or a feature. 😉
*Thread Reply:* start_date is unreliable AFAIK, there can be no start date sometimes
*Thread Reply:* this might be true for only some version of airflow
*Thread Reply:* Also, here where they define the combination of some deterministic attributes, execution_date is used and not start_date, so there might be something to it. That still leaves us with the behaviour I described.
*Thread Reply:* > Due to this decision, we can encounter duplicate run_id if we delete the DagRun from the database, because the execution_date remains the same.
Hmm, so given that the OL runID uses the same params Airflow uses to generate the hash, this seems more like a limitation. A better question would be: if a user runs, deletes, then runs the same DAG again, is that an expected scenario we should handle? tl;dr yes, but Airflow hasn't felt it important enough to address.
*Thread Reply:* not concurrent with Databricks conference this year? 😂
*Thread Reply:* nope, a week before so that everyone goes there and gets sick and can’t attend the databricks conf on the following week
*Thread Reply:* outstanding business move again
*Thread Reply:* @Willy Lulciuc wanna submit to this?
*Thread Reply:* yeah id love to: maybe something like “Detecting Snowflake table schema changes with OpenLineage events” + use cases + using lineage to detect impact?
*Thread Reply:* yeah! that sounds like a fun talk!
*Thread Reply:* idk if @Julien Le Dem will already be in 🇫🇷 that week? but I’d be happy to co-present if not.
*Thread Reply:* I don’t know when I’m flying out yet but it will be in the middle of that time frame.
*Thread Reply:* +1 on Harel co-presenting :)
*Thread Reply:* School last day is the 4th. I need to be in France (not jet lagged) before the 8th
*Thread Reply:* ok, @Harel Shein I’ll work on getting a rough draft ready before the deadline (added to my TODO of tasks)
hey, I'm not feeling well, will probably skip today's meeting
Added comment to discussion about Spark parent job issue: https://github.com/OpenLineage/OpenLineage/issues/1672#issuecomment-1883524216 I think we have the consensus so I'll work on it.
*Thread Reply:* @Maciej Obuchowski should we give an update on that at the TSC meeting tomorrow?
Another issue: do you think we should somehow handle the partitioning in OpenLineage standard and integrations? I would think of a situation where we somehow know how dataset is partitioned - not think about how to automagically detect that. Some example situations:
*Thread Reply:* interesting. did you hear from anyone with this use case/requirement?
Hey, there's a Windows user getting this error when trying to run Marquez: org-apache-tomcat-jdbc-pool-connectionpool-unable-to-create-initial-connection. Is it a driver issue? I'll try to get more details and reproduce it, but if you know what this is probably related to, please lmk
*Thread Reply:* do we even support Windows? 😱
*Thread Reply:* Here's more info about the use case:
> Thanks Michael, this is really helpful. I am working on a project where I need to run Marquez and OpenLineage on top of Airflow DAGs which run dbt commands internally through BashOperator. I need to present to my org whether we would benefit from bringing in Marquez metadata lineage. So I was taking this approach of setting up Marquez first, then will explore how it integrates with Airflow using BashOperator.
*Thread Reply:* We don’t support this operator, right? What kind of graph can they expect?
*Thread Reply:* They can use dbt integration directly maybe?
*Thread Reply:* We now have a former astronomer as engineering director at DataHub
*Thread Reply:* https://www.linkedin.com/in/samantha-clark-engineer/
taking on letting users run integration tests
*Thread Reply:* so there are two issues I think
*Thread Reply:* in Airflow workflows there are:
```filters:
  branches:
    ignore: /pull\/[0-9]+/```
which are set only for PRs from forked repos
*Thread Reply:* in Spark there are tests that strictly require env vars (that contain credentials to various external systems like databricks or bigquery). if those env vars are missing, the tests fail, which confuses new committers
*Thread Reply:* the first behaviour is silent - which I think is bad, because it's easy to skip integration tests and the build is green (but should it be? we don't know - the integration tests didn't run, and someone needs to know and remember that before approving and merging)
*Thread Reply:* the second is misleading, because it hints there's something wrong in the code when there doesn't necessarily need to be. on the other hand, you shouldn't approve and merge a failing build, so you see there's some action required
*Thread Reply:* reg. action required: for now we’ve been running some manual step (using https://github.com/jklukas/git-push-fork-to-upstream-branch) which is a workaround but it’s not straightforward and requires manual work. it also doesn’t solve two issues I mentioned above
*Thread Reply:* what I propose is to simply add an approval step before integration tests: https://circleci.com/docs/workflows/#holding-a-workflow-for-a-manual-approval
it's a circleCI-only thing, so you need to log into circleCI, check if there's any pending task to approve, and then approve (or not)
*Thread Reply:* it doesn't allow for much configuration, but I think it would work. you also can't integrate it in any way with the GitHub UI (e.g. there's no option to click something in the PR's UI to approve)
*Thread Reply:* but it would let project maintainers manage when the code is safe to run, and it's still visible and (I think) readable for everyone
*Thread Reply:* the only thing I'm not sure about is who can approve:
> Anyone who has push access to the repository can click the **Approval** button to continue the workflow
but I'm not sure to which repo. if someone runs on a fork and has push access to the fork - can they approve? it wouldn't make sense..
*Thread Reply:* https://circleci.com/blog/access-control-cicd/
that’s best I could find from circleCI on this subject
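For reference, the hold step proposed above would look roughly like this in .circleci/config.yml (job names here are illustrative, not from the actual OpenLineage config):

```yaml
workflows:
  build-and-test:
    jobs:
      - unit-tests
      - hold-integration-tests:   # manual gate, approved in the CircleCI UI
          type: approval
          requires:
            - unit-tests
      - integration-tests:
          requires:
            - hold-integration-tests
```

`type: approval` pauses the workflow until someone with push access clicks Approve in the CircleCI UI; there is no GitHub-side button.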
*Thread Reply:* so I think the best solution would be to:
> Pass secrets to builds from forked pull requests
(requires careful review of the CI process)
*Thread Reply:* contexts give the possibility to let users run e.g. unit tests in CI without exposing credentials
*Thread Reply:* this approach makes sense to me, assuming the permission model is how you outlined it
*Thread Reply:* one thing to add and test: the approval step could have a condition to always run if it's not from a fork. not sure if that's possible
*Thread Reply:* that sounds like it should be doable within what's available in Circle. GHA can definitely do that
*Thread Reply:* GHA can do things that circleCI can’t 😂
*Thread Reply:* > approval steps could have condition to always run if it's not from fork. not sure if that's possible
ffs it's not that easy to set up
*Thread Reply:* whenever I touch circleCI base changes I feel like a magician. yet, here goes the PR with the magic (just look at the side effect, it bothered me for so long 🪄) 🚨 🚨 🚨 https://github.com/OpenLineage/OpenLineage/pull/2374
*Thread Reply:* I'm assuming the reason for the speedup in determine_changed_modules is that we don't go install yq every time this runs?
What are your nominees/candidates for the most important releases of 2023? I’ll start (feel free to disagree with my choices, btw): • 1.0.0 • 1.7.0 (disabled the external Airflow integration for 2.8+) • 0.26.0 (Fluentd) • 0.19.2 (column lineage in SQL parser) • …
*Thread Reply:* > • 1.7.0 (disabled the external Airflow integration for 2.8+)
that doesn't sound like one of the most important
*Thread Reply:* 1.0.0 had this: https://openlineage.io/docs/releases/1_0_0 which actually fixed spec to match JSON schema spec
@Gowthaman Chinnathambi has joined the channel
1.8 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2380
FYI: I'm trying to set up our meetings with the new LFAI tooling, you may get some emails. you can ignore for now.
@Krishnaraj Raol has joined the channel
@Harel Shein @Julien Le Dem @tati Python client releases on both Marquez and OpenLineage are failing because PyPI no longer supports password authentication. We need to configure the projects for Trusted Publishers or use an API token. I've looked and can't find OpenLineage credentials for PyPI, but even if I had them we'd still need to set up 2FA in order to make the change. How should we proceed here? Should we sync to sort this out? Thanks (btw I reached out to Willy separately when this came up during a Marquez release attempt last week)
*Thread Reply:* ah! I can look into it now.
*Thread Reply:* I'm failing to find the credentials for PyPI anywhere.
*Thread Reply:* @Maciej Obuchowski any ideas? (git blame shows you wrote that part, and @Michael Collado did some circleCI setup at some point)
*Thread Reply:* I just sent a reset password email to whoever registered the openlineage user
*Thread Reply:* Thanks @Harel Shein. Can confirm I didn’t get it
*Thread Reply:* The password should be in the CircleCI context right?
*Thread Reply:* yes, it's there. I was trying to avoid writing a job that prints it out
*Thread Reply:* that will be the fallback if no one responds 🙂
*Thread Reply:* alright, I've setup 2FA and added a few more emails to the PyPI account as fallback.
*Thread Reply:* unfortunately, the only Trusted Publisher option PyPI supports is GitHub Actions, so we'll have to use the API token route. PR incoming soon
*Thread Reply:* didn't need to make any changes. I updated the Circle context and re-ran the PyPI release - we're back to :large_green_circle:
*Thread Reply:* ^ @Michael Robinson FYI
*Thread Reply:* thank you @Harel Shein. Releasing the jars now. Everything looks good
It looks like my laptop bit the dust today so might miss the sync
Do I have to create account to join the meeting?
*Thread Reply:* turns out you can just pass your mail
*Thread Reply:* same on my side
*Thread Reply:* i don't remember if i have one
I did something wrong, meeting link should be good now: https://zoom-lfx.platform.linuxfoundation.org/meeting/91671382513?password=424b74a1-43fa-4d0e-885f-c9b5417cf57b
*Thread Reply:* use same e-mail you got invited to
there's a data warehouse (https://www.firebolt.io/) and a streaming platform (https://memphis.dev/) written in Go. so I guess it's not futile to write a Go client? 🙂
Any potential issues with scheduling a meetup on Tuesday, March 19th in Boston that you know of? The Astronomer all-hands is the preceding week
PR to add 1.8 release notes to the docs needs a review: https://github.com/OpenLineage/docs/pull/274
*Thread Reply:* thanks @Jakub Dardziński
Feedback requested on this draft of the year-in-review issue of the newsletter: https://docs.google.com/document/d/1MJB9ughykq9O8roe2dlav6d8QbHZBV2A0bTkF4w0-jo/edit?usp=sharing. Did you give a talk that isn't in the talks section? Is there an important release that should be in the releases section but isn't? Other feedback? Please share.
Feedback requested on a new page for displaying the ecosystem survey results: https://github.com/OpenLineage/docs/pull/275. The image was created for us by Amp. @Julien Le Dem @Harel Shein @tati
*Thread Reply:* Looks great!
*Thread Reply:* Very cool indeed! I wonder if we should share the raw data as well?
*Thread Reply:* Maybe if you could share it here first @Michael Robinson ?
*Thread Reply:* Yes, planning to include a link to the raw data as well and will share here first
*Thread Reply:* @Harel Shein thanks for the suggestion. Lmk if there's a better way to do this, but here's a link to Google's visualizations: https://docs.google.com/forms/d/1j1SyJH0LoRNwNS1oJy0qfnDn_NPOrQw_fMb7qwouVfU/viewanalytics. And a .csv is attached. Would you use this link on the page or link to a spreadsheet instead?
*Thread Reply:* Going with linking to Google's charts for the raw data for now. LMK if you'd prefer another format, e.g. Google sheet
*Thread Reply:* was just looking at it, looks great @Michael Robinson!
*Thread Reply:* Excellent work, @Michael Robinson! 👏 👏 👏
This looks nice!
Thanks for any feedback on the Mailchimp version of the newsletter special issue before it goes out on Monday:
*Thread Reply:* @Jakub Dardziński @Maciej Obuchowski is the Airflow Provider stuff still current?
*Thread Reply:* yeah, looks current
@Laurent Paris has joined the channel
Hey, created issue around using Spark metrics to increase operational ease of using Spark integration, feel free to comment: https://github.com/OpenLineage/OpenLineage/issues/2402
@Lohit VijayaRenu has joined the channel
Flyte is offering a 20-25-minute speaking slot at their community meeting on March 6th at 9 am PT. They'd like it to be a general introduction to OpenLineage
*Thread Reply:* I can take it if no one else is interested. I’ll be doing a lot of intro to OL presentations in the next few weeks, so I’ll be very practiced by then :)
@Emili Parreno has joined the channel
@Paweł Leszczyński / @tati I'm expecting at least 5 pictures from the meetup today! 😄
*Thread Reply:* could we also get a signup sheet and headcount please? 😬
Hi, is there any reason not to perform a release today as scheduled? I know we released 1.8 only one week ago, but it's the first of the month and @Damien Hawes's PR #2390 to add support for Scala 2.12 and 2.13 in Spark, along with fixes in the Spark and Flink integrations, are unreleased. Would it make more sense to wait for Damien's PR #2395?
*Thread Reply:* This isn't ready.
*Thread Reply:* I'm working on enabling integration tests for the Scala 2.13 variants.
*Thread Reply:* It will take some time, probably Monday / Tuesday next week is my ETA.
*Thread Reply:* Thanks @Damien Hawes, no pressure. But if early next week is your estimate I think it makes sense to wait. So this is GTK
Decathlon showed part of one of their graphs last night
*Thread Reply:* marquez in the wild! 💯💯🚀. thanks for sharing!
*Thread Reply:* some metrics too
*Thread Reply:* they're doing anomaly detection successfully
*Thread Reply:* eg, deprecated tables still being used or tables written in multiple locations
*Thread Reply:* this needs to become a blog post!
*Thread Reply:* Would love to get the slides if they are willing to share!
*Thread Reply:* They've said yes to a blog post. This presentation gets us closer to starting in earnest. I've asked for the slides. Too bad the Confluent organizer wasn't supportive of recording. Maybe next time
Congratulations to our new committer on the team @Damien Hawes!!
*Thread Reply:* hey @Damien Hawes, now you can push your branches into origin and the integration tests are automatically approved 🙂
*Thread Reply:* Thank you for the congratulations @Harel Shein. It is humbling to be nominated and accepted.
@Maciej Obuchowski - haha, thanks!
*Thread Reply:* Congratulations @Damien Hawes! Thank you for all your contributions
https://opensource.googleblog.com/2024/02/announcing-google-season-of-docs-2024.html maybe we should improve our docs? 🙂
Agenda items or discussion topics for Thursday's TSC? @Julien Le Dem @Harel Shein
*Thread Reply:* I can take 5 minutes explaining Spark job hierarchy
*Thread Reply:* Nothing specific on my end
*Thread Reply:* more updates on the Spark side of things? @Paweł Leszczyński / @Damien Hawes may want to talk about the recent additions? we could also discuss the proposals for circuit breakers / metrics?
Anyone have an opinion about creating an OpenLineage "company" rather than a group on LinkedIn? You can get metrics from LinkedIn's API if you have a company rather than a group.
*Thread Reply:* Airflow does it: https://www.linkedin.com/company/apache-airflow/
*Thread Reply:* Spark too https://www.linkedin.com/company/apachespark/
*Thread Reply:* I think that's a good idea
We have agreement from Astronomer to move ahead with the Orbit changes discussed today in the committers sync. So I'll start on the exports asap.
Please follow our new OpenLineage company page on LinkedIn. Evidently, the only way to join the company is to add it to your experience history.
FYI: deadline Feb 25th https://2024.berlinbuzzwords.de/call-for-papers/
*Thread Reply:* @Peter Hicks want to submit a column-level lineage talk?
*Thread Reply:* we'll probably submit something about OpenLineage/Streaming with @Paweł Leszczyński
*Thread Reply:* @Willy Lulciuc want to come to Berlin? 😄
*Thread Reply:* we were thinking with Maciej about some kind of Flink & Streaming around OpenLineage talk, as this can be interesting to the community. I'll prepare abstract next wek
*Thread Reply:* updated the talks project on github
*Thread Reply:* @Maciej Obuchowski i wish, i’d need a sponsor 😅
This open-source community management tool looks interesting as a supplement to Orbit: https://ossinsight.io/analyze/OpenLineage/OpenLineage#overview
*Thread Reply:* you can compare projects side-by-side, which is something Orbit doesn't offer
Metaplane added an Airflow provider to send lineage data to them. It's basically a new connection that extends BaseHook, and users need to proactively send callbacks; not sure why they took that approach. https://www.metaplane.dev/blog/airflow-integration
*Thread Reply:* If you're doing all that work, you could actually just use dag_policy to add it to all the DAGs automatically
*Thread Reply:* https://docs.metaplane.dev/docs/airflow#dag-and-task-lineage I think that's a fairly... unsophisticated approach?
*Thread Reply:* However, if I was redoing the Airflow integration from scratch, I'd seriously rethink using connections instead of OPENLINEAGE_URL or configuring it the way we did
*Thread Reply:* The plugin could load up custom transport types and generate connection types based out of it
*Thread Reply:* Just curious why they did not use listeners... it's not like it's a new feature now, it has been there for like 5 minor releases
*Thread Reply:* we could still add support for Airflow connections with some sort of deprecation of OPENLINEAGE_URL and the current way
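The dag_policy idea mentioned earlier in this thread could be sketched as an Airflow cluster policy (placed in airflow_local_settings.py, it runs once per DAG at parse time). This is a hedged sketch: emit_lineage is a hypothetical callback, and DummyDag stands in for airflow.models.DAG so the snippet is self-contained; real Airflow supports lists of callbacks from 2.6+.

```python
# Hypothetical lineage callback -- in a real setup this would build and
# emit an OpenLineage event; here it is a stub for illustration.
def emit_lineage(context):
    pass

# Airflow calls a function named dag_policy (from airflow_local_settings.py)
# for every parsed DAG, so callbacks can be attached centrally instead of
# asking users to wire them up per-DAG, as the Metaplane docs do.
def dag_policy(dag):
    # Append rather than replace, so user-defined callbacks survive.
    dag.on_success_callback = (dag.on_success_callback or []) + [emit_lineage]
    dag.on_failure_callback = (dag.on_failure_callback or []) + [emit_lineage]

# Stand-in for airflow.models.DAG, keeping the sketch dependency-free.
class DummyDag:
    on_success_callback = None
    on_failure_callback = None

dag = DummyDag()
dag_policy(dag)
print(emit_lineage in dag.on_success_callback)  # -> True
```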
Hey, I was working on a PR updating the docs for Python, Java and Airflow (Spark is probably next), and it hit me that we still have those in two places: README.md inside the package and the openlineage.io site. Both contain largely the same information; sometimes the site has more (e.g. Airflow). Do you think it would be a good idea to just put a redirect to the site in the README.md files for the packages? Maybe add a brief description at the top and then redirect the user to the site? In the long term, maintaining both and keeping them in sync is not an optimal solution imho.
*Thread Reply:* I agree -- in addition to the maintenance burden there's the risk posed by out-of-date/conflicting info for users
*Thread Reply:* I've wanted to remove README.md files as user-facing docs for some time.
However, it might be worth keeping those (maybe under a different name) as purely internal development docs - not related to external APIs, but like: use this incantation to compile the integration.
*Thread Reply:* I added the redirect here https://github.com/OpenLineage/OpenLineage/pull/2448
There was not much information about compilation and other internal stuff, so I think those docs first need to be created; then we can keep them inside the package files under a different name, as Maciej mentioned.
2023 OpenLineage Survey Analysis/takeaways What surprised you or struck you as notable in the 2023 survey data? What would you like to see added, changed or removed in the 2024 version? I need your help to ensure we get the most useful and actionable insights we can from this exercise. I created a doc for sharing opinions/comments, but I'd be happy to discuss it in any forum. Here's the doc, including some initial takeaways, as a starting point: https://docs.google.com/document/d/1aiKtKjcFU0AjS46cow6cbx8EvzV0P2KnGQ4rFD4qKLM/edit?usp=sharing.
@Damien Hawes can we share the Gradle plugins that you've implemented in the Spark `buildSrc` with the Flink integration too? It would cut down on boilerplate, but I'm not sure how we can do this (without copying code) 🙂
*Thread Reply:* You have to publish the plugin to the Gradle Plugin repository
*Thread Reply:* Seems like we could publish the plugin to the local dir first: https://docs.gradle.org/current/userguide/plugins.html#sec:custom_plugin_repositories
@Maciej Obuchowski in our OL spec, we require `_producer` and `_schemaURL`, but in our airflow provider, we only send the `_producer`. Was this an intentional omission of `_schemaURL`?
ahh i think I was confused given it’s marked as `_base_skip_redact` here, but looks like `_schemaURL` is being added… need to verify
*Thread Reply:* schemaURL is always sent, the thing is it’s incorrect in many cases AFAIR
*Thread Reply:* yeah… we’re validating events and most (if not all) don’t have that field populated. does the airflow provider not set it?
*Thread Reply:* the airflow provider uses facets from the `openlineage-python` package
*Thread Reply:* > thing is it’s incorrect in many cases AFAIR more curious about this comment. why would this be the case?
*Thread Reply:* tbh I’m not sure. The URL for the base schema is correct only for https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json and the facets in it (which is not many). For the others, it either just wasn’t done or was mistakenly repeated with the same pattern?
*Thread Reply:* dunno, looks more like a historical approach that wasn’t adjusted at some point
*Thread Reply:* and.. schemaURL is apparently irrelevant, since no one validated it and reported issues
*Thread Reply:* is there an open issue for this? I feel we should have some guidance here or not require it, but we should decide what we do here
*Thread Reply:* elephant in the room
*Thread Reply:* no, I don’t think there is one
*Thread Reply:* ok, I’ll open one. we’re ingesting events into kafka and were incorrectly applying validation to events based on the spec
*Thread Reply:* @Julien Le Dem ping to address the elephant above 😉 we’re talking about it in this thread
*Thread Reply:* I see. btw, generating python facets from the json schema would be the best solution, it was too complex so far ;_;
*Thread Reply:* ok, I’ll summarize our discussion in the issue
*Thread Reply:* thanks Willy!
*Thread Reply:* @Willy Lulciuc double-checking - given the example of SQLJobFacet should the schema URL be set to: https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json or https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json#/$defs/SQLJobFacet or https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json#/$defs/SQLJobFacet?
or in other words - should it be the same as it is in the Java client? 🙂 which is the last one
Abdallah (Decathlon) made a release request today. https://openlineage.slack.com/archives/C01CK9T7HKR/p1708514231690979
*Thread Reply:* I would ask that if we make a release today, we include the Scala 2.13 support for Spark (and merge the PR for the docs)
*Thread Reply:* I guess we have to decide: is splitting Iceberg off from the main code a reason to hold back the release?
*Thread Reply:* Thoughts @Maciej Obuchowski?
*Thread Reply:* I think we can do a release today/tomorrow, but having functionality removed makes this a much harder choice
*Thread Reply:* What functionality would be removed?
*Thread Reply:* Iceberg support? Or do you propose something else?
*Thread Reply:* Oh, I was under the impression that Iceberg support wasn't going to be removed. Instead the direct dependencies on Iceberg in the core code were being removed, and bundled into their own module, but at the end of the day, the project would still contain classes capable of dealing with Iceberg.
*Thread Reply:* I understood that without https://github.com/OpenLineage/OpenLineage/pull/2437/files there will be no support for Iceberg for 2.13
*Thread Reply:* so I guess we can go and then follow up with next release soon?
*Thread Reply:* There should still be Iceberg support for 2.13
*Thread Reply:* If there wasn't, that would hurt us, and by us, I mean the team I belong to @ Booking and my partner teams.
*Thread Reply:* Basically, if I understand Mattia's direction, we want to say: OK, OpenLineage has been tested against these versions of Iceberg and found to be working.
*Thread Reply:* Just to confirm, are we waiting for this one https://github.com/OpenLineage/OpenLineage/pull/2446?
*Thread Reply:* We're waiting for comments on this: https://openlineage.slack.com/archives/C01CK9T7HKR/p1708349868363669
*Thread Reply:* No-one has left comments
*Thread Reply:* Which means, at least in my opinion, no-one else has anything to say.
*Thread Reply:* +1. It's been over 48 hrs, so seems safe to go ahead
*Thread Reply:* @Maciej Obuchowski - I'm going to merge that PR, ye?
*Thread Reply:* (First it needs an approval)
*Thread Reply:* Pawel is OOO, I believe
*Thread Reply:* Aye, but I believe @Maciej Obuchowski can approve.
*Thread Reply:* :gh_approved:
*Thread Reply:* Working on the changelog now
*Thread Reply:* oops, forgot about the release vote. please +1
*Thread Reply:* it got merged 👀
*Thread Reply:* amazing feedback on a 10k line PR 😅
*Thread Reply:* maybe they have policy that feedback starts from 10k lines
*Thread Reply:* it wasn’t enough
*Thread Reply:* too big to review, LGTM
I just noticed this. `shared` should not have a dependency on spark. 👀
*Thread Reply:* sounds like an easy fix? we have time
*Thread Reply:* Easy fix? Yeah - not at all.
*Thread Reply:* Allright, putting the release on hold, then
*Thread Reply:* yeah - we could end up with Spark 2 dependency when using it in Spark 3 context, and that's not good
*Thread Reply:* oh, unless you mean compile time dependency on Spark 2
*Thread Reply:* then no, we need to have it, everything in the `lifecycle` package depends on it 🙂
The idea is that it contains code common to all Spark versions, that Spark itself mostly does not change - and have spark2/3/... directories for things that specifically diverge from baseline
*Thread Reply:* I assume Paweł meant it should not have a Spark dependency, as in it should not depend on a particular Spark version
*Thread Reply:* Correct me if I'm wrong, but it sounds safe to proceed. So here's the changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2452
*Thread Reply:* It's safe
*Thread Reply:* @Michael Robinson let's wait with release till we solve the intermittent test failing issue https://openlineage.slack.com/archives/C065PQ4TL8K/p1708609977907359
Any idea/preference where to put very specific doc information? For example: if you're running Spark in this specific way, do this. Separate doc page seems like overkill, but I'm not sure where to put something like this where it would be discoverable
*Thread Reply:* Maybe the information could go on a new stub about "Special Cases" or something (not a great title, but I don't know the use case). That way the page isn't just about the exceptional case?
*Thread Reply:* Can we not trust search?
@Michael Robinson do we know when the next release will be going out? I’d sneak in some feedback this week on: • https://github.com/OpenLineage/OpenLineage/pull/2371 • and some of the circuit breaker work if it’s not too late @Paweł Leszczyński @Maciej Obuchowski 😉
*Thread Reply:* it almost happened today, so tomorrow? 🤞
*Thread Reply:* still provide feedback anyway @Willy Lulciuc!
*Thread Reply:* will do! I’ll get some feedback in tmr
Hey, I made a PR that updates the OL Airflow Provider documentation (removing outdated stuff, moving some content from the current OL docs, adding new info). It's not a small one, but I think it's worth the time. Any feedback is highly appreciated; let me know if something is missing, is not clear or is simply wrong 🙂
@Damien Hawes there have been some random test failures after merging the last 2.13 PR, for example https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9437/workflows/4eda8d67-3bd1-4527-84fa-88c19e6774bd/jobs/179622
```
> Task :app:copyIntegrationTestFixtures
Too long with no output (exceeded 10m0s): context deadline exceeded
```
hanging on `copyIntegrationTestFixtures`?
*Thread Reply:* Ah, it happened also before latest PR https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/179357
*Thread Reply:* I wonder if the disk is full of that particular executor
*Thread Reply:* It's always failing at the `:app:copyIntegrationTestFixtures` step
*Thread Reply:* Because I'm not able to replicate this on my local
*Thread Reply:* yeah, I can't replicate it either
*Thread Reply:* I'm rerunning with SSH on CI, will take a look at disk space
*Thread Reply:* OK. I was literally about to edit the CI to run df -H
*Thread Reply:* circleci@ip-10-0-52-168:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 146G 13G 133G 9% /
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 1.6G 836K 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/nvme0n1p15 105M 6.1M 99M 6% /boot/efi
*Thread Reply:* And if you run df -H /home/circleci/openlineage/integration/spark
*Thread Reply:*
```
circleci@ip-10-0-52-168:~$ df -H /home/circleci/openlineage/integration/spark
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       156G   15G  142G  10% /
```
*Thread Reply:* I guess disk usage is increasing, but very slowly?
```
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273256 137954120  10% /
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273436 137953940  10% /
```
*Thread Reply:* the max memory seems very small?
*Thread Reply:* let's try it? https://github.com/OpenLineage/OpenLineage/pull/2454
*Thread Reply:* ⬆️ ⬆️ it died without filling the disk
*Thread Reply:* Seems to be running again
*Thread Reply:* This one will hang 😐 https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9441/workflows/4037b224-d6f6-4213-afd5-5d7884007e53/jobs/179711
*Thread Reply:* I'm going to push a change to your branch
*Thread Reply:* To see if we can skip the copy
*Thread Reply:* even before, I reran it with SSH and it managed to copy the dependencies after all
*Thread Reply:* it took a lot of time tho
*Thread Reply:* Could you tell which dependencies it was copying?
*Thread Reply:* Like the fixture dependency
*Thread Reply:* or the container dependencies?
*Thread Reply:* not really, I just looked at df numbers
*Thread Reply:*
```
circleci@ip-10-0-113-50:~$ df -H
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       156G   15G  141G  10% /
devtmpfs        8.3G     0  8.3G   0% /dev
tmpfs           8.3G     0  8.3G   0% /dev/shm
tmpfs           1.7G  857k  1.7G   1% /run
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15 110M  6.4M  104M   6% /boot/efi
circleci@ip-10-0-113-50:~$ df -H
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       156G   22G  135G  14% /
devtmpfs        8.3G     0  8.3G   0% /dev
tmpfs           8.3G     0  8.3G   0% /dev/shm
tmpfs           1.7G  1.3M  1.7G   1% /run
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15 110M  6.4M  104M   6% /boot/efi
```
*Thread Reply:* I wonder if the "copyIntegrationTestFixtures" was a red herring
*Thread Reply:* And it's actually the "copyDependencies" step
*Thread Reply:* Because the fixtures JAR is tiny
*Thread Reply:* It should take seconds, at most.
*Thread Reply:* your PR failed on your favorite step, spotless 😂
*Thread Reply:* One of these days, I am going to make a pre-commit or something
*Thread Reply:* we have pre-commit config but it's focused on Python parts https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2969c8f/.pre-commit-config.yaml#L1
*Thread Reply:* spotless is such a low hanging fruit tho...
*Thread Reply:* I did it for one of my local repos
*Thread Reply:* But I didn't get it quite right
*Thread Reply:* Perhaps I should run "spotlessCheck" on commit
*Thread Reply:* At least that way, I know my commit will not be committed if the check fails
*Thread Reply:* https://github.com/jguttman94/pre-commit-gradle
*Thread Reply:* 😞 https://github.com/pre-commit/pre-commit/issues/1110#issuecomment-518939116
*Thread Reply:* Or I should just use intellij to commit with the reformat code option
*Thread Reply:* This is still getting stuck
*Thread Reply:* I wonder if it is the download of the archives
*Thread Reply:* Those archives are "big"
*Thread Reply:* looks like `:app:copySparkBinariesSpark350Scala212`, that's right after it
*Thread Reply:* 300mb should not take that much anyway
*Thread Reply:* It's the downloading from the Apache archive
*Thread Reply:* https://archive.apache.org
*Thread Reply:* It can be really slow at times
*Thread Reply:* > The main archive of all public software releases of the Apache Software Foundation. This is simply a copy of the main distribution directory with the only difference that nothing will be ever removed over here. If you are looking for current software releases, please visit one of our numerous mirrors. Do note that heavy use of this service will result in immediate throttling of your download speeds to either 12 or 6 mbps for the remainder of the day, depending on severity. Continuous abuse (to the tune of more than 40 GB downloaded per week) will cause an automatic ban, so please tune your services to this fact.
🤔
*Thread Reply:* Another reason why we need a container registry
*Thread Reply:* then we'll get rate limited by it 🙂
*Thread Reply:* The problem is the mirrors don't contain all of the versions of Spark
*Thread Reply:* I think they go back to 3.3.4
*Thread Reply:* I think we could start using quay.io for that:
*Thread Reply:* > quay.io does not restrict anonymous pulls against its repositories (either public or private) and only rate limits in the most severe circumstances to maintain service levels (e.g. tens of requests per second from the same IP address).
*Thread Reply:* Aye, but who do we speak to in order to provision that?
*Thread Reply:* Maybe, just maybe, we can use circle ci's cache mechanism >.>
*Thread Reply:* I wonder, @Maciej Obuchowski - the GCP project that exists, can we not use the container registry in that one?
*Thread Reply:* quay.io is free, GCR is paid and not that cheap https://cloud.google.com/artifact-registry/pricing
*Thread Reply:* I see this: https://quay.io/plans/
*Thread Reply:* > Public repositories are always free. 🙂
*Thread Reply:* I'm setting up an `OpenLineage` organization that we could push the images to
*Thread Reply:* Though, I don't know an "organization email" to use
*Thread Reply:* well, I was faster, I got the `openlineage` name 🙂
*Thread Reply:* Added one of mine, it's possible to change it later
*Thread Reply:* it should be possible now to log in using
```
docker login -u="${QUAY_ACCOUNT_ID}" -p="${QUAY_ACCOUNT_TOKEN}" quay.io
```
from a CI task which is marked `integration-tests`
*Thread Reply:* OK. But I'll need to build those images first, and push them.
*Thread Reply:* and push to https://quay.io/repository/openlineage/spark
*Thread Reply:* as in, the user `QUAY_ACCOUNT_ID` has permission to write there
*Thread Reply:* Can you add me to the org, so I can push the images?
*Thread Reply:* (I still have the binaries downloaded on my local)
*Thread Reply:* using email damien.hawes@booking.com?
*Thread Reply:* check the mail 🙂
*Thread Reply:* I can see spark-3.2.4-scala-2.13
*Thread Reply:* Next one is almost there
*Thread Reply:* These are some chunky images.
*Thread Reply:* Earlier I was thinking of pushing it on CI: checking if the Spark tag exists, then creating the image and pushing it if it does not
*Thread Reply:* but we need to do this only once
*Thread Reply:* and if we want to add support for another Spark/Scala version, we still need to do some work manually
*Thread Reply:* so I guess this does not matter?
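The "only build if the tag is missing" check discussed above could be sketched roughly like this (a hypothetical Python helper; the quay.io endpoint path and payload shape are assumptions, and the network call is left commented out so only the pure decision logic runs):

```python
# Sketch: decide whether a Spark image tag already exists on quay.io,
# so CI would only build and push missing tags. Hypothetical helper names.
import json
from urllib.request import urlopen

# Assumed public tag-listing endpoint for the openlineage/spark repository.
TAGS_URL = "https://quay.io/api/v1/repository/openlineage/spark/tag/"

def tag_exists(tags_payload: dict, tag: str) -> bool:
    # tags_payload is the parsed JSON body returned by GET TAGS_URL.
    return any(t.get("name") == tag for t in tags_payload.get("tags", []))

# Real usage would fetch the payload first:
# payload = json.load(urlopen(TAGS_URL))
sample = {"tags": [{"name": "spark-3.2.4-scala-2.13"}]}
print(tag_exists(sample, "spark-3.2.4-scala-2.13"))
print(tag_exists(sample, "spark-3.5.0-scala-2.12"))
```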
*Thread Reply:* The process is fairly trivial.
*Thread Reply:* but it would still be good to have documentation on how to create the image, so you're not bothered by questions next time 🙂
*Thread Reply:* And we can make an `openlineage-spark-docker` directory
*Thread Reply:* Place the gradle in there
*Thread Reply:* It should be a project that changes very rarely
*Thread Reply:* Good find on the quay.io btw
*Thread Reply:* OK. All images have been pushed.
*Thread Reply:* yep, I can see all of them
*Thread Reply:* People still love to use 2.4.8 🙂
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2455 should be enough to check if it works?
*Thread Reply:* > It should be a project that changes very rarely we should be building those images on minor Spark releases too
*Thread Reply:* Yes - that's what the spline folks did
*Thread Reply:* but they never supported 2.13
*Thread Reply:* WARN tc.quay.io/openlineage/spark:spark-3.3.4-scala-2.12 - The architecture 'arm64' for image 'quay.io/openlineage/spark:spark-3.3.4-scala-2.12' (ID sha256:dbdc0c8a3e1b182004c3c850c2ecb767b76cc14e55e3e994a34356630e689e86) does not match the Docker server architecture 'amd64'. This will cause the container to execute much more slowly due to emulation and may lead to timeout failures.
*Thread Reply:* you have only arm containers locally?
*Thread Reply:* I'm pushing amd64 containers at the moment
*Thread Reply:* yeah it adds another dimension to the problem
*Thread Reply:* They'll probably take like 15 minutes or so
*Thread Reply:* not sure it did exactly what we want but probably okay for now
*Thread Reply:* it would be best if this was multi-arch build I think
*Thread Reply:* anyway it's not a job for now, rerunning the tests and I'm finishing for today
*Thread Reply:* OK - I have extracted the logic for building the images. It now also performs a multi-platform build, targeting `linux/amd64` and `linux/arm64`. This should be enough for the CI/CD pipeline, folks developing on Linux, and folks developing on Apple ARM chips.
The PR is here: https://github.com/OpenLineage/OpenLineage/pull/2456
The images with multiple manifests are here: https://quay.io/repository/openlineage/spark?tab=tags
*Thread Reply:* To explain the situation for people not following the issue:
We had a problem with CI where the download of a 300MB archive from archive.apache.org took over 10 minutes, probably because we were rate limited. That failed our integration tests and blocked the release.
We used those archives to create the docker images used for integration tests - compiled Spark of a particular version with a particular version of Scala.
The solution to that problem was manually prebuilding the images and pushing them to a free quay.io repository. This is not a problem, since bumping the version of Spark that we test against also requires manual action, and because @Damien Hawes provided a Gradle task to complete the work.
I've created an `openlineage` organization on quay.io where we can push the images - and any other images we could want, for example jupyter already configured with the Spark integration to allow people easier experimentation with OpenLineage.
If no one has any philosophical problems with that solution, I would like to see a few committer volunteers added as admins to the quay.io organization - to increase the bus factor. @Julien Le Dem @Paweł Leszczyński @Damien Hawes - do you want to be added there?
*Thread Reply:* @Maciej Obuchowski - sure.
*Thread Reply:* yes! Thank you Maciej. As long as there’s clear doc and an easy one liner to update those, this sounds good to me. I think we need to pay extra attention on limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo). Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
*Thread Reply:* > Could we add image building and storing in quay.io as part of our CI when the needed image is not present there?
We discussed this and decided it's already required to do manual work to support an additional Spark version, so automating this won't give us much
> I would love to have some info in the docs on what has to be done to support Spark 3.6 once it gets released. Especially, how can one publish a 3.6 image to quay.io?
There is a readme here: https://github.com/OpenLineage/OpenLineage/pull/2456/files#diff-44ca475a04d6a92886f82dd27b47d30c8e57f518aa3dbc467feef43ec1c57638
*Thread Reply:* thanks Maciej
*Thread Reply:* > I think we need to pay extra attention to limiting write access, as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo).
@Julien Le Dem Yes, only authorized users (committers) can upload images. CI won't write images, just read them; they would be pushed by committers before execution
> Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
Docker verifies the SHA of downloaded images. Do you mean some additional mechanism to avoid potential problems with a compromised committer?
*Thread Reply:* The sha could be saved in the repo and compared so that it can not be changed independently by someone who would have gained access to the credentials.
*Thread Reply:* @Julien Le Dem that would make a lot of sense if the same commit that changes the images could not change the SHA 🙂
*Thread Reply:* unless we've made those SHAs part of something external, for example CircleCI config
*Thread Reply:* but TBH I think it's low risk, CircleCI would limit us fast if somebody would for example put crypto miner there
*Thread Reply:* to me the risk is more to introduce vulnerabilities/backdoors in the OpenLineage released artifact through pushing a cached image that modifies the result of the build.
*Thread Reply:* The idea of saving the image signature in the repo is that you can not use a new image in the build without creating a new commit and traceability.
CFP closes on April 30 https://events.linuxfoundation.org/open-source-summit-europe/
*Thread Reply:* Only a few days after Airflow Summit, September 10-12, 2024 🤔
New communication channel: https://medium.com/@openlineageproject
*Thread Reply:* More to come...
I might be the only one running into this issue (for now). I plan on opening a PR with a fix this weekend, but if someone wants to pick it up you're more than welcome to: https://github.com/OpenLineage/OpenLineage/issues/2458
This is a GCP-specific question, but does anyone know the answer? https://openlineage.slack.com/archives/C01CK9T7HKR/p1708010167626709?thread_ts=1707920807.530409&cid=C01CK9T7HKR
FYI: The release of 1.9.0 did not go through
Issue: https://github.com/OpenLineage/OpenLineage/issues/2467 PR: https://github.com/OpenLineage/OpenLineage/pull/2468
*Thread Reply:* The biggest problem with a release, as always, is that you can't test it any other way than by running it 😞
It also seems like, despite the fact that the spark step completed, there was a silent failure or something. I don't see the artefacts on central.
*Thread Reply:* @Michael Robinson needs to manually promote them
*Thread Reply:* @Damien Hawes looking into this today
*Thread Reply:* @Michael Robinson you can rerun release now
*Thread Reply:* @Damien Hawes the release is out. comms to follow shortly
Feedback and input requested for this month's newsletter. I've added sections for the Flink and Spark integrations. Please lmk what you think about the "highlights" I've chosen for these and for Airflow if you have a moment between now and EOD Wednesday. Thanks. https://docs.google.com/document/d/15caPR4q7dOPs6co2x0q5hYSX65ZhHHdhKrFP7dVSRPI/edit?usp=sharing
*Thread Reply:* Would it be okay to highlight 2 new committers? 🙂
*Thread Reply:* 👆why I ask for input! thank you @Jakub Dardziński
Snowflake taking a play from the Databricks playbook? https://www.snowflake.com/en/data-cloud/horizon/
gotta skip today meeting. I hope to see you all next week!
The meetup I mentioned about OpenLineage/OpenTelemetry: https://x.com/J_/status/1565162740246671360 I speak in English but the other two speakers speak in Hebrew
*Thread Reply:* the slides from my part: https://docs.google.com/presentation/d/1BLM2ocs2S64NZLzNaZz5rkrS9lHRvtr9jUIetHdiMbA/edit#slide=id.g11e446d5059_0_1055
*Thread Reply:* thanks for sharing that, that otel to ol comparison is going to be very useful for me today :)
Could use another pair of eyes on this month's newsletter draft if anyone has time today
*Thread Reply:* LGTM 🙂
Hey, I created new Airflow AIP. It proposes instrumenting Airflow Hooks and Object Storage to collect dataset updates automatically, to allow gathering lineage from PythonOperator and custom operators. Feel free to comment on Confluence https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-62+Getting+Lineage+from+Hook+Instrumentation or on Airflow mailing list: https://lists.apache.org/thread/5chxcp0zjcx66d3vs4qlrm8kl6l4s3m2
Hey, does anyone want to add anything here (PR that adds AWS MSK IAM transport)? It looks like it's ready to be merged.
did we miss a step in publishing 1.9.1? going here (https://search.maven.org/remote_content?g=io.openlineage&a=openlineage-spark&v=LATEST) gives me the 1.8 release
*Thread Reply:* oh, this might be related to having 2 scala versions now, because I can see the 1.9.1 artifacts
*Thread Reply:* we may need to fix the docs then https://openlineage.io/docs/integrations/spark/quickstart/quickstart_databricks
*Thread Reply:* another place 🙂
*Thread Reply:* https://github.com/OpenLineage/docs/pull/299
Hi, here's a tentative agenda for next week's TSC (on Wednesday at 9:30 PT):
*Thread Reply:* I thought @Paweł Leszczyński wanted to present?
*Thread Reply:* What was the topic? Protobuf or built-in lineage maybe? Or the many docs improvements lately?
*Thread Reply:* I think so? https://github.com/OpenLineage/OpenLineage/pull/2272
*Thread Reply:* Imagine there are lots of folks who would be interested in a presentation on that
*Thread Reply:* There are two things worth presenting: the circuit breaker and/or built-in lineage (once it gets merged).
*Thread Reply:* updating the agenda
is there a reason why facet objects have a `_schemaURL` property but `BaseEvent` has `schemaURL`?
*Thread Reply:* yeah, we use `_` to avoid naming conflicts in a facet
*Thread Reply:* same goes for producer
*Thread Reply:* Facets have user defined fields. So all base fields are prefixed
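The convention above can be illustrated with a minimal sketch (a hypothetical custom facet using plain dataclasses rather than the client's actual attrs-based base classes; all names and URLs below are made up for illustration):

```python
from dataclasses import dataclass

# Hypothetical values, for illustration only.
PRODUCER = "https://github.com/my-org/my-integration/v1.0.0"
SCHEMA_URL = "https://example.com/schemas/MyCustomFacet.json"

@dataclass
class MyCustomFacet:
    # User-defined fields: the facet author can pick any names here...
    rowCount: int
    engine: str
    # ...while the base fields carry the reserved '_' prefix, so a
    # user-chosen field named "producer" or "schemaURL" can never
    # collide with them.
    _producer: str = PRODUCER
    _schemaURL: str = SCHEMA_URL

facet = MyCustomFacet(rowCount=42, engine="spark")
print(facet._producer)
```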
*Thread Reply:* it should be made more clear… I recently ran into the issue when validating OL events
*Thread Reply:* it might be another missing point, but we set `_producer` in `BaseFacet`:
```python
def __attrs_post_init__(self) -> None:
    self._producer = PRODUCER
```
but we don’t do that for `producer` in `BaseEvent`
*Thread Reply:* is this supposed to be like that?
*Thread Reply:* I’m kinda lost 🙂
*Thread Reply:* We should set producer in baseevent as well
*Thread Reply:* The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
*Thread Reply:* > The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
right, it doesn’t require adding `_`, it just helps in making the distinction
and also for this reason:
> Facets have user defined fields. So all base fields are prefixed. Base events do not
*Thread Reply:* Since users can create custom facets with whatever fields, we just tell them that the `_` prefix is reserved.
*Thread Reply:* So the underscore prefix is a mechanism specific to facets
*Thread Reply:* last question: we don’t want to block users from setting their own `_producer` field? it seems the only way now is to use the `openlineage.client.facet.set_producer` method to override the default, you can’t just do `RunEvent(…, _producer='my_own')`
*Thread Reply:* The idea is the producer identifies the code that generates the metadata. So you set it once and all the facets you generate have the same
*Thread Reply:* mhm, probably you don’t need to use several producers (at least) per Python module
*Thread Reply:* In airflow each provider should have its own for the facets they produce
*Thread Reply:* just searched for `set_producer` in the current docs - no results 😨
*Thread Reply:* a number of things will get to the right track after I’m done with generating code 🙂
*Thread Reply:* Thanks for looking into that. If you can fix the doc by adding a paragraph about that, that would be helpful
*Thread Reply:* I can create an issue at least 😂
*Thread Reply:* there you go: https://github.com/OpenLineage/docs/issues/300 if I missed something please comment
I feel like our getting started with openlineage page is mostly a getting started with Marquez page. but I'm also not sure what should be there otherwise.
*Thread Reply:* https://openlineage.io/docs/guides/spark ?
*Thread Reply:* Unfortunately it's probably not that "quick" given the setup required..
*Thread Reply:* Maybe better? https://openlineage.io/docs/integrations/spark/quickstart/quickstart_local
*Thread Reply:* yeah, that's where I was struggling as well. should our quickstart be platform specific? that also feels strange.
Quick question: for the `spark.openlineage.facets.disabled` property, why do we need to include `[;]` in the value? Why can't we use `,` as the delimiter? Why do we need `[` and `]` to enclose the string?
*Thread Reply:* There was some concrete reason AFAIK right @Paweł Leszczyński?
*Thread Reply:* We do have logic that converts Spark conf entries to OpenLineageYaml without needing to understand their content. I think `[]` was added for this reason: to know that a Spark conf entry has to be translated into an array.
Initially disabled facets were just separated by `;`. Why not a comma? I don't remember if there was any problem with that.
https://github.com/OpenLineage/OpenLineage/pull/1271/files -> this PR introduced it
https://github.com/OpenLineage/OpenLineage/blob/1.9.1/integration/spark/app/src/main/java/io/openlineage/spark/agent/ArgumentParser.java#L152 -> this code checks if a spark conf value is of array type
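As a rough illustration of that check (a Python sketch of the logic described above, not the actual Java code in ArgumentParser): the brackets are what lets the parser tell "this conf value is an array" apart from a plain string, and `;` separates the elements.

```python
def is_array_type(value: str) -> bool:
    # Bracket-enclosed values, e.g. "[spark_unknown;spark.logicalPlan]",
    # are treated as arrays; anything else stays a plain string.
    value = value.strip()
    return value.startswith("[") and value.endswith("]")

def parse_array(value: str) -> list:
    # Drop the enclosing brackets, then split elements on ';'.
    inner = value.strip()[1:-1]
    return [item.strip() for item in inner.split(";") if item.strip()]

print(parse_array("[spark_unknown;spark.logicalPlan]"))
# -> ['spark_unknown', 'spark.logicalPlan']
```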
Hi team, do we have any proposal or previous discussion of Trino OpenLineage integration?
*Thread Reply:* There is old third-party integration: https://github.com/takezoe/trino-openlineage
It has the right idea to use EventListener, but I can't vouch that it works
*Thread Reply:* Thanks. We are investigating the integration in our org. It will be a good starting point 🙂
*Thread Reply:* I think the ideal solution would be to use EventListener. So far we only have very basic integration in Airflow's TrinoOperator
*Thread Reply:* The only thing I haven't really checked is what the real possibilities are for EventListener in terms of catalog details discovery, e.g. what the database connection for a catalog is.
*Thread Reply:* Thanks for calling this out. We will evaluate and post some observations in the thread.
*Thread Reply:* Thanks, Peter. Hey Maciej/Jakub, could you please share the process to follow for contributing a Trino OpenLineage integration? (Design doc and issue?)
There was an issue for the Trino integration but it was closed recently: https://github.com/OpenLineage/OpenLineage/issues/164
*Thread Reply:* It would be great to see a design doc and maybe a POC if possible. I've reopened the issue for you.
If you get agreement around the design, I don't think there are more formal steps needed, but maybe @Julien Le Dem has other ideas
*Thread Reply:* Trino has their plugins directory btw: https://github.com/trinodb/trino/tree/master/plugin including event listeners like: https://github.com/trinodb/trino/tree/master/plugin/trino-mysql-event-listener
*Thread Reply:* Thanks Maciej and Jakub. Yes, the integration will be done with Trino's event listener framework, which has details about the query, source and destination datasets, etc.
&gt; It would be great to see a design doc and maybe a POC if possible. I've reopened the issue for you.
Thanks for reopening the issue. We will add the design doc and POC to the issue.
*Thread Reply:* I agree with @Maciej Obuchowski, a quick design doc followed by a POC would be great. The integration could either live in OpenLineage or Trino but that can be discussed after the POC.
*Thread Reply:* (obviously, adding it to the Trino repo would require approval from the Trino community)
*Thread Reply:* Gentlemen, we are also actively looking into this topic with the same repo from @takezoe as our base. I have submitted a PR to revive this project - it does work, and the POC is there in the form of a docker-compose.yaml deployment 🙂 Some obvious things are still missing (like Kafka output instead of the API), but I think it's a good starting point, and it's compatible with the latest Trino and OL
*Thread Reply:* Thanks for laying the foundation for the implementation. Based on it, I feel @Alok would still participate and contribute to it. How about creating a design doc and listing all of the possible TBDs, as @Julien Le Dem suggested?
*Thread Reply:* Adding @takezoe to this thread. Thanks for your work on a Trino integration and welcome!
*Thread Reply:* throwing the CFP for the Trino conference here in case any one of the contributors want to present there https://sessionize.com/trino-fest-2024
*Thread Reply:* I'm also very happy to help with an idea for an abstract
*Thread Reply:* Hey Harel, just FYI we are already engaged with the Trino community to have a talk about the Trino OpenLineage integration and have submitted an abstract for review.
*Thread Reply:* once you release the integration, please add a reference about it to OpenLineage docs! https://github.com/OpenLineage/docs
*Thread Reply:* I think it's ready for review https://github.com/trinodb/trino/pull/21265 just with API sink integration, additional features can be added at @Alok's convenience as next PRs
Hey, there's a discrepancy between documented and actual behavior (the `disabled` option)
*Thread Reply:* I believe we should not extract or emit any OpenLineage events if this option is used
*Thread Reply:* I'm for option 2, don't send any event from task
*Thread Reply:* @Jakub Dardziński do you see any use case for not extracting metadata but still emitting events?
*Thread Reply:* The use case AFAIK was old SnowflakeOperator bug, we wanted to disable the collection there, since it zombified the task. The events being emitted still gave information about status of the task as well as non-dataset related metadata
*Thread Reply:* but I think it's less relevant now
*Thread Reply:* ^ this and you might want to have information about task execution because OL is a backend for some task-tracking system
*Thread Reply:* Hm, I believe users don't expect us to spend time processing/extracting OL events if this configuration is used. It's the documented behaviour
*Thread Reply:* the question is if we should change docs or behaviour
*Thread Reply:* I believe the latter
Hi, here's the
*Thread Reply:* Looks like a great agenda! Left a couple of comments
*Thread Reply:* @Michael Robinson will you be able to facilitate or do you need help?
*Thread Reply:* I'm also missing from the committer list, but can't comment on slides 🙂
*Thread Reply:* Sorry about that @Kacper Muda. Gave you access just now
*Thread Reply:* We probably need to add you to lists posted elsewhere... I'll check
https://github.com/open-metadata/OpenMetadata/pull/15317 👀
*Thread Reply:* this is awesome
*Thread Reply:* it looks like they use temporary deployments to test...
*Thread Reply:* yeah the GitHub history is wild
Hi, I'm at the conference hotel and my earbuds won't pair with my new mac for some reason. Does the agenda look good? Want to send out the reminders soon. I'll add the OpenMetadata news!
*Thread Reply:* I think we can also add the Datahub PR?
*Thread Reply:* @Paweł Leszczyński prefers to present only the circuit breakers
*Thread Reply:* https://github.com/datahub-project/datahub/pull/9870/files
It's been a while since we've updated the twitter profile. Current description: "A standard api for collecting Data lineage and Metadata at runtime." What would you think of using our website's tagline: "An open framework for data lineage collection and analysis." Other ideas?
can someone grant me write access to our forked `sqlparser-rs` repo?
*Thread Reply:* I should probably add the committer group to it
*Thread Reply:* I have made the committer group maintainer on this repo
https://github.com/OpenLineage/OpenLineage/pull/2514 small but mighty 😉
Regarding the approved release, based on the additions it seems to me like we should make it a minor release (so 1.10.0). Any objections? Changes are here: https://github.com/OpenLineage/OpenLineage/compare/1.9.1...HEAD
We encountered a case of a START event exceeding 2MB in Airflow. This was traced back to an operator with unusually long arguments and attributes. Further investigation revealed that our Airflow events contain redundant data across different facets, leading to unnecessary bloating of event sizes (those long attributes and args were attached three times to a single event). I proposed removing some redundant facets and refining the operator's attribute-inclusion logic within AirflowRunFacet. I am not sure how breaking this change is, but some systems might depend on the current setup. Suggesting immediate removal might not be the best approach, and I'd like to know your thoughts. (A similar problem exists within the Airflow provider.) CC @Maciej Obuchowski @Willy Lulciuc @Jakub Dardziński
https://github.com/OpenLineage/OpenLineage/pull/2509
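As an illustration of how one might spot such bloat before emitting, a hedged Python sketch (hypothetical helper, not part of any OpenLineage client) that measures the serialized size of each run facet in a plain JSON event dict:

```python
import json

def facet_sizes(event: dict) -> dict:
    """Serialized size (bytes) of each run facet in an OL event dict.
    Sketch for spotting bloated or duplicated payloads; assumes the
    plain JSON event layout, not any particular client class."""
    facets = event.get("run", {}).get("facets", {})
    return {name: len(json.dumps(facet)) for name, facet in facets.items()}

# toy event: one facet carrying very long operator args, one small facet
event = {
    "run": {"facets": {
        "airflow": {"task": {"args": "x" * 1000}},
        "parent": {"run": {"runId": "abc"}},
    }}
}
sizes = facet_sizes(event)
big = [name for name, size in sizes.items() if size > 500]
print(big)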
As mentioned during yesterday's TSC, we can't get insight into DataHub's integration from the PR description in their repo. And it's a very big PR. Does anyone have any intel? PR is here: https://github.com/datahub-project/datahub/pull/9870
Changelog PR for 1.10 is RFR: https://github.com/OpenLineage/OpenLineage/pull/2516
@Julien Le Dem @Paweł Leszczyński Release is failing in the Java client job due to (I think) the version of spotless:
```
Could not resolve com.diffplug.spotless:spotless-plugin-gradle:6.21.0.
Required by:
    project : > com.diffplug.spotless:com.diffplug.spotless.gradle.plugin:6.21.0

No matching variant of com.diffplug.spotless:spotless-plugin-gradle:6.21.0 was found. The consumer was configured to find a library for use during runtime, compatible with Java 8, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '8.4'
```
*Thread Reply:* @Michael Robinson https://github.com/OpenLineage/OpenLineage/pull/2517
fix to broken main: https://github.com/OpenLineage/OpenLineage/pull/2518
*Thread Reply:* Thanks, just tried again
*Thread Reply:* ? it needs approve and merge 😛
*Thread Reply:* Oh oops disregard
There's an issue with the Flink job on CI:
```
** What went wrong:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find io.**********************:**********************_sql_java:1.10.1.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - https://packages.confluent.io/maven/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - file:/home/circleci/.m2/repository/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
     Required by:
         project : > project :shared
         project : > project :flink115
         project : > project :flink117
         project : > project :flink118
```
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2521
*Thread Reply:* @Jakub Dardziński still awake? 🙂
*Thread Reply:* it’s just approval bot
*Thread Reply:* created issue on how to avoid those in the future https://github.com/OpenLineage/OpenLineage/issues/2522
*Thread Reply:* https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/188526 I lack emojis on this server to fully express my emotions
*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1710454645059659 you might have missed that
*Thread Reply:* merge -> rebase -> problem gone
*Thread Reply:* PR to update the changelog is RFR @Jakub Dardziński @Maciej Obuchowski: https://github.com/OpenLineage/OpenLineage/pull/2526
https://github.com/OpenLineage/OpenLineage/pull/2520 It’s a long-awaited PR - feel free to comment!
OpenLineage is trending upward on OSSRank. Please vote!
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ParentRunFacet.json#L20
Here the format is `uuid`; however, if you follow the logic for the parent id in the current dbt integration, you might discover that the parent run facet is assigned the value of the DAG's run_id (which is not a uuid).
@Julien Le Dem, what has higher priority? I think lots of people are using the `dbt-ol` wrapper with the current `lineage_parent_id` macro
*Thread Reply:* It is a uuid because it should be the id of an OL run
where can I find who has write access to OL repo?
*Thread Reply:* Settings > Collaborators and teams
*Thread Reply:* thanks Michael, seems like I don’t have enough permissions to see that
Sorry, I have a dr appointment today and won’t join the meeting
*Thread Reply:* I gotta skip too. Maciej and Pawel are at the Kafka Summit
*Thread Reply:* I hope you’re fine!
Should we cancel the sync today?
looking at XTable today, any thoughts on how we can collaborate with them?
*Thread Reply:* @Julien Le Dem @Willy Lulciuc this reminds me of some ideas we had a few years ago.. :)
*Thread Reply:* hmm.. ok. maybe not that relevant for us, at first I thought this was an abstraction for read/write on top of Iceberg/Hudi/Delta.. but I think this is more of a data sync appliance. would still be relevant for linking together synced datasets (but I don't think it's that important now)
*Thread Reply:* From the introduction https://www.confluent.io/blog/introducing-tableflow/, it looks like they are using Flink for both data ingestion and compaction. It means we should at least consider supporting Hudi sources and sinks for Flink lineage 🙂
A key growth metric trending in the right direction:
Eyes on this PR to add OpenMetadata to the Ecosystem page would be appreciated: https://github.com/OpenLineage/docs/pull/303. TIA! @Mariusz Górski
I really want to improve this page in the docs, anyone wants to work with me on that?
*Thread Reply:* perhaps also make this part of the PR process, so when we add support for something, we remember to update the docs
*Thread Reply:* I free up next week and would love to chat… obviously, time permitting but the page needs some love ❤️
*Thread Reply:* I can verify the information once you have some PR 🙂
RFR: a PR to add DataHub to the Ecosystem page https://github.com/OpenLineage/docs/pull/304
*Thread Reply:* The description comes from the very brief README in DataHub's GH repo and a glance at the code. No other documentation or resources appear to be available.
Dagster is launching column-lineage support for dbt using the sqlglot parser https://github.com/dagster-io/dagster/pull/20407
*Thread Reply:* I kinda like their approach of using post-hooks to enable column-level lineage: a custom macro collects information about columns and logs it, and they parse the log after the execution
*Thread Reply:* it doesn't force the `dbt docs generate` step that some might not want to use
*Thread Reply:* but at the same time it reuses the dbt adapter to make additional calls to retrieve missing metadata
@Paweł Leszczyński interesting project I came across over the weekend: https://github.com/HamaWhiteGG/flink-sql-lineage
*Thread Reply:* Wow, this is something we would love to have (flink SQL support). It's great to know that people around the globe are working on the same thing and heading same direction. Great finding @Willy Lulciuc. Thanks for sharing!
*Thread Reply:* At Kafka Summit I talked with Timo Walther from the Flink SQL team and he proposed an alternative approach.
Flink SQL has a stable (across releases) `CompiledPlan` JSON text representation that could be parsed and has all the necessary info - it is used for serializing the actual execution plan both ways.
*Thread Reply:* Since Flink SQL converts to transformations before execution, technically speaking our existing solution is already able to create lineage info for Flink SQL apps (not including column lineage and table schemas, which can be inferred within the Flink table environment). I will create a Flink SQL job for e2e testing purposes.
*Thread Reply:* I am also working on the Flink side for table lineage. Hopefully, the new lineage features can be released in Flink 1.20.
Sessions for this year's Data+AI Summit have been published. A search didn't turn up anything related to lineage, but did you know Julien and Willy's talk at last year's summit has received 4k+ views? 👀
*Thread Reply:* seems like our talk was not accepted, but I can see 9 sessions on unity catalog 😕
finally merged 🙂
pawel-big-lebowski commented on Nov 21, 2023
whoa
I’ll miss the sync today (on the way to data council)
*Thread Reply:* have fun at the conference!
OK @Maciej Obuchowski - 1 job has many stages; 1 stage has many tasks. Transitively, this means that 1 job has many tasks.
*Thread Reply:* batch or streaming one? 🙂
*Thread Reply:* Doesn't matter. It's the same concept.
Also @Paweł Leszczyński, it seems Spark metrics has this:
```
local-1711474020860.driver.LiveListenerBus.listenerProcessingTime.io.openlineage.spark.agent.OpenLineageSparkListener
            count = 12
        mean rate = 1.19 calls/second
    1-minute rate = 1.03 calls/second
    5-minute rate = 1.01 calls/second
   15-minute rate = 1.00 calls/second
              min = 0.00 milliseconds
              max = 1985.48 milliseconds
             mean = 226.81 milliseconds
           stddev = 549.12 milliseconds
           median = 4.93 milliseconds
             75% <= 53.64 milliseconds
             95% <= 1985.48 milliseconds
             98% <= 1985.48 milliseconds
             99% <= 1985.48 milliseconds
           99.9% <= 1985.48 milliseconds
```
Do you think Bipan's team could potentially benefit significantly from upgrading to the latest version of openlineage-spark? https://openlineage.slack.com/archives/C01CK9T7HKR/p1711483070147019
*Thread Reply:* @Paweł Leszczyński wdyt?
*Thread Reply:* I think the issue here is that marquez is not able to properly visualize parent run events that Maciej has added recently for a Spark application
*Thread Reply:* So if they downgraded would they have a graph closer to what they want?
*Thread Reply:* I don't see parent run events there?
I'm exploring ways to improve the demo gif in the Marquez README. An improved and up-to-date demo gif could also be used elsewhere -- in the Marquez landing pages, for example, and the OL docs. Along with other improvements to the landing pages, I created a new gif that's up to date and higher-resolution, but it's large (~20 MB).
• We could put it on YouTube and link to it, but that would downgrade the user experience in other ways.
• We could host it somewhere else, but that would mean adding another tool to the stack and, depending on file size limits, could cost money. (I can't imagine it would cost much, but I haven't really looked into this option yet. Regardless of cost, it seems to have the same drawbacks as YT from a UX perspective.)
• We could have GitHub host it in another repo (for free) in the Marquez or OL orgs.
◦ It could go in the OL docs because it's likely we'll want to use it in the docs anyway, but even if we never serve it from there, wouldn't this create issues for local development at a minimum? I opened a PR to do this, which a PR with other improvements is waiting on, but I'm not sure about this approach.
◦ It could go in the unused Marquez website repo, but there's a good chance we'll forget it's there and remove or archive the repo without moving it first.
◦ In another repo, or even a new one for stuff like this?
Anyone have an opinion or know of a better option?
*Thread Reply:* maybe make it a HTML5 video?
*Thread Reply:* https://wp-rocket.me/blog/replacing-animated-gifs-with-html5-video-for-faster-page-speed/
@Julien Le Dem @Harel Shein how did Data Council panel and talk go?
*Thread Reply:* Was just composing the message below :)
Some great discussions here at data council, the panel was really great and we can definitely feel energy around OpenLineage continuing to build up! 🚀 Thanks @Julien Le Dem for organizing and shoutout to @Ernie Ostic @Sheeri Cabral (Collibra) @Eric Veleker for taking the time and coming down here and keeping pushing more and building the community! ❤️
*Thread Reply:* @Harel Shein did anyone take pictures?
*Thread Reply:* there should be plenty of pictures from the conference organizers, we'll ask for some
*Thread Reply:* Did a search and didn't see anything
*Thread Reply:* Speaker dinner the night before: https://www.linkedin.com/posts/datacouncil-aidatacouncil-ugcPost-7178852429705224193-De46
*Thread Reply:* haha. Julien and Ernie look great while I'm explaining how to land an airplane 🛬
*Thread Reply:* The photo gallery is there
*Thread Reply:* awesome! just in time for the newsletter 🙂
*Thread Reply:* Thank you for thinking of us. Onwards and upwards.
I just find the naming conventions for hive/iceberg/hudi are not listed in the doc https://openlineage.io/docs/spec/naming/. Shall we further standardize them? Any suggestions?
*Thread Reply:* Yes. This also came up in a conversation with one of the maintainers of dbt-core. We can also pick up the proposal to extend the naming-conventions markdown into something a bit more scalable.
*Thread Reply:* What do you think about this proposal? https://github.com/OpenLineage/OpenLineage/pull/1702
*Thread Reply:* Thanks for sharing the info. Will take a deeper look later today.
*Thread Reply:* I think this is similar topic to resource naming in ODD, might be worth to take a look for inspiration: https://github.com/opendatadiscovery/oddrn-generator
*Thread Reply:* the thing is we need to have language-agnostic way of defining those naming conventions and be able to generate code for them, similar to facets spec
*Thread Reply:* it could also be an idea to have a micro REST API embedded in each client, so the naming-convention logic would live there and each client (Python/Java) could run it as a subprocess 🤔
*Thread Reply:* we can also just write it in Rust, @Maciej Obuchowski 😁
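One data-driven direction for the naming discussion above: keep the conventions as templates in a single spec file and generate (or interpret) them in every client, similar to how facets are generated from JSON Schema. A hedged Python sketch, with hypothetical templates that collapse OL's namespace + name into one string for brevity (the `hive` scheme here is made up for illustration):

```python
# Hypothetical data-driven naming spec: each source type maps to a
# template, so every client could derive the same names from one file.
NAMING = {
    "postgres": "postgres://{host}:{port}/{database}.{schema}.{table}",
    "hive": "hive://{metastore}/{warehouse}.{table}",  # illustrative, not a real OL scheme
}

def dataset_name(source: str, **parts) -> str:
    """Render a dataset name from the template for the given source type."""
    return NAMING[source].format(**parts)

print(dataset_name("postgres", host="db", port=5432,
                   database="food", schema="public", table="orders"))
```

Because the templates are plain data, the same file could drive codegen for the Java and Python clients without a subprocess or embedded API.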
*Thread Reply:* no real changes/additions, but starting to organize the doc for now: https://github.com/OpenLineage/OpenLineage/pull/2554
@Maciej Obuchowski we also heard some good things about the sqlglot parser. have you looked at it recently?
*Thread Reply:* I love the fact that our parser is in type safe language :)
*Thread Reply:* does it matter, after all, when it comes to parsing SQL? it might be worth running some comparisons, but it may turn out that sqlglot misses most of the Snowflake dialect that we currently support
*Thread Reply:* We'd miss on Java side parsing as well
*Thread Reply:* very importantly this ^
OpenLineage 1.11.0 release vote is now open: https://openlineage.slack.com/archives/C01CK9T7HKR/p1711980285409389
Sorry, I’ll be late to the sync
forgot to mention, but we have the TSC meeting coming up next week. we should start sourcing topics
*Thread Reply:* • 1.10 and 1.11 releases
• Data Council, Kafka Summit, & Boston meetup shout-outs and quick recaps
• Datadog POC update or demo?
*Thread Reply:* Discussion item about Trino integration next steps?
*Thread Reply:* Accenture+Confluent roundtable reminder for sure
*Thread Reply:* job to job dependencies discussion item? https://openlineage.slack.com/archives/C065PQ4TL8K/p1712153842519719
*Thread Reply:* I think it's too early for a Datadog update tbh, but I like the job-to-job discussion. We can also bring up the naming-library discussion that we talked about yesterday
one more thing, if we want we could also apply for a free Datadog account for OpenLineage and Marquez: https://www.datadoghq.com/partner/open-source/
*Thread Reply:* would be nice for tests
is there any notion of process dependencies in openlineage? i.e. if I have two airflow tasks that depend on each other, with no dataset in between, can I express that in the openlineage spec?
*Thread Reply:* AFAIK no, it doesn't aim to reflect that cc @Julien Le Dem
*Thread Reply:* It is not in the core spec but this could be represented as a job facet. It is probably in the airflow facet right now but we could add a more generic job dependency facet
*Thread Reply:* we do represent hierarchy though - with `ParentRunFacet`
*Thread Reply:* if we were to add some dependency facet, what would we want to model?
*Thread Reply:* do we also want to model something like Airflow's trigger rules? https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules
*Thread Reply:* I don't think this is about hierarchy though, right? If I understand @Julian LaNeve correctly, I think it's more #2
*Thread Reply:* yeah it's less about hierarchy - definitely more about #2.
assume we have a DAG that looks like this:
Task A -> Task B -> Task C
today, OL can capture the full set of dependencies if we do:
A -> (dataset 1) -> B -> (ds 2) -> C
but it's not always the case that you have datasets between everything. my question was more about "how can I use OL to capture the relationship between jobs if there are no datasets in between"
*Thread Reply:* I had opened an issue to track this a while ago but we did not get too far in the discussion: https://github.com/OpenLineage/OpenLineage/issues/552
*Thread Reply:* oh nice - unsurprisingly you were 2 years ahead of me 😆
*Thread Reply:* You can track the dependency both at the job level and at the run level.
At the job level you would do something along the lines of:
```
job: { facets: {
  job_dependencies: {
    predecessors: [
      { namespace: , name: }, ...
    ],
    successors: [
      { namespace: , name: }, ...
    ]
  }
}}
```
*Thread Reply:* At the run level you could track the actual task run dependencies:
```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ...],
    successors: [...],
  }
}}
```
*Thread Reply:* I think the current airflow run facet contains that information in an airflow specific representation: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/plugins/facets.py
*Thread Reply:* I think we should have the discussion in the ticket so that it does not get lost in the slack history
*Thread Reply:*
```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ...],
    successors: [...],
  }
}}
```
I like this format, but would have the full run/job identifier, as in `ParentRunFacet`
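To make that suggestion concrete, a hypothetical sketch (a plain Python dict, not a spec proposal) of a run-dependencies facet where each entry carries both the run id and the job identity, in the style of `ParentRunFacet`:

```python
# Hypothetical "run_dependencies" facet shaped like ParentRunFacet entries:
# each predecessor/successor names both the run and the job, so a consumer
# can draw the job-to-job edge even when no dataset sits in between.
run_dependencies = {
    "predecessors": [
        {
            "run": {"runId": "0176a8c2-fe01-7439-87e6-56a1a1b4029f"},
            "job": {"namespace": "my-airflow-instance", "name": "dag.task_a"},
        }
    ],
    "successors": [],
}

# every dependency entry must identify both the run and the job
for dep in run_dependencies["predecessors"]:
    assert {"run", "job"} <= dep.keys()
print(len(run_dependencies["predecessors"]))
```

Carrying the job identity avoids forcing consumers to resolve a bare run uuid back to a job before they can draw the edge.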
*Thread Reply:* For the trigger rules I wonder if this is too specific to airflow.
*Thread Reply:* But if there’s a generic way to capture this, it makes sense
Don't forget to register for this! https://events.confluent.io/roundtable-data-lineage/Accenture
This attempt at a SQLAlchemy integration was basically working, if not perfectly, the last time I played with it: https://github.com/OpenLineage/OpenLineage/pull/2088. What more do I need to do to get it to the point where it can be merged as an "experimental"/"we warned you" integration? I mean, other than making sure it's still working and cleaning it up? 🙂
https://docs.getdbt.com/docs/collaborate/column-level-lineage#sql-parsing
*Thread Reply:* seems like it’s only for dbt cloud
*Thread Reply:* > Column-level lineage relies on SQL parsing. Was thinking about doing the same thing at some point
*Thread Reply:* Basically, with dbt we know the schemas, so we can resolve wildcards as well
*Thread Reply:* but that requires adding capability for providing known schema into sqlparser
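A toy Python sketch of what schema-aware wildcard resolution could enable (illustrative only; the real work would happen inside the Rust SQL parser): given known table schemas, a `SELECT *` expands into explicit column-lineage pairs.

```python
def expand_star(table: str, schemas: dict) -> list:
    """Resolve a SELECT * against known schemas into
    (source_column, target_column) lineage pairs.
    Toy sketch of the idea, not the parser API."""
    return [(f"{table}.{col}", col) for col in schemas[table]]

# schemas as dbt's catalog.json would provide them (simplified)
schemas = {"orders": ["id", "customer_id", "amount"]}
print(expand_star("orders", schemas))
```

Without the schema the parser can only report "some columns of `orders`"; with it, each output column gets a concrete source column.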
*Thread Reply:* that's not very hard to add afaik 🙂
*Thread Reply:* not exactly into sqlparser too
*Thread Reply:* just our parser
*Thread Reply:* yeah, our parser
*Thread Reply:* still someone has to add it :D
*Thread Reply:* some rust enthusiast probably
*Thread Reply:* but also: dbt provides schema info only if you generate catalog.json with generate docs command
*Thread Reply:* Right now we have the dbt-ol wrapper anyway, so we can run another dbt docs command on behalf of the user too
*Thread Reply:* not sure if running commands on behalf of the user is a good idea, but noting in the docs that running it increases the accuracy of column-level lineage is probably a good idea
*Thread Reply:* once we build it
*Thread Reply:* That depends, what are the side effects of running dbt docs?
*Thread Reply:* the other option is similar to dagster's approach - run post-hook macro that prints schema to logs and read the logs with dbt-ol wrapper
*Thread Reply:* which again won't work in dbt cloud - there catalog.json seems like the only option
*Thread Reply:* > That depends, what are the side effects of running dbt docs? refreshing someone's documentation? 🙂
*Thread Reply:* it would be configurable imho; if someone doesn't want column-level lineage at the price of an additional step, it's their choice
*Thread Reply:* yup, agreed. I'm sure we can also run dbt docs to a temp directory that we'll delete right after
Some encouraging stats from Sonatype: these are Spark integration downloads (unique IPs) over the last 12 months
*Thread Reply:* That's an increase of 17560.5%
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/releases/tag/1.11.3 that’s a lot of notes 😮
Marquez committers: there's a committer vote open 👀
did anyone submit a CFP here? https://sessionize.com/open-source-summit-europe-2024/ it's a linux foundation conference too
*Thread Reply:* looks like a nice conference
*Thread Reply:* too far for me, but might be a train ride for you?
*Thread Reply:* yeah, I might submit something 🙂
*Thread Reply:* and I think there are actually direct trains to Vienna from Warsaw
Hmm @Maciej Obuchowski @Paweł Leszczyński - I see we released 1.11.3, but I don't see the artifacts in central. Are the artifacts blocked?
*Thread Reply:* after last release, it took me some 24h to see openlineage-flink artifact published
*Thread Reply:* I recall something about the artifacts had to be manually published from the staging area.
*Thread Reply:* @Maciej Obuchowski - can you check if the release is stuck in staging?
*Thread Reply:* I recall last time it failed because there wasn't a javadoc associated with it
*Thread Reply:* Nevermind @Paweł Leszczyński @Maciej Obuchowski - it seems like the search indexes haven't been updated.
*Thread Reply:* @Michael Robinson has to manually promote them but it's not instantaneous I believe
I'm seeing some really strange behavior with OL Spark, I'm going to give some data to help out, but these are still breadcrumbs unfortunately. 🧵
*Thread Reply:* the driver for this job is running for more than 5 hours, but the job actually finished after 20 minutes
*Thread Reply:* most of the CPU time in those 5 hours is spent in OpenLineage methods
*Thread Reply:* it's also not reproducible 😕
*Thread Reply:* `DatasetIdentifier.equals`?
*Thread Reply:* can you check what calls it?
*Thread Reply:* unfortunately, some of the stack frames are truncated by JVM
*Thread Reply:* maybe this has something to do with SymLink and the lombok implementation of .equals() ?
*Thread Reply:* and then some sort of circular dependency
*Thread Reply:* one possible place, looks like n^2 algorithm: https://github.com/OpenLineage/OpenLineage/blob/4ba93747e862e333267b46a57f02a09264[…]rk3/agent/lifecycle/plan/column/JdbcColumnLineageCollector.java
*Thread Reply:* but is this a JDBC job?
*Thread Reply:* ok, we don't use lang3 Pair a lot - it has to be in ColumnLevelLineageBuilder 🙂
*Thread Reply:* yes.. I'm staring at that class for a while now
*Thread Reply:* what's the rough size of the logical plan of the job?
*Thread Reply:* I'm trying to understand whether we're looking at some infinite loop
*Thread Reply:* or just something done very ineffiently
*Thread Reply:* like every input being added in this manner:
```
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  inputs.computeIfAbsent(exprId, k -> new LinkedList<>());

  Pair<DatasetIdentifier, String> input = Pair.of(datasetIdentifier, attributeName);
  if (!inputs.get(exprId).contains(input)) {
    inputs.get(exprId).add(input);
  }
}
```
it's a candidate: it has to traverse the list returned from `inputs` for every CLL dependency field added
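To make the cost concrete, a small Python sketch (not the integration code) mirroring that `addInput` dedup: the `contains()` call scans the whole `LinkedList` on every insert, while a set gives the same dedup with constant-time membership checks.

```python
from collections import defaultdict

inputs_list = defaultdict(list)   # mirrors the LinkedList-per-ExprId approach
inputs_set = defaultdict(set)     # the cheap alternative

def add_input_list(expr_id, dataset, attr):
    pair = (dataset, attr)
    if pair not in inputs_list[expr_id]:   # linear scan on every insert
        inputs_list[expr_id].append(pair)

def add_input_set(expr_id, dataset, attr):
    inputs_set[expr_id].add((dataset, attr))  # hashed O(1) membership

# 1000 inserts with 500 distinct pairs: both end up with 500 entries,
# but the list version does ~n/2 comparisons per insert
for i in range(1000):
    add_input_list("e1", "db.table", f"col{i % 500}")
    add_input_set("e1", "db.table", f"col{i % 500}")

print(len(inputs_list["e1"]), len(inputs_set["e1"]))
```

In Java, a `LinkedHashSet` would keep insertion order while still giving constant-time `contains`.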
*Thread Reply:* it looks like we're building a size-N list in N^2 time:
```
inputs.stream()
    .filter(i -> i instanceof InputDatasetFieldWithIdentifier)
    .map(i -> (InputDatasetFieldWithIdentifier) i)
    .forEach(
        i ->
            context
                .getBuilder()
                .addInput(
                    ExprId.apply(i.exprId().exprId()),
                    new DatasetIdentifier(
                        i.datasetIdentifier().getNamespace(), i.datasetIdentifier().getName()),
                    i.field()));
```
🙂
*Thread Reply:* ah, this isn't even used now since it's for new extension-based spark collection
*Thread Reply:* @Paweł Leszczyński this is most likely a future bug ⬆️
*Thread Reply:* I think we're still doing it now anyway:
```
private static void extractInternalInputs(
    LogicalPlan node,
    ColumnLevelLineageBuilder builder,
    List<DatasetIdentifier> datasetIdentifiers) {

  datasetIdentifiers.stream()
      .forEach(
          di -> {
            ScalaConversionUtils.fromSeq(node.output()).stream()
                .filter(attr -> attr instanceof AttributeReference)
                .map(attr -> (AttributeReference) attr)
                .collect(Collectors.toList())
                .forEach(attr -> builder.addInput(attr.exprId(), di, attr.name()));
          });
}
```
*Thread Reply:* and that's linked list - must be pretty slow jumping all those pointers
*Thread Reply:* maybe it's that simple 🙂 https://github.com/OpenLineage/OpenLineage/commit/306778769ae10fa190f3fd0eff7a6482fc50f57f
*Thread Reply:* There are some more funny places in the CLL code, like we're iterating over the list of schema fields and calling a function with the name of each field:
```schema.getFields().stream()
    .map(field -> Pair.of(field, getInputsUsedFor(field.getName())))```
then immediately iterating a second time to get the field back from its name:
```List<Pair<DatasetIdentifier, String>> getInputsUsedFor(String outputName) {
  Optional<OpenLineage.SchemaDatasetFacetFields> outputField =
      schema.getFields().stream()
          .filter(field -> field.getName().equalsIgnoreCase(outputName))
          .findAny();```
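*Thread Reply:* a possible shape for the fix (hypothetical sketch, not the actual code): build a case-insensitive name-to-field index once, so each lookup is O(1) instead of a full `filter(...).findAny()` scan per output field:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: precompute a lowercase-name -> field map once,
// instead of re-scanning the schema for every output field name (O(n^2)).
public class SchemaIndex {
    private final Map<String, String> byLowerName;

    public SchemaIndex(List<String> fieldNames) {
        byLowerName = fieldNames.stream()
            .collect(Collectors.toMap(
                n -> n.toLowerCase(Locale.ROOT),
                n -> n,
                (a, b) -> a)); // keep the first on case-insensitive duplicates
    }

    // O(1) lookup replacing the equalsIgnoreCase scan
    public Optional<String> find(String outputName) {
        return Optional.ofNullable(byLowerName.get(outputName.toLowerCase(Locale.ROOT)));
    }
}
```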
*Thread Reply:* I think the time spent by the driver (5 hours) just on these methods smells like an infinite loop?
*Thread Reply:* like, as inefficient as it may be, this is a lot of time
*Thread Reply:* did it finish eventually?
*Thread Reply:* yes... but.. I wonder if something killed it somewhere?
*Thread Reply:* I mean, it can be something like 10000^3 loop 🙂
*Thread Reply:* I couldn't find anything in the logs to indicate
*Thread Reply:* and it has to do those pair comparisons
*Thread Reply:* would be easier if we could see the general size of a plan of this job - if it's something really small then I'm probably wrong
*Thread Reply:* but if there are 1000s of columns... anything can happen 🙂
*Thread Reply:* yeah.. trying to find out. I don't have that facet enabled there, and I can't find the ol events in the logs (it's writing to console, and I think they got dropped)
*Thread Reply:* DevNullTransport 🙂
*Thread Reply:* I think this might be potentially really slow too https://github.com/OpenLineage/OpenLineage/blob/50afacdf731f810354be0880c5f1fd05a1[…]park/agent/lifecycle/plan/column/ColumnLevelLineageBuilder.java
*Thread Reply:* generally speaking, we have a similar problem here like we had with Airflow integration
*Thread Reply:* we are not holding up the job per se, but... we are holding up the spark application
*Thread Reply:* do we have a way to be defensive about that somehow, shutdown hook from spark to our thread or something
*Thread Reply:* there's no magic
*Thread Reply:* circuit breaker with timeout does not work?
*Thread Reply:* it would, but we don't turn that on by default
*Thread Reply:* also, if we do, what should be our default values?
*Thread Reply:* what would not hurt you if you enabled it, 30 seconds?
*Thread Reply:* I guess we should aim much lower with the runtime
*Thread Reply:* yeah, and make sure we emit metrics / logs when that happens
*Thread Reply:* wait, our circuit breaker right now only supports cpu & memory
*Thread Reply:* we would need to add a timeout one, right?
*Thread Reply:* we've talked about it but it's not implemented yet https://github.com/OpenLineage/OpenLineage/blob/3dad978a3a76ea9bb709334f1526086f95[…]o/openlineage/client/circuitBreaker/ExecutorCircuitBreaker.java
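*Thread Reply:* roughly what a timeout breaker could look like (hypothetical sketch, not the OpenLineage `ExecutorCircuitBreaker` API): run the potentially slow lineage work on a daemon executor and give up after a deadline, so the integration can never hold up the Spark application indefinitely:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.*;

// Hypothetical sketch of a timeout-based circuit breaker for lineage extraction.
public class TimeoutBreaker {
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "lineage-worker");
        t.setDaemon(true); // never block JVM shutdown
        return t;
    });

    public <T> Optional<T> runWithTimeout(Callable<T> task, Duration timeout) {
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; emit a metric/log here
            return Optional.empty();
        } catch (Exception e) {
            return Optional.empty();
        }
    }
}
```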
*Thread Reply:* and BTW, no abnormal CPU or memory usage?
*Thread Reply:* I mean, it's using 100% of one core 🙂
*Thread Reply:* it's similar to what aniruth experienced. there's something that for some type of logical plans causes recursion alike behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
*Thread Reply:* > If we had DebugFacet we would be able to know which logical plan nodes are involved in this. if the event would not take 1GB 🙂
*Thread Reply:* > it's similar to what aniruth experienced. there's something that for some type of logical plans causes recursion alike behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this. (edited) what about my thesis that something is just extremely slow in column-level lineage code?
Some adoption metrics from Sonatype and PyPI, visualized using Preset. In Preset, you can see the number for each month (but we're out of seats on the free tier there). The big number is the downloads for the last month (February in most cases).
Good news. @Paweł Leszczyński - the memory leak fixes worked. Our streaming pipelines have run through the weekend without a single OOM crash.
*Thread Reply:* @Damien Hawes Would you please point me the PR that fixes the issue?
*Thread Reply:* This was the issue: https://github.com/OpenLineage/OpenLineage/issues/2561
There were two PRs:
*Thread Reply:* @Peter Huang ^
*Thread Reply:* @Damien Hawes any other feedback for OL with streaming pipelines you have so far?
*Thread Reply:* It generates a TON of data
*Thread Reply:* There are some optimisations that could be made: with streaming, the
`job start -> stage submitted -> task started -> task ended -> stage complete -> job end`
cycle fires more frequently.
*Thread Reply:* This has an impact on any backend using it, as the run id keeps changing. This means the parent suddenly has thousands of jobs as children.
*Thread Reply:* Our biggest pipeline generates a new event cycle every 2 minutes.
*Thread Reply:* "Too much data" is exactly what I thought 🙂 The obvious potential issue with caching is the same issue we just fixed... potential memory leaks, and cache invalidation
*Thread Reply:* > the run id keeps changing
In this case, that's a bug. We'd still need some wrapping event for the whole streaming job though, probably other than application start
*Thread Reply:* on the other topic, did those problems stop? https://github.com/OpenLineage/OpenLineage/issues/2513 with https://github.com/OpenLineage/OpenLineage/pull/2535/files
when talking about the naming scheme for datasets, would everyone here agree that we generally use: `{scheme}://{authority}/{unique_name}`? where generally `authority` == namespace
*Thread Reply:* I think so, and if we don’t then we should
*Thread Reply:* ~which brings me to the question why construct dataset name as such~ nvm
*Thread Reply:* please feel free to chime in here too https://github.com/dbt-labs/dbt-core/issues/8725
*Thread Reply:* > where generally `authority` == namespace
(edited)
`{scheme}://{authority}` is namespace
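*Thread Reply:* a small illustration of that convention (hypothetical helper, not an actual OpenLineage class): splitting a dataset URI into `{scheme}://{authority}` as the namespace and the path as the unique name:

```java
import java.net.URI;

// Hypothetical sketch of the naming convention discussed above:
// namespace = {scheme}://{authority}, name = {unique_name}.
public class DatasetNaming {
    public record Id(String namespace, String name) {}

    public static Id parse(String datasetUri) {
        URI uri = URI.create(datasetUri);
        String namespace = uri.getScheme() + "://" + uri.getAuthority();
        // strip the leading "/" from the path to get the unique name
        String name = uri.getPath().replaceFirst("^/", "");
        return new Id(namespace, name);
    }
}
```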