Uniffle Release 0.9.0

Highlight

Introduce dashboard.
Introduce rust-based shuffle server.
Add support for Spark 3.5.
The Netty data transport mode is now production-ready.
Reduce block id layout limitations and simplify layout configuration for Spark.

ChangeLog

[#1751][0.9] improvement: support gluten (#1753)
[#1764] fix(client): Fix timeout time unit for unregister requests (#1766)
[#1149] fix: GC logs in JDK 11 do not include date and time stamps. (#1240)
[#1675][FOLLOWUP] fix(test): Fix various flaky tests (#1730)
[MINOR] fix: Update outdated config: rss.writer.send.check.timeout -> rss.client.send.check.timeout.ms (#1734)
[#1721] fix(coordinator): ClassCastException of boolean->String with yaml style remote client conf (#1722)
[#1673] fix(K8S): Fix the deployment of stable version K8S cluster (#1694)
[#1675][FOLLOWUP] fix(test): Fix flaky tests which may cause port conflicts (#1696)
[MINOR] fix(typo): Correct the removeShuffle method name (#1697)
[MINOR] docs: modify the default value of rss.coordinator.select.partition.strategy in docs (#1692)
[#1680] improvement(server): Remove partial HDFS files that written by server self for expired apps (#1681)
[#1675] fix(test): Fix tests which may be flaky on different machines (#1676)
[#1684] fix(server): use the diskSize obtained from periodic check to determine whether is writable (#1685)
[#1678] fix(server): disk size leak on removing resources by AppPurgeEvent (#1679) (#1689)
[#1657] build: Add license information after version 0.9.0 (#1671)
[MINOR] chore(rust): disable flaky test of local_store_test (#1674)
[#1459][FOLLOWUP] fix(server): Fix the issue of log variable printing (#1672)
[#1459][FOLLOWUP] improvement(server): Print an error log when an event is dropped (#1643)
[#1341] fix(mr): Fix MR Combiner ArrayIndexOutOfBoundsException Bug. (#1666)
[#378][FOLLOWUP] fix(server): Fix huge_partition_num metric (#1669)
[#1662] fix(test): Fix Netty related flaky tests (#1663)
[#1629] fix(operator): Support parsing NaN float value in metrics (#1630)
[#1634] fix(server): remove app folder if app is expired (#1635)
[MINOR] chore(rust): disable flaky test of test_ticket_manager (#1637)
[#1596][FOLLOWUP] fix(netty): Send failed responses only when the channel is writable (#1641)
[#1626] fix(server): Remove the meaningless eventOfUnderStorageManagers cache (#1627)
[#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
[#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign (#1612)
[#1608][part-2] fix(spark): avoid releasing block in advance when enable block resend (#1610)
[#1606] feat(client): Add client retry mechanism for NO_BUFFER when reading data(memory/local/index) (#1616)
[#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage (#1609)
[#1373][FOLLOWUP] fix(spark): shuffle manager rpc service invalid when partition data reassign is enabled (#1583)
[#1596] fix(netty): Use a ChannelFutureListener callback mechanism to release readMemory (#1605)
[#1598] fix(server): Fix inaccurate used_direct_memory_size metric (#1599)
[#1472][FOLLOWUP] improvement(server): Release memory more accurately when failing to cache shuffle data (#1597)
[MINOR] refactor: Calling lock() method outside try block to avoid unnecessary errors (#1590)
[#1591] feat(spark): Support Spark 3.5.1 (#1592)
[#1586] improvement(netty): Allow Netty Worker thread pool size to dynamically adapt to the number of processor cores (#1587)
[#1588] improvement(server): Add exception handling for the thread pool when flushing events (#1589)
[#1576] feat(doc): server deploy guide without hadoop-home env (#1577)
[#1571] fix(server): Memory may leak when EventInvalidException occurs (#1574)
[#1373][FOLLOWUP] fix(spark): incorrect partition id type (#1582)
[#1373][FOLLOWUP] fix(spark3): Add client type when request shuffle assignment (#1580)
build(deps): bump google.golang.org/protobuf from 1.28.0 to 1.33.0 (#1575)
[#1554] feat(spark): Fetch dynamic client conf as early as possible (#1557)
[#1572] fix(spark): Exceptions might be discarded when spilling buffers (#1573)
[#1564] fix(server): disk health check invalid when hang (#1568)
[#731][FOLLOWUP] feat(Spark): Configure blockIdLayout for Spark based on max partitions (#1566)
[#1567] fix(spark): Let Spark use its own NettyUtils (#1565)
[#1569] fix(rust): flaky test for test_ticket_manager (#1570)
[MINOR] improvement(test): A better computation logic for WriteAndReadMetricsTest without using reflection (#1563)
[#731] feat(spark): Make blockid layout configurable for Spark clients (#1528)
[#808] improvement(spark): Verify the number of written records to ensure data correctness (#1558)
[MINOR] improvement(client): Override getClientInfo method in ShuffleServerGrpcNettyClient and remove unused getDesc method (#1559)
[#1552] improvement: Migrate from log4j1 to log4j2 (#1553)
[#1472][part-6] FOLLOWUP: Fix Netty transport time when sending shuffle data requests (#1551)
[#134][FOLLOWUP] improvement(spark2): Use taskId and attemptNo as taskAttemptId (#1544)
[#1549] fix(common): Uniformly throw RssException for external callers (#1550)
[MINOR] test: Use sensible partition ids in ShuffleReadClientImplTest (#1545)
[#1546] fix(spark): NPE could happen before uncompressing after #1360 (#1547)
feat(docker): Add example docker compose Uniffle/Spark cluster (#1532)
[#1472][part-6] fix(netty): Make UTs truly test Netty mode (#1540)
[MINOR] improvement(tez): Only invoking LOG.debug when LOG.isDebugEnabled is true (#1541)
[#1459] fix(server): Memory leak for exceptional scenarios when flushing events (#1537)
[#1472] fix(client): IllegalReferenceCountException for clientReadHandler.readShuffleData (#1536)
[#1472][part-5] Use UnpooledByteBufAllocator to fix inaccurate usedMemory issue causing OOM (#1534)
[MINOR] refactor(common): Move blockId bit logic into common class (#1527)
[#1373][part-1] feat(spark): partition write to multi servers leveraging from reassignment mechanism (#1445)
[MINOR] Update dashboard pom.xml to take arguments for node and npm download locations (#1530)
[#1316] improvement(spark): detect OutputTracker API version via Spark version (#1317)
[#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId (#1529)
[MINOR] feat(build): Allow to build distribution without some modules (#1525)
[#1407] fix(rust): use grpc runtime worker threads and adjust default runtime config (#1517)
[#1407] feat(rust): fix + add total grpc request metrics (#1516)
[#1407] chore(rust): add cpu profile doc (#1515)
[#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it (#1521)
[MINOR] fix(CI): Improve dashboard across the CI (#1526)
[#1472][part-3] fix(client): Fix occasional IllegalReferenceCountException issues in extremely rare scenarios (#1522)
[MINOR] fix(pom): Add missing shuffle-server dependencies to work with -Ptez
[#1472][part-4] feature(server): Add metrics for Netty's pinnedDirectMemory and usedDirectMemory (#1524)
[#1472][part-1] fix(server): Upgrade Netty and GRPC (#1520)
[MINOR] fix(deploy): Fix invocation of kubernetes bash scripts (#1513)
[#1476] feat(rust): Provide dedicated unregister app rpc interface (#1511)
[#1476] feat(spark): Provide dedicated unregister app rpc interface (#1510)
[MINOR] improvement(CI): Rework build and rust workflow events (#1508)
[#1407] fix(rust): drop events and release memory when errors happened (#1509)
[#1267][FOLLOWUP] improvement(client): INFO log level should be used in RetryUtils (#1500)
[MINOR] feat(CI): Report test results in github comments (#1506)
[#1407] fix(rust): return error when getting data from hdfs by client (#1507)
[#1501] fix(server): storage selection cache accidentally deleted when clearing stage level data. (#1505)
[#1407] fix(rust): dont panic when no available local disks (#1504)
fix(rust): avoid checking storage type in runtime (#1503)
[MINOR] build: Move dashboard module into profile and disable it by default (#1498)
[#1497] improvement(spark): flushing buffer if the memoryUsed of the first record of WriterBuffer larger than bufferSize (#1485)
[MINOR] improvement(test): Identify duplicate blocks in TestUtils.validateResult (#1495)
[MINOR] fix: Get and increment ATOMIC_LONG in that order everywhere (#1496)
[MINOR] docs: Improve comment on blockId structure (#1492)
[MINOR] fix(server): Assert actual number of bitmaps matches bitNum (#1493)
[#1490] improvement(spark3): Disable dynamic allocation shuffle tracking by default (#1491)
[#1407] feat(rust): support more metrics about disk and topN data size (#1488)
[#1407] feat(rust): support multiple spill policies and simplify hdfs config (#1487)
[#1356] feat(server): improve expired buffers metric and log (#1469)
[#1464][FOLLOWUP] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1473)
[#1467] feat(server): introduce total hdfs write data size for huge partition (#1468)
[#1355] fix(client): Netty client will leak when decoding responses (#1455)
[#1462] fix(server): Memory may leak when flushQueue is full (#1463)
[#1466] feat(server): introduce the JvmPauseMonitor to detect the gc pause (#1470)
[#1459] improvement(server): refactor DefaultFlushEventHandler and support event retry into pending queue (#1461)
[#1464] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1465)
[#1456] improvement(client): Better exception handling when calling requireBuffer using GRPC (#1457)
[#1428] fix(server): fallback invalid when local storage can't write (#1429)
[#1453] improvement: Force to use the UNIX line ending when using spotless-maven-plugin (#1454)
[#1447] feat(client): Introduce configurations to control default behavior of RPC client (#1448)
[#1267] improvement(client): throw the detailed stacktrace when exceptions happened (#1411)
[#1189][FOLLOWUP] fix(server): Start NettyDirectMemoryTracker. (#1432)
[#333] feat(server): expose metrics of TopN app bytes in one shuffle server (#1400)
[#1433] fix(server): Race conditions with ShuffleServer state (#1434)
[MINOR] refactor: avoid unnecessary bitmap clone and AND (#1442)
[#532] fix: spotBugs of SC_START_IN_CTOR (#1440)
[#1435] improvement: Improve log4j settings to avoid annoying messages (#1436)
[MINOR] refactor: Avoid unnecessary recursion (#1441)
[#1407] feat(rust): refactor localfile store to speed up writing (#1422)
[#1416] feat(spark): support custom hadoop config in client side (#1417)
[#1119] improvement(client): Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION (#1425)
[#974] fix(coordinator): Dynamic remote storage conf invalid for LegacyClientConfParser (#1424)
[#1420] fix(client): reportShuffleWriteFailure failed because of IndexOutOfBoundsException (#1421)
[#1356] improvement: add metric of total expired pre-allocated buffers (#1412)
[#1414] feat(rust): introduce native hdfs client (#1415)
[#1024] improvement(tez): Optimize user switch to shuffle mode local/remote. (#1397)
[#1403] fix(client): RSS client configurations are not working. (#1404)
[#1409] fix(client): Netty Epoll is unavailable for the RSS Client. (#1410)
[#1407] improvement(rust): Critical bug fix of getting blockIds and some optimization (#1408)
[#825][FOLLOWUP] fix(spark): Fix without returning an exception. (#1402)
[#1385] improvement: Improve log4j appender layout pattern (#1386)
[#851] improvement: Add a similar util method like ThreadUtils.parmap in the Spark (#1396)
[#363] improvement(server): Make the coordinator client managed by CoordinatorClientFactory singleton (#1377)
[#1391] fix(server): Direct memory may leak in exceptional scenarios in shuffle server. (#1392)
[#1157] fix(tez): Container not exit because shuffle client is not closed
[#460] improvement: Exit on OutOfMemoryError (#1390)
[#1387] improvement: compatibility with jdk8 when call JavaUtils.newConcurrentMap (#1389)
[#1369] feat: Provide distribution with Hadoop dependencies (#1379)
[#1383][DOCS] Improve Netty's documentation (#1384)
[#1358] fix(spark): pre-check bytebuffer whether is direct before uncompress (#1360)
[#1364] feat(client): introduce option to control whether to use local hadoop conf (#1370)
[MINOR] chore(client): fix the incorrect partitionId (#1376)
[#1189] feat(server): Add netty used direct memory size metric (#1363)
[#960] fix(dashboard): simplify dependency and correct the startup script (#1347)
[#1348] improvement(metrics): Unify tags generation for shuffle-server metrics reporter (#1349)
[MINOR] chore: fix kubernetes ci pipeline (#1368)
[MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle (#1367)
[#960][part-4] feat(dashboard): Fix some display bugs and optimize the display format. (#1326)
[#1267] fix(client): fast fail without retry when oom occurs (#1344)
[#1361] feat(netty): add netty metrics into reporter (#1362)
[#1335] fix(server)(netty): release bytebuf explicitly when requiredId is expired or cache failed (#1357)
[MINOR] chore(client): Specify name for data transfer thread pool (#1353)
[#1319] fix(server): Add shaded com.google.guava:failureaccess dependency to prevent NoClassDefFoundError (#1352)
[MINOR] improvement: use mvn wrapper in CI builds. (#1351)
[#1191][FOLLOWUP] improvement(conf): use the unified name for hybrid storage in conf (#1350)
[#960][FOLLOWUP] fix(dashboard): Fix get_pid_file_name function for the dashboard. (#1346)
[MINOR] improvement: use mvn wrapper for builds (#1345)
[#901] feat(server): respect disk capacity watermark rather than uniffle capacity (#1337)
[#1342] improvement(server): dump appId when clearing resource fails (#1343)
[#1110] improvement(coordinator): introduce pluggable remote storage config format (#1329)
[#1330] improvement: optimize tips for checking replica settings (#1334)
[#1187] feat(netty): Netty Encoder Support zero-copy. (#1313)
[#960][part-3] feat(dashboard): Provides a start-stop script for the dashboard. (#1056)
[#1308] improvement(rust): detect whether data has been purged in UT (#1323)
[#1213] feat(rust): Support block filter by taskId when getting memory data (#1311)
[#1290] improvement(operator): Avoid accidentally deleting data of other services when misconfiguring the mounting directory (#1291)
[MINOR] fix: flaky test ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
[MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
[#960][part-2] feat(dashboard): Add a dashboard front-end module. (#1055)
[#825][part-7] feat(spark): Write Stage resubmit and dynamic shuffle server assign integration tests. (#1148)
[#1300] feat(mr): Support combine operation in map stage for mr engine. (#1301)
[#1309] fix(spark): WriteBufferManager in Spark2 does not use a reassigned shuffle server. (#1310)
[#1307] feat(rust): make each thread listen the socket to improve throughput in tonic (#1306)
[#960][part-1] feat(dashboard): Add some dashboard interfaces. (#1053)
[#825][part-6] feat(spark): Added logic that failed to send ShuffleServer. (#1147)
[#1293] feat(rust): Add total_read_data metric (#1298)
[#1094] docs: split client_guide.md (#1299)
[#1221] feat(rust): Support grpc server graceful shutdown (#1292)
[#1294] feat(rust): introduce the unified grpc latency metrics for all requests (#1295)
[#1296] improvement(rust): use std.sync.lock to replace tokio lock for better performance (#1216)
[#825][part-5] feat(spark): Adds the RPC interface to reassign the ShuffleServer list. (#1146)
[MINOR] docs: update jar name for spark client (#1289)
[MINOR] chore: add scripts for publishing tarballs to svn (#1284)
[#1286] improvement(server): Add RemoveResourceTime Metric (#1288)
[#1271] improvement(server): change transportTime and processTime summary to Thread Pool Instead of block (#1272)
[#1269] fix(tez): uniqueMapId may be not unique when more than one fetcher are working. (#1270)
[#1246] feat(tez): Support remote spill for unordered input. (#1250)
[#825][part-4] feat(spark): Report write failures to ShuffleManager. (#1258)
[MINOR] fix: missing to build spark shaded modules (#1282)
[#1275] chore: add scripts for publishing maven releases (#1281)
[#1274] feat: add shaded module for spark2 client (#1280)
[#1273] feat: add shaded module for spark3 client (#1279)
[#825][part-3] feat(spark): Get the ShuffleServer corresponding to the partition from ShuffleManager. (#1141)
[#1277] chore: add flatten maven plugin (#1278)
[#1252] fix(server): Incorrect storage write fail metric (#1253)
[#825][FOLLOWUP] fix(spark): Apply a thread safety way to track the blocks sending result (#1260)
[#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
[#1261] fix(spark): Throw out InterruptedException for sleep in requestExecutorMemory (#1262)
[#1256] refactor: optimize collections construction (#1257)
[#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
[#825][part-2] feat(spark): Report failed blocks and a list of ShuffleServer. (#1138)
[#244][FOLLOWUP] test: CoordinatorGrpcTest.rpcMetricsTest. (#1251)
[#1231] feat(tez): Support remote spill in merge stage. (#1245)
[#1243] fix(test): Fix the flaky test SparkSQLTest and RepartitionTest (#1244)
[#1089] feat(spark): Add dynamic allocation patch for Spark 2.3 (#1242)
[#1237] feat(rust): support populating args by clap (#1236)
[#1088] feat(spark): Add dynamic allocation patch for Spark 3.0 (#1241)
[#1234] improvement(rust): separate runtimes for different overload (#1233)
[#1090] refactor: Refactor the reader code with builder pattern (#1232)
[#1219] fix(test): Fix the flaky test WriteAndReadMetricsTest (#1235)
[#1206] chore(rust): ignore generated proto code in git (#1229)
[#1091] refactor: Refactor the writer code with builder pattern (#1228)
[MINOR] Fix Kubernetes CI pipeline (#1227)
[#802] feat(spark): Implement ShuffleDataIo (#1226)
[#825][part-1] feat(spark): Add the RPC interface for reassigning ShuffleServer (#1137)
[#1085] feat(spark): Add dynamic allocation patch for Spark 3.4 (#1225)
[#1201] improvement: only invoking LOG.debug when LOG.isDebugEnabled() is true (#1217)
[#1084] feat: Add dynamic allocation patch for Spark 3.3 (#1224)
[#1083] feat(spark): Support Spark 3.5 (#1223)
[#1211] fix(server): unexpectedly removing resources when app has re-registered shuffle later (#1212)
[#1206] chore(rust): remove the auto-generated proto code (#1218)
[#1209] improvement(server): Speed up cleanupStorageSelectionCache method in LocalStorageManager. (#1210)
[#1206][part-2] feat(rust): introduce rust based shuffle-server (#1208)
[#1206][part-1] feat(rust): create folder for rust-based shuffle server (#1207)
[#1204] chores(ci): Fix the ci pipeline of Kubernetes (#1205)
[#1202] improvement: Add HealthScriptChecker for execute special health check shell script (#1203)
[#1198] improvement: zerocopy from Protobuf's ByteString to Netty's ByteBuf (#1199)
[#1192] improvement(hdfs): Add RSS_SECURITY_HADOOP_KERBEROS_PROXY_USER_ENABLE conf for storing shuffle data (#1194)
[MINOR] refactor: Rename MultiStorage to HybridStorage (#1191)
[MINOR] Remove extra directory (#1190)
[#1178] improvement: set rss.coordinator.quota.default.app.num default -1 to indicate no quota check (#1186)
[#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is hard-coded to kube-system. (#1183)
[#1175] fix(netty): Retry failed with StacklessClosedChannelException after channel closed (#1181)
[#1177] improvement: Reduce the write time of tasks (#1179)
[MINOR] docs: Fix spark.serializer in README and client_guide (#1180)
