Uniffle Release 0.9.0
Highlight
- Introduce dashboard.
- Introduce rust-based shuffle server.
- Add support for Spark 3.5.
- The data transportation Netty mode is production available.
- Reduce block id layout limitations and simplify layout configuration for Spark.
ChangeLog
- [#1751][0.9] improvement: support gluten (#1753)
- [#1764] fix(client): Fix timeout time unit for unregister requests (#1766)
- [#1149] fix: GC logs in JDK 11 do not include date and time stamps. (#1240)
- [#1675][FOLLOWUP] fix(test): Fix various flaky tests (#1730)
- [MINOR] fix: Update outdated config: rss.writer.send.check.timeout -> rss.client.send.check.timeout.ms (#1734)
- [#1721] fix(coordinator): classCastExpection of boolean->String with yaml style remote client conf (#1722)
- [#1673] fix(K8S): Fix the deployment of stable version K8S cluster (#1694)
- [#1675][FOLLOWUP] fix(test): Fix flaky tests which may cause port conflicts (#1696)
- [MINOR] fix(typo): Correct the removeShuffle method name (#1697)
- [MINOR] docs: modify the default value of
rss.coordinator.select.partition.strategy
in docs (#1692) - [#1680] improvement(server): Remove partial HDFS files that written by server self for expired apps (#1681)
- [#1675] fix(test): Fix tests which may be flaky on different machines (#1676)
- [#1684] fix(server): use the diskSize obtained from periodic check to determine whether is writable (#1685)
- [#1678] fix(server): disk size leak on removing resources by AppPurgeEvent (#1679) (#1689)
- [#1657] build: Add license information after version 0.9.0 (#1671)
- [MINOR] chore(rust): disable flaky test of local_store_test (#1674)
- [#1459][FOLLOWUP] fix(server): Fix the issue of log variable printing (#1672)
- [#1459][FOLLOWUP] improvement(server): Print an error log when an event is dropped (#1643)
- [#1341] fix(mr): Fix MR Combiner ArrayIndexOutOfBoundsException Bug. (#1666)
- [#378][FOLLOWUP] fix(server): Fix huge_partition_num metric (#1669)
- [#1662] fix(test): Fix Netty related flaky tests (#1663)
- [#1629] fix(operator): Support parsing NaN float value in metrics (#1630)
- [#1634] fix(server): remove app folder if app is expired (#1635)
- [MINOR] chore(rust): disable flaky test of test_ticket_manager (#1637)
- [#1596][FOLLOWUP] fix(netty): Send failed responses only when the channel is writable (#1641)
- [#1626] fix(server): Remove the meaningless eventOfUnderStorageManagers cache (#1627)
- [#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
- [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign (#1612)
- [#1608][part-2] fix(spark): avoid releasing block in advance when enable block resend (#1610)
- [#1606] feat(client): Add client retry mechanism for NO_BUFFER when reading data(memory/local/index) (#1616)
- [#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage (#1609)
- [#1373][FOLLOWUP] fix(spark): shuffle manager rpc service invalid when partition data reassign is enabled (#1583)
- [#1596] fix(netty): Use a ChannelFutureListener callback mechanism to release readMemory (#1605)
- [#1598] fix(server) Fix inaccurate used_direct_memory_size metric (#1599)
- [#1472][FOLLOWUP] improvement(server): Release memory more accurately when failing to cache shuffle data (#1597)
- [MINOR] refactor: Calling lock() method outside try block to avoid unnecessary errors (#1590)
- [#1591] feat(spark): Support Spark 3.5.1 (#1592)
- [#1586] improvement(netty): Allow Netty Worker thread pool size to dynamically adapt to the number of processor cores (#1587)
- [#1588] improvement(server): Add exception handling for the thread pool when flushing events (#1589)
- [#1576] feat(doc): server deploy guide without hadoop-home env (#1577)
- [#1571] fix(server): Memory may leak when
EventInvalidException
occurs (#1574) - [#1373][FOLLOWUP] fix(spark): incorrect partition id type (#1582)
- [#1373][FOLLOWUP] fix(spark3):Add client type when request shuffle assignment (#1580)
- build(deps): bump google.golang.org/protobuf from 1.28.0 to 1.33.0 (#1575)
- [#1554] feat(spark): Fetch dynamic client conf as early as possible (#1557)
- [#1572] fix(spark): Exceptions might be discarded when spilling buffers (#1573)
- [#1564] fix(server): disk health check invalid when hang (#1568)
- [#731][FOLLOWUP] feat(Spark): Configure blockIdLayout for Spark based on max partitions (#1566)
- [#1567] fix(spark): Let Spark use its own NettyUtils (#1565)
- [#1569] fix(rust): flaky test for test_ticket_manager (#1570)
- [MINOR] improvement(test): A better computation logic for WriteAndReadMetricsTest without using reflection (#1563)
- [#731] feat(spark): Make blockid layout configurable for Spark clients (#1528)
- [#808] improvement(spark): Verify the number of written records to ensure data correctness (#1558)
- [MINOR] improvement(client): Override getClientInfo method in ShuffleServerGrpcNettyClient and remove unused getDesc method (#1559)
- [#1552] improvement: Migrate from log4j1 to log4j2 (#1553)
- [#1472][part-6] FOLLOWUP: Fix Netty transport time when sending shuffle data requests (#1551)
- [#134][FOLLOWUP] improvement(spark2): Use taskId and attemptNo as taskAttemptId (#1544)
- [#1549] fix(common): Uniformly throw RssException for external callers (#1550)
- [MINOR] test: Use sensible partition ids in ShuffleReadClientImplTest (#1545)
- [#1546] fix(spark): NPE could happen before uncompressing after #1360 (#1547)
- feat(docker): Add example docker compose Uniffle/Spark cluster (#1532)
- [#1472][part-6] fix(netty): Make UTs truly test Netty mode (#1540)
- [MINOR] improvement(tez): Only invoking LOG.debug when LOG.isDebugEnabled is true (#1541)
- [#1459] fix(server): Memory leak for exceptional scenarios when flushing events (#1537)
- [#1472] fix(client): IlegalReferenceCountException for clientReadHandler.readShuffleData (#1536)
- [#1472][part-5] Use UnpooledByteBufAllocator to fix inaccurate usedMemory issue causing OOM (#1534)
- [MINOR] refactor(common): Move blockId bit logic into common class (#1527)
- [#1373][part-1] feat(spark): partition write to multi servers leveraging from reassignment mechanism (#1445)
- [MINOR] Update dashboard pom.xml to take arguments for node and npm download locations (#1530)
- [#1316] improvement(spark): detect OutputTracker API version via Spark version (#1317)
- [#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId (#1529)
- [MINOR] feat(build): Allow to build distribution without some modules (#1525)
- [#1407] fix(rust): use grpc runtime worker threads and adjust default runtime config (#1517)
- [#1407] feat(rust): fix + add total grpc request metrics (#1516)
- [#1407] chore(rust): add cpu profile doc (#1515)
- [#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it (#1521)
- [MINOR] fix(CI): Improve dashboard across the CI (#1526)
- [#1472][part-3] fix(client): Fix occasional IllegalReferenceCountException issues in extremely rare scenarios (#1522)
- [MINOR] fix(pom): Add missing shuffle-server dependencies to work with -Ptez
- [#1472][part-4] feature(server): Add metrics for Netty's pinnedDirectMemory and usedDirectMemory (#1524)
- [#1472][part-1] fix(server): Upgrade Netty and GRPC (#1520)
- [MINOR] fix(deploy): Fix invocation of kubernetes bash scripts (#1513)
- [#1476] feat(rust): Provide dedicated unregister app rpc interface (#1511)
- [#1476] feat(spark): Provide dedicated unregister app rpc interface (#1510)
- [MINOR] improvement(CI): Rework build and rust workflow events (#1508)
- [#1407] fix(rust): drop events and release memory when errors happened (#1509)
- [#1267][FOLLOWUP] improvement(client): INFO log level should be used in RetryUtils (#1500)
- [MINOR] feat(CI): Report test results in github comments (#1506)
- [#1407] fix(rust): return error when getting data from hdfs by client (#1507)
- [#1501] fix(server): storage selection cache accidentally deleted when clearing stage level data. (#1505)
- [#1407] fix(rust): dont panic when no available local disks (#1504)
- fix(rust): avoid checking storage type in runtime (#1503)
- [MINOR] build: Move dashboard module into profile and disable it by default (#1498)
- [#1497] improvement(spark): flushing buffer if the memoryUsed of the first record of
WriterBuffer
larger than bufferSize (#1485) - [MINOR] improvement(test): Identify duplicate blocks in TestUtils.validateResult (#1495)
- [MINOR] fix: Get and increment ATOMIC_LONG in that order everywhere (#1496)
- [MINOR] docs: Improve comment on blockId structure (#1492)
- [MINOR] fix(server): Assert actual number of bitmaps matches bitNum (#1493)
- [#1490] improvement(spark3): Disable dynamic allocation shuffle tracking by default (#1491)
- [#1407] feat(rust): support more metrics about disk and topN data size (#1488)
- [#1407] feat(rust): support multiple spill policies and simplify hdfs config (#1487)
- [#1356] feat(server): improve expired buffers metric and log (#1469)
- [#1464][FOLLOWUP] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1473)
- [#1467] feat(server): introduce total hdfs write data size for huge partition (#1468)
- [#1355] fix(client): Netty client will leak when decoding responses (#1455)
- [#1462] fix(server): Memory may leak when flushQueue is full (#1463)
- [#1466] feat(server): introduce the JvmPauseMonitor to detect the gc pause (#1470)
- [#1459] improvement(server): refactor DefaultFlushEventHandler and support event retry into pending queue (#1461)
- [#1464] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1465)
- [#1456] improvement(client): Better exception handling when calling requireBuffer using GRPC (#1457)
- [#1428] fix(server): fallback invalid when local storage can't write (#1429)
- [#1453] improvement: Force to use the UNIX line ending when using spotless-maven-plugin (#1454)
- [#1447] feat(client): Introduce configurations to control default behavior of RPC client (#1448)
- [#1267] improvement(client): throw the detailed stacktrace when exceptions happened (#1411)
- [#1189][FOLLOWUP] fix(server): Start NettyDirectMemoryTracker. (#1432)
- [#333] feat(server): expose metrics of TopN app bytes in one shuffle server (#1400)
- [#1433] fix(server): Race conditions with ShuffleServer state (#1434)
- [MINOR] refactor: avoid unnecessary bitmap clone and AND (#1442)
- [#532] fix: spotBugs of SC_START_IN_CTOR (#1440)
- [#1435] improvement: Improve log4j settings to avoid annoying messages (#1436)
- [MINOR] refactor: Avoid unnecessary recursion (#1441)
- [#1407] feat(rust): refactor localfile store to speed up writing (#1422)
- [#1416] feat(spark): support custom hadoop config in client side (#1417)
- [#1119] improvement(client): Explicitly throw
BUFFER_LIMIT_OF_HUGE_PARTITION
(#1425) - [#974] fix(coordinator): Dynamic remote storage conf invalid for
LegacyClientConfParser
(#1424) - [#1420] fix(client): reportShuffleWriteFailure failed because of IndexOutOfBoundsException (#1421)
- [#1356] improvement: add metric of total expired pre-allocated buffers (#1412)
- [#1414] feat(rust): introduce native hdfs client (#1415)
- [#1024] improvement(tez): Optimize user switch to shuffle mode local/remote. (#1397)
- [#1403] fix(client): RSS client configurations are not working. (#1404)
- [#1409] fix(client): Netty Epoll is unavailable for the RSS Client. (#1410)
- [#1407] improvement(rust): Critical bug fix of getting blockIds and some optimization (#1408)
- [#825][FOLLOWUP] fix(spark): Fix without returning an exception. (#1402)
- [#1385] improvement: Improve log4j appender layout pattern (#1386)
- [#851] improvement: Add a similar util method like ThreadUtils.parmap in the Spark (#1396)
- [#363] improvement(server): Make the coordinator client managed by CoordinatorClientFactory singleton (#1377)
- [#1391] fix(server): Direct memory may leak in exceptional scenarios in shuffle server. (#1392)
- [#1157] fix(tez): Container not exit because shuffle client is not closed
- [#460] improvement: Exit on OutOfMemoryError (#1390)
- [#1387] improvement: compatibility with jdk8 when call JavaUtils.newConcurrentMap (#1389)
- [#1369] feat: Provide distribution with Hadoop dependencies (#1379)
- [#1383][DOCS] Improve Netty's documentation (#1384)
- [#1358] fix(spark): pre-check bytebuffer whether is direct before uncompress (#1360)
- [#1364] feat(client): introduce option to control whether to use local hadoop conf (#1370)
- [MINOR] chore(client): fix the incorrect partitionId (#1376)
- [#1189] feat(server): Add netty used direct memory size metric (#1363)
- [#960] fix(dashboard): simplify dependency and correct the startup script (#1347)
- [#1348] improvement(metrics): Unify tags generation for shuffle-server metrics reporter (#1349)
- [MINOR] chore: fix kubernetes ci pipeline (#1368)
- [MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle (#1367)
- [#960][part-4] feat(dashboard): Fix some display bugs and optimize the display format. (#1326)
- [#1267] fix(client): fast fail without retry when oom occurs (#1344)
- [#1361] feat(netty): add netty metrics into reporter (#1362)
- [#1335] fix(server)(netty): release bytebuf explicitly when requiredId is expired or cache failed (#1357)
- [MINOR] chore(client): Specify name for data transfer thread pool (#1353)
- [#1319] fix(server): Add shaded com.google.guava:failureaccess dependency to prevent NoClassDefFoundError (#1352)
- [MINOR] improvement: use mvn wrapper in CI builds. (#1351)
- [#1191][FOLLOWUP] improvement(conf): use the unified name for hybrid storage in conf (#1350)
- [#960][FOLLOWUP] fix(dashboard): Fix get_pid_file_name function for the dashboard. (#1346)
- [MINOR] improvement: use mvn wrapper for builds (#1345)
- [#901] feat(server): respect disk capacity watermark rather than uniffle capacity (#1337)
- [#1342] improvement(server): dump appId when clearing resource fails (#1343)
- [#1110] improvement(coordinator): introduce pluggable remote storage config format (#1329)
- [#1330] improvement: optimize tips for checking replica settings (#1334)
- [#1187] feat(netty): Netty Encoder Support zero-copy. (#1313)
- [#960][part-3] feat(dashboard): Provides a start-stop script for the dashboard. (#1056)
- [#1308] improvement(rust): detect whether data has been purged in UT (#1323)
- [#1213] feat(rust): Support block filter by taskId when getting memory data (#1311)
- [#1290] improvement(operator): Avoid accidentally deleting data of other services when misconfiguring the mounting directory (#1291)
- [MINOR] fix: flaky test ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
- [MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
- [#960][part-2] feat(dashboard): Add a dashboard front-end module. (#1055)
- [#825][part-7] feat(spark): Write Stage resubmit and dynamic shuffle server assign integration tests. (#1148)
- [#1300] feat(mr): Support combine operation in map stage for mr engine. (#1301)
- [#1309] fix(spark): WriteBufferManager in Spark2 does not use a reassigned shuffle server. (#1310)
- [#1307] feat(rust): make each thread listen the socket to improve throughput in tonic (#1306)
- [#960][part-1] feat(dashboard): Add some dashboard interfaces. (#1053)
- [#825][part-6] feat(spark): Added logic that failed to send ShuffleServer. (#1147)
- [#1293] feat(rust): Add total_read_data metric (#1298)
- [#1094] docs: split client_guide.md (#1299)
- [#1221] feat(rust): Support grpc server graceful shutdown (#1292)
- [#1294] feat(rust): introduce the unified grpc latency metrics for all requests (#1295)
- [#1296] improvement(rust): use std.sync.lock to replace tokio lock for better performance (#1216)
- [#825][part-5] feat(spark): Adds the RPC interface to reassign the ShuffleServer list. (#1146)
- [MINOR] docs: update jar name for spark client (#1289)
- [MINOR] chore: add scripts for publishing tarballs to svn (#1284)
- [#1286] improvement(server): Add RemoveResourceTime Metric (#1288)
- [#1271] improvement(server): change transportTime and processTime summary to Thread Pool Instead of block (#1272)
- [#1269] fix(tez): uniqueMapId may be not unique when more than one fetcher are working. (#1270)
- [#1246] feat(tez): Support remote spill for unordered input. (#1250)
- [#825][part-4] feat(spark): Report write failures to ShuffleManager. (#1258)
- [MINOR] fix: missing to build spark shaded modules (#1282)
- [#1275] chore: add scripts for publishing maven releases (#1281)
- [#1274] feat: add shaded module for spark2 client (#1280)
- [#1273] feat: add shaded module for spark3 client (#1279)
- [#825][part-3] feat(spark): Get the ShuffleServer corresponding to the partition from ShuffleManager. (#1141)
- [#1277] chore: add flatten maven plugin (#1278)
- [#1252] fix(server): Incorrect storage write fail metric (#1253)
- [#825][FOLLOWUP] fix(spark): Apply a thread safety way to track the blocks sending result (#1260)
- [#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
- [#1261] fix(spark): Throw out InterruptedException for sleep in requestExecutorMemory #1262
- [#1256] refactor: optimize collections contruction (#1257)
- [#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
- [#825][part-2] feat(spark): Report failed blocks and a list of ShuffleServer. (#1138)
- [#244][FOLLOWUP] test: CoordinatorGrpcTest.rpcMetricsTest. (#1251)
- [#1231] feat(tez): Support remote spill in merge stage. (#1245)
- [#1243] fix(test): Fix the flaky test
SparkSQLTest
andRepartitionTest
(#1244) - [#1089] feat(spark): Add dynamic allocation patch for Spark 2.3 (#1242)
- [#1237] feat(rust): support populating args by clap (#1236)
- [#1088] feat(spark): Add dynamic allocation patch for Spark 3.0 (#1241)
- [#1234] improvement(rust): separate runtimes for different overload (#1233)
- [#1090] refactor: Refactor the reader code with builder pattern (#1232)
- [#1219] fix(test): Fix the flaky test
WriteAndReadMetricsTest
(#1235) - [#1206] chore(rust): ignore generated proto code in git (#1229)
- [#1091] refactor: Refactor the writer code with builder pattern (#1228)
- [MINOR] Fix kubernetest CI pipeline (#1227)
- [#802] feat(spark): Implement ShuffleDataIo (#1226)
- [#825][part-1] feat(spark): Add the RPC interface for reassigning ShuffleServer (#1137)
- [#1085] feat(spark): Add dynamic allocation patch for Spark 3.4 (#1225)
- [#1201] improvement: only invoking LOG.debug when LOG.isDebugEnabled() is true (#1217)
- [#1084] feat: Add dynamic allocation patch for Spark 3.3 (#1224)
- [#1083] feat(spark): Support Spark 3.5 (#1223)
- [#1211] fix(server): unexpectedly removing resources when app has re-registered shuffle later (#1212)
- [#1206] chore(rust): remove the auto-generated proto code (#1218)
- [#1209] improvement(server): Speed up cleanupStorageSelectionCache method in LocalStorageManager. (#1210)
- [#1206][part-2] feat(rust): introduce rust based shuffle-server (#1208)
- [#1206][part-1] feat(rust): create folder for rust-based shuffle server (#1207)
- [#1204] chores(ci): Fix the ci pipeline of Kubernetes #1205
- [#1202] improvement: Add HealthScriptChecker for execute special health check shell script (#1203)
- [#1198] improvement: zerocopy from Protobuf's ByteString to Netty's ByteBuf (#1199)
- [#1192] improvement(hdfs): Add
RSS_SECURITY_HADOOP_KERBEROS_PROXY_USER_ENABLE
conf for storing shuffle data (#1194) - [MINOR] refactor: Rename MultiStorage to HybridStorage (#1191)
- [MINOR] Remove extra directory (#1190)
- [#1178] improvement: set
rss.coordinator.quota.default.app.num
default -1 to indicate no quota check (#1186) - [#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is hard-coded to kube-system. (#1183)
- [#1175] fix(netty): Retry failed with StacklessClosedChannelException after channel closed (#1181)
- [#1177] improvement: Reduce the write time of tasks (#1179)
- [MINOR] docs: Fix spark.serializer in README and client_guide (#1180)