Metadata scan improvements

High-level Feature Goals

Improve user first impressions by revising the import-media experience.
The import flow should make more sense and be less annoying.

Behaviour

ID
Priority
Cost
Title
Story
bug 10206 P1
2
Unify scan dialogs
Provide users with a unified progress dialog for "media scan" and "metadata reader"
bug 10207 P1
0
Scan performance
Reduce the time it takes to scan for media and read metadata
bug 10209 P2
1
Improve scan feedback
Provide additional UI feedback by including:
  • What has been discovered
  • What has been added
  • How many duplicates have been found

NOTE: Needs more definition.  How will this information be displayed?

bug 10210 P1
0
Integrate media scan with new first-run
Media scan should integrate nicely with the new First Run experience
bug 10211 P2
0
Show target on scan completion
End by landing users in the appropriate window in the application
  • If importing media to a playlist, open the playlist
  • If importing media to the library, open the library
bug 10213 P2
1
Add ability to re-scan metadata for an item
Add playlist context menu commands to re-scan the metadata in an item, and possibly add a new option to the metadata editor dialog.

 

Production Notes

  • Metadata scans will no longer resume if the app is terminated before completion.
  • UI design needs to be finalized

 

Engineering Notes

Profiling shows the following hotspots, listed in order of severity:

  • Proxied charset conversion, charset conversion in general
  • Adding metadata to mediaitems (DB time and notifying listeners)
  • Job tracking database create/update/get
  • Proxied charset guessing
  • Proxied file size lookup
  • Hash creation
  • Item creation (DB and notification)

General Problems:

  • Filescan thread shutdown
  • Filescan thread safety
  • Hashing is not cancelable
  • Hashing should implement sbIJobProgress
  • Filescan should implement sbIJobProgress
  • sbIMetadataHandler interface is awkward
  • sbMetadataJob is poorly implemented
  • May need to prime library indices on startup
  • bug 10009 Network Scan for Media 20K files
  • bug 4586 "vacuum" the metadatajob database on job completion (no database, no problem)
  • bug 8594 [perf] Using a static analyze statistics table

Plan:

  • Experiment with moving all scanning to the main thread in order to avoid proxy problems.  If that fails, do reading on the worker thread and processing on the main thread.
  • Cache item->list membership to avoid notification overhead.
  • Get rid of the job tracking database and just keep an array of items.  Product doesn't care whether jobs are resumed on restart.
  • Attempt to improve hash creation time by reading less data.
  • Tune web metadata scan variables (connections, timeout, polling interval)
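The "read less data" idea for hash creation can be sketched roughly as follows. This is a minimal illustration in Python, not the application's implementation: the helper name `partial_file_hash`, the choice of SHA-256, and the chunk size are all assumptions made for the example. The 4k default mirrors the "small hash" configuration measured in the test results.

```python
import hashlib


def partial_file_hash(path, max_bytes=4 * 1024, chunk_size=64 * 1024):
    """Hash at most max_bytes from the start of a file (hypothetical helper)."""
    digest = hashlib.sha256()  # assumption: the real hash function may differ
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                break  # file is shorter than max_bytes
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

A smaller limit means less disk I/O per file, at the price that two different files sharing the same leading bytes produce the same hash.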

Progress will be measured by the wall-clock time to import 2000 files, and will probably be tracked here.

Initial Test Results:

The following results were captured as part of bug 10214, using the hack/prototype developed in bug 10215.


Configuration | File Scan | Hashing Import | Metadata Scan | Total
0.7 Prototype - Cached Files | 1 | 9 | 11 | 22
0.7 Prototype - No Cached Files | 1 | 30 | 31 | 63
0.6 - Cached Files | 7 | 9 | 37 | 55
0.6 - No Cached Files | 7 | 30 | 64 | 102
iTunes - No Cached Files | 1 | N/A | 29 | 30
0.7 Prototype - No Cached Files - Small hash (4k rather than 100k) | 1 | 21 | 31 | 54
0.7 Prototype - Cached Files - Small hash | 1 | 7 | 11 | 19
0.7 Prototype - No Cached Files - Small hash - Full file metadata scan (rather than 100k max) | 1 | 22 | 36 | 60
0.7 Prototype - No Cached Files - Small hash - Metadata scan length = buffer = 16k | 1 | 22 | 30 | 54
0.7 Prototype - No Cached Files - File size instead of hash for deduping | 1 | 6 | 29 | 37

Rewriting the metadata job and item notification is likely to result in a 1.6x improvement for uncached file import time.  From there, our only option for further improvement is to mitigate the import hashing cost.  

Completing Phase 1 will give us a substantial boost, but Phase 2 is required for iTunes parity.
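The 1.6x figure appears to correspond to the "No Cached Files" totals in the results above (0.6 versus the 0.7 prototype). The short Python sketch below redoes that arithmetic and shows how much of the remaining gap to iTunes is left after each step. The constant names are made up for the example; the numbers are copied from the results, which do not state a unit, so only ratios are computed.

```python
# Totals copied from the "No Cached Files" rows of the Initial Test Results.
TOTAL_0_6 = 102
TOTAL_0_7_PROTOTYPE = 63
TOTAL_0_7_FILE_SIZE_DEDUPE = 37
TOTAL_ITUNES = 30


def speedup(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return before / after


# Rewritten metadata job and notification (0.6 -> 0.7 prototype).
print(round(speedup(TOTAL_0_6, TOTAL_0_7_PROTOTYPE), 2))  # prints 1.62
# Additional gain from deduping by file size instead of hash.
print(round(speedup(TOTAL_0_7_PROTOTYPE, TOTAL_0_7_FILE_SIZE_DEDUPE), 2))  # prints 1.7
# Remaining gap to iTunes after both changes.
print(round(speedup(TOTAL_0_7_FILE_SIZE_DEDUPE, TOTAL_ITUNES), 2))  # prints 1.23
```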

Tasks

Phase 1

ID
Priority
Cost
Title
Description

Setup

bug 10214 P1
1
Track scan time
Add code to record scan size and duration.  Use dtrace or the new timing service.
bug 10215 P1
1
Prototype new metadata scanner
Prototype a simplified, main-thread-only scanner by hacking up the existing code
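Recording scan size and duration could look something like the sketch below. It is a stand-in written in Python with `time.perf_counter`, and it uses neither dtrace nor the timing service mentioned above. The `ScanTimer` class and `timed_scan` function are hypothetical names introduced only for this illustration.

```python
import time


class ScanTimer:
    """Records how many items a scan handled and how long it took (hypothetical)."""

    def __init__(self, label):
        self.label = label
        self.item_count = 0
        self.elapsed_seconds = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed_seconds = time.perf_counter() - self._start
        return False  # never swallow exceptions raised by the scan

    def add_items(self, count=1):
        self.item_count += count


def timed_scan(paths):
    """Pretend to scan each path while recording size and duration."""
    with ScanTimer("file scan") as timer:
        for _path in paths:
            timer.add_items()  # a real scan would do its work here
    return timer
```

Calling `timed_scan(paths)` returns a timer whose `item_count` and `elapsed_seconds` can be logged after each import and compared across builds.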

Metadata Component Rewrite
We'll keep the existing model for now, and ignore the need to refactor sbIMetadataHandler and support album art or other secondary scanners.

bug 10216 P1
3
Reimplement MetadataJob
New MetadataJob implementation (array based, main thread only)
bug 10217 P1
2
Reimplement MetadataJobManager
New MetadataJobManager (No job table, no startup code, listen for completion and drop references)
bug 10218 P1
1
Experiment with TagLib reading
It may be faster to use a buffered input stream for local files when running on the main thread.
bug 10219 P1
2
Track media items that cause the scanner to crash
We annotate crash reports with the file that is currently being processed by the metadata system. We need to use this information to avoid re-scanning files that have previously caused crashes.  Read crash logs and keep a blacklist?
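One possible shape for the blacklist idea is sketched below in Python. It assumes crash reports have already been parsed into dictionaries and that the annotated file path lives under a key called `metadata_file`. Both the record format and the key name are hypothetical, since the actual annotation format is not described here.

```python
def build_blacklist(crash_reports, annotation_key="metadata_file"):
    """Collect file paths that were being processed when a crash occurred."""
    return {
        report[annotation_key]
        for report in crash_reports
        if report.get(annotation_key)
    }


def files_to_scan(candidates, blacklist):
    """Keep scan order, but skip files that previously crashed the scanner."""
    return [path for path in candidates if path not in blacklist]


reports = [
    {"metadata_file": "/music/broken.mp3"},
    {},  # crash unrelated to metadata scanning
    {"metadata_file": "/music/broken.mp3"},  # same file crashed twice
]
blacklist = build_blacklist(reports)
print(files_to_scan(["/music/ok.mp3", "/music/broken.mp3"], blacklist))
# prints ['/music/ok.mp3']
```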

File Scan Cleanup

bug 10222 P1
1
Make minimal changes to the FileScan component. Fix thread safety and shutdown, implement sbIJobProgress.
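The combination of clean shutdown, thread safety, and progress reporting can be illustrated with a small Python sketch. It does not reproduce the sbIJobProgress interface. The `FileScanJob` class, its method names, and the listener signature are assumptions made up for the example.

```python
import threading


class FileScanJob:
    """Hypothetical scan job: reports progress and can be cancelled from any thread."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._lock = threading.Lock()  # guards progress and listeners
        self._cancelled = threading.Event()
        self._progress = 0
        self._listeners = []

    @property
    def total(self):
        return len(self._paths)

    @property
    def progress(self):
        with self._lock:
            return self._progress

    def add_listener(self, callback):
        with self._lock:
            self._listeners.append(callback)

    def cancel(self):
        self._cancelled.set()

    def run(self):
        for _path in self._paths:
            if self._cancelled.is_set():
                break  # clean shutdown: stop between items
            with self._lock:
                self._progress += 1
                listeners = list(self._listeners)
            for callback in listeners:  # notify outside the lock
                callback(self)


job = FileScanJob(["a.mp3", "b.mp3", "c.mp3", "d.mp3"])
job.add_listener(lambda j: j.cancel() if j.progress >= 2 else None)
worker = threading.Thread(target=job.run)
worker.start()
worker.join()
print(job.progress, job.total)  # prints "2 4"
```

Checking the cancel flag between items keeps shutdown prompt without interrupting a file mid-read, and calling listeners outside the lock avoids deadlocks when a listener reads the job's state.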

UI Improvements
bug 10223 P1
1
Update DropHelper.jsm for new API
Modify DropHelper.jsm to use the new unified scan

Phase 2

ID
Priority
Cost
Title
Description

Item.setProperties Performance

bug 10224 P2
1
Test setProperties performance
Confirm performance gain by suppressing notification on setProperties.
bug 10225 P2
2
Fix sbLocalDatabaseLibrary::NotifyListenersItemUpdated
sbLocalDatabaseLibrary::GetContainingLists is flawed because it only works for simple media lists, and unsuitable because all we really care about is instantiated media lists.  NotifyListenersItemUpdated should instead look through a list of instantiated lists (similar to mMediaItemTable) and notify any that a) have listeners and b) return true for list.contains(item).
bug 10226 P2
2
Optimize sbLocalDatabaseSimpleMediaList::Contains
Currently Contains runs a query to see if the item is in the list. It should be significantly faster to leverage sbLocalDatabaseGUIDArray::GetFirstIndexByGuid and mGuidToFirstIndexMap
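The two notification changes above can be pictured together with a short Python sketch. It models the idea only and is not the C++ implementation: `SimpleMediaList`, `Library`, and their members are hypothetical stand-ins, with a plain dictionary playing the role of a GUID-to-first-index map.

```python
class SimpleMediaList:
    """Hypothetical stand-in for an instantiated media list."""

    def __init__(self, name, guids):
        self.name = name
        self.listeners = []
        # Map each GUID to its first index so contains() needs no query.
        self._guid_to_first_index = {}
        for index, guid in enumerate(guids):
            self._guid_to_first_index.setdefault(guid, index)

    def contains(self, guid):
        return guid in self._guid_to_first_index


class Library:
    """Hypothetical library that tracks only instantiated lists."""

    def __init__(self):
        self.instantiated_lists = []

    def notify_item_updated(self, guid):
        """Notify only lists that have listeners and contain the item."""
        notified = []
        for media_list in self.instantiated_lists:
            if media_list.listeners and media_list.contains(guid):
                for listener in media_list.listeners:
                    listener(media_list, guid)
                notified.append(media_list.name)
        return notified


library = Library()
watched = SimpleMediaList("watched", ["g1", "g2", "g1"])
watched.listeners.append(lambda media_list, guid: None)
unwatched = SimpleMediaList("unwatched", ["g1"])  # contains g1, no listeners
other = SimpleMediaList("other", ["g3"])  # has a listener, lacks g1
other.listeners.append(lambda media_list, guid: None)
library.instantiated_lists = [watched, unwatched, other]
print(library.notify_item_updated("g1"))  # prints ['watched']
```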

Improve Item Creation

bug 10227 P2
2
Make item creation cancelable
Make library batch create hashing cancelable, and attempt to tune performance
bug 10228 P2
2
Item creation should return a job progress interface
Make library batch create hashing implement job progress
bug 10527 P2
3
Mitigate hash creation cost

Hashing is completely disk bound, and is the highest priority after metadata scanning has been improved.

We could make hashing/dupe removal opt-in, but that could break MTP devices.

We could also use file size for dupe checking, with hashing as a fallback for collisions.  Doing this would cut our total import time roughly in half, but would require almost completely rewriting batchCreateMediaItem.
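The file-size-first approach can be sketched as below. This is a simplified Python illustration under several assumptions: it only compares files within a single batch (a real import would also compare against items already in the library), it uses SHA-256 over the whole file as the fallback hash, and the function names are invented for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=64 * 1024):
    """Full-file hash, used only when two files have the same size."""
    digest = hashlib.sha256()  # assumption: the real hash function may differ
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return (unique, duplicates), hashing only files whose size collides."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    unique, duplicates = [], []
    for same_size in by_size.values():
        if len(same_size) == 1:
            unique.append(same_size[0])  # size is unique: no hash needed
            continue
        seen_hashes = set()
        for path in same_size:
            content_hash = file_hash(path)
            if content_hash in seen_hashes:
                duplicates.append(path)
            else:
                seen_hashes.add(content_hash)
                unique.append(path)
    return unique, duplicates
```

Because most media files in a typical import differ in size, the expensive disk-bound hash would run only for the small number of size collisions.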

Tuning

bug 10229 P2
1
Optimize web scan performance
Experiment and find optimal constants for web media scanning (Number of connections, etc.)
bug 10231 P2
1
Optimize library db
Experiment with main library ANALYZE queries. Determine an optimal value for property cache analyze counter.
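An experiment along these lines could start from something like the following Python sketch, which uses the standard `sqlite3` module. It assumes the "analyze counter" means a number of writes after which ANALYZE is re-run. That interpretation, the `AnalyzeCounter` class, and the toy `properties` schema are all assumptions for illustration and do not reflect the real library database.

```python
import sqlite3


class AnalyzeCounter:
    """Runs ANALYZE after every `threshold` writes (hypothetical policy)."""

    def __init__(self, connection, threshold):
        self.connection = connection
        self.threshold = threshold
        self.writes_since_analyze = 0
        self.analyze_runs = 0

    def record_write(self):
        self.writes_since_analyze += 1
        if self.writes_since_analyze >= self.threshold:
            self.connection.execute("ANALYZE")  # refresh planner statistics
            self.writes_since_analyze = 0
            self.analyze_runs += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (item_id INTEGER, name TEXT, value TEXT)")
conn.execute("CREATE INDEX idx_properties_name ON properties (name)")

counter = AnalyzeCounter(conn, threshold=100)
for i in range(250):
    conn.execute("INSERT INTO properties VALUES (?, ?, ?)", (i, "artist", str(i)))
    counter.record_write()

print(counter.analyze_runs)  # prints 2
# ANALYZE stores its results in sqlite_stat1, which can be inspected directly.
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
```

Varying `threshold` while timing a fixed import would show where the cost of running ANALYZE starts to outweigh the benefit of fresher statistics.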

 

 
