Patent
Amortizing costs of shared scans
العنوان: | Amortizing costs of shared scans |
---|---|
Patent Number: | 8,484,649 |
تاريخ النشر: | July 09, 2013 |
Appl. No: | 12/984909 |
Application Filed: | January 05, 2011 |
مستخلص: | Techniques for scheduling a plurality of jobs sharing input are provided. The techniques include partitioning one or more input datasets into multiple subcomponents, analyzing a plurality of jobs to determine which of the plurality of jobs require scanning of one or more common subcomponents of the one or more input datasets, and scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets, facilitating a single scanning of the one or more common subcomponents to be used as input by each of the plurality of jobs. |
Inventors: | Hildrum, Kirsten W. (Hawthorne, NY, US); Khandekar, Rohit M. (Elmsford, NY, US); Parekh, Sujay S. (Dobbs Ferry, NY, US); Rajan, Deepak (Fishkill, NY, US); Wolf, Joel L. (Katonah, NY, US); Wu, Kun-Lung (Yorktown Heights, NY, US) |
Assignees: | International Business Machines Corporation (Armonk, NY, US) |
Claim: | 1. A method for scheduling a plurality of jobs sharing input, wherein the method comprises: partitioning one or more input datasets into multiple subcomponents; analyzing a plurality of jobs to determine which of the plurality of jobs require scanning of one or more common subcomponents of the one or more input datasets; and scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets, facilitating a single scanning of the one or more common subcomponents to be used as input by each of the plurality of jobs, wherein said scheduling comprises scheduling one or more of the plurality of jobs to share as input data a previously scanned common subcomponent scheduled for a previously scheduled job. |
Claim: | 2. The method of claim 1 , further comprising receiving a new job that requires scanning of an input dataset, wherein the input dataset comprises one or more subcomponents. |
Claim: | 3. The method of claim 1 , wherein scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets comprises scheduling a plurality of jobs, wherein the plurality of jobs is not received simultaneously. |
Claim: | 4. The method of claim 3 , wherein scheduling a plurality of jobs when the plurality of jobs are not received simultaneously comprises facilitating a newly received job to share results of a scanning of common subcomponents with one or more previously received jobs to an extent that there is a temporal overlap. |
Claim: | 5. The method of claim 1 , wherein scheduling the plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets comprises scheduling the plurality of jobs on a plurality of processors to optimize a metric. |
Claim: | 6. The method of claim 5 , wherein the metric is chosen by a user. |
Claim: | 7. The method of claim 1 , wherein the plurality of jobs is subject to one or more fairness constraints in a resource-constrained environment. |
Claim: | 8. The method of claim 1 , wherein the plurality of jobs comprise a plurality of jobs in a MapReduce environment. |
Claim: | 9. The method of claim 1 , further comprising scanning one or more subcomponents of the one or more input datasets that are not in a common dataset at a time subsequent to scanning of the one or more common subcomponents. |
Claim: | 10. The method of claim 1 , further comprising portioning the plurality of jobs into a plurality of sub-jobs, wherein one or more sub-jobs of a job sharing scanning of one or more subcomponents with one or more sub-jobs of one or more additional jobs can be scheduled, and remaining sub-jobs of the job wait to be scheduled. |
Claim: | 11. The method of claim 1 , further comprising providing a system, wherein the system comprises one or more distinct software modules, each of the one or more distinct software modules being embodied on a tangible computer-readable recordable storage medium, and wherein the one or more distinct software modules comprise a job manager module, an allocation layer module and an assignment layer module executing on a hardware processor. |
Claim: | 12. A computer program product comprising a tangible computer-readable recordable storage medium including computer-useable program code for scheduling a plurality of jobs sharing input, the computer program product including: computer-useable program code for partitioning one or more input datasets into multiple subcomponents; computer-useable program code for analyzing a plurality of jobs to determine which of the plurality of jobs require scanning of one or more common subcomponents of the one or more input datasets; and computer-useable program code for scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets, facilitating a single scanning of the one or more common subcomponents to be used as input by each of the plurality of jobs, wherein said scheduling comprises scheduling one or more of the plurality of jobs to share as input data a previously scanned common subcomponent scheduled for a previously scheduled job. |
Claim: | 13. The computer program product of claim 12 , wherein the computer-useable program code for scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets comprises computer useable program code for scheduling a plurality of jobs, wherein the plurality of jobs is not received simultaneously. |
Claim: | 14. The computer program product of claim 13 , wherein the computer-useable program code for scheduling a plurality of jobs when the plurality of jobs are not received simultaneously comprises computer useable program code for facilitating a newly received job to share results of a scanning of common subcomponents with one or more previously received jobs to an extent that there is a temporal overlap. |
Claim: | 15. The computer program product of claim 12 , wherein the plurality of jobs comprise a plurality of jobs in a MapReduce environment. |
Claim: | 16. The computer program product of claim 12 , further comprising computer-useable program code for portioning the plurality of jobs into a plurality of sub-jobs, wherein one or more sub-jobs of a job sharing scanning of one or more subcomponents with one or more sub-jobs of one or more additional jobs can be scheduled, and remaining sub-jobs of the job wait to be scheduled. |
Claim: | 17. A system for scheduling a plurality of jobs sharing input, comprising: a memory; and at least one processor coupled to the memory and operative to: partition one or more input datasets into multiple subcomponents; analyze a plurality of jobs to determine which of the plurality of jobs require scanning of one or more common subcomponents of the one or more input datasets; and schedule a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets, facilitating a single scanning of the one or more common subcomponents to be used as input by each of the plurality of jobs, wherein said scheduling comprises scheduling one or more of the plurality of jobs to share as input data a previously scanned common subcomponent scheduled for a previously scheduled job. |
Claim: | 18. The system of claim 17 , wherein the at least one processor coupled to the memory operative to schedule a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets is further operative to schedule a plurality of jobs, wherein the plurality of jobs is not received simultaneously. |
Claim: | 19. The system of claim 18 , wherein the at least one processor coupled to the memory operative to schedule a plurality of jobs when the plurality of jobs are not received simultaneously is further operative to facilitate a newly received job to share results of a scanning of common subcomponents with one or more previously received jobs to an extent that there is a temporal overlap. |
Claim: | 20. The system of claim 17 , wherein the plurality of jobs comprise a plurality of jobs in a MapReduce environment. |
Claim: | 21. The system of claim 17 , wherein the at least one processor coupled to the memory is further operative to portion the plurality of jobs into a plurality of sub-jobs, wherein one or more sub-jobs of a job sharing scanning of one or more subcomponents with one or more sub-jobs of one or more additional jobs can be scheduled, and remaining sub-jobs of the job wait to be scheduled. |
Claim: | 22. An apparatus for scheduling a plurality of jobs sharing input, the apparatus comprising: means for partitioning one or more input datasets into multiple subcomponents; means for analyzing a plurality of jobs to determine which of the plurality of jobs require scanning of one or more common subcomponents of the one or more input datasets; and means for scheduling a plurality of jobs that require scanning of one or more common subcomponents of the one or more input datasets, facilitating a single scanning of the one or more common subcomponents to be used as input by each of the plurality of jobs, wherein said scheduling comprises scheduling one or more of the plurality of jobs to share as input data a previously scanned common subcomponent scheduled for a previously scheduled job. |
Current U.S. Class: | 718/102 |
Patent References Cited: | 5692174 November 1997 Bireley et al. 5987477 November 1999 Schmuck et al. 7490089 February 2009 Georgiev 2003/0191795 October 2003 Bernardin et al. 2004/0133540 July 2004 Saake et al. 2006/0173851 August 2006 Singh et al. 2008/0027920 January 2008 Schipunov et al. 2008/0133474 June 2008 Hsiao et al. 2009/0216718 August 2009 Agrawal et al. 2010/0106697 April 2010 Enoki et al. 2010/0175049 July 2010 Ramsey et al. |
Other References: | Zaharia et al., “Job Scheduling for Multi-User MapReduce Clusters,” Technical Report EECS-2009-55, UC Berkeley, 2009. cited by applicant Wolf et al., “FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads,” Proceedings of Middleware, Bangalore, India 2010. cited by applicant Agarwal et al., “Scheduling Shared Scans of Large Data Files,” Proceedings of the VLDB Endowment, Auckland, New Zealand, 2008. cited by applicant |
Assistant Examiner: | Kessler, Gregory A |
Primary Examiner: | Puente, Emerson |
Attorney, Agent or Firm: | Ryan, Mason & Lewis, LLP |
رقم الانضمام: | edspgr.08484649 |
قاعدة البيانات: | USPTO Patent Grants |
الوصف غير متاح. |